First check which components have started and which are still pending, then choose the approach accordingly.
Known Issues / Issues faced :
1) ora.crsd remains in INTERMEDIATE and commands like crsctl hang and Logical corruption check failed (Doc ID 2075966.1)
2) CRSD Failing to Start or in INTERMEDIATE State Due to OLR corruption. (Doc ID 2809968.1)
3) CRS is not starting up on the cluster node (Doc ID 2445492.1) -- Missing execute permission on <GRID_HOME>/srvm/mesg directory
4) CRS activeversion differs between the nodes after patching/upgrade
5) Permissions on the voting/OCR disks were changed
6) The ohasd process was not started because /etc/systemd/system/ohasd.service was missing
7) ASM startup issues caused by the following:
--> ASM disks were not shared between nodes
--> the asm_diskstring parameter had somehow been removed
8) ASM did not start due to an AFD filter issue: the AFD-labelled disks could not be accessed after reboot
This is documented in another of my blog posts:
https://abdul-hafeez-kalsekar.blogspot.com/2021/10/how-to-install-and-configure-asm-filter.html
Solutions we can try, wherever applicable:
1) If the ohasd process is already started, try starting ASM manually, which should bring up CRS
2) Start the crsd process alone:
crsctl check crs
crsctl check cluster -all
crsctl start res ora.crsd -init
or
Execute /etc/init.d/init.crs start
3) Start ohasd manually:
nohup sh /etc/init.d/init.ohasd run &
ps -ef | grep -i d.bin
ps -ef | grep -i ohasd
From RHEL 7 onwards, the OS uses systemd rather than init scripts to start or restart processes, and runs them as services.
[root@pmyws01 ~]# cat /etc/systemd/system/ohasd.service
[Unit]
Description=Oracle High Availability Services
After=syslog.target
[Service]
ExecStart=/etc/init.d/init.ohasd run >/dev/null 2>&1
Type=simple
Restart=always
[Install]
WantedBy=multi-user.target
[root@pmyws01 ~]# systemctl daemon-reload
[root@pmyws01 ~]# systemctl enable ohasd.service
[root@pmyws01 ~]# systemctl start ohasd.service
[root@pmyws01 ~]# systemctl status ohasd.service
4) Try running /u01/app/19.0.0/grid/crs/install/rootcrs.sh -postpatch
Handy commands
crsctl get cluster name
crsctl get cluster configuration
crsctl enable crs
crsctl disable crs
crsctl config crs
crsctl query css votedisk
crsctl start cluster -all | -n nodename
crsctl stop cluster -all | -n nodename
crsctl start cluster
crsctl check cluster -all | -n nodename
crsctl query crs activeversion
crsctl query crs activeversion -f
crsctl query crs softwareversion
crsctl query crs softwarepatch
crsctl start crs [-excl [-nocrs] [-cssonly]] | [-wait | -waithas | -nowait] | [-noautostart]
crsctl start res ora.crsd -init
crsctl start crs -excl -nocrs
crsctl start crs -wait
crsctl stop rollingpatch
$GRID_HOME/bin/kfod op=patches
/u01/grid/19.0.0.0/bin/srvctl status cha
crsctl modify resource ora.chad -attr "ENABLED=1" -unsupported (to enable cha if 'ora.chad' is disabled )
grep "OCR MASTER" $GRID_HOME/log/<nodename>/crsd/crsd.log
Must Read : Top 5 Grid Infrastructure Startup Issues (Doc ID 1368382.1)
Issue #1: CRS-4639: Could not contact Oracle High Availability Services, ohasd.bin not running or ohasd.bin is running but no init.ohasd or other processes
Issue #2: CRS-4530: Communications failure contacting Cluster Synchronization Services daemon, ocssd.bin is not running
Issue #3: CRS-4535: Cannot communicate with Cluster Ready Services, crsd.bin is not running
Issue #4: Agent or mdnsd.bin, gpnpd.bin, gipcd.bin not running
Issue #5: ASM instance does not start, ora.asm is OFFLINE
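Each of the first three error codes above points at one daemon. A minimal sketch of that lookup as a shell function follows; the function name crs_error_daemon is hypothetical (it is not an Oracle tool), and only the three codes named in the note are mapped.

```shell
# Map a CRS error code to the daemon that the note says is not running.
# crs_error_daemon is a hypothetical helper name, not part of Grid Infrastructure.
crs_error_daemon() {
    case "$1" in
        CRS-4639) echo "ohasd.bin" ;;   # Issue #1
        CRS-4530) echo "ocssd.bin" ;;   # Issue #2
        CRS-4535) echo "crsd.bin" ;;    # Issue #3
        *)        echo "unknown"; return 1 ;;
    esac
}
```

For example, crs_error_daemon CRS-4535 prints crsd.bin, which tells you which daemon log to open first.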
Logs to Check :
1) Cluster Alert Log
2) Daemon logs
3) OS messages log
4) oragent log
The PID and output files of most clusterware daemons/processes are in <ORACLE_BASE>/crsdata/<node>/output
The logs of most clusterware daemons/processes are in <ORACLE_BASE>/diag/crs/<node>/crs/trace
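The two path templates above can be filled in with a couple of small helpers. This is only a sketch: crs_output_dir and crs_trace_dir are hypothetical names, and you pass in your own ORACLE_BASE and node name.

```shell
# Build the directories from the templates above.
# Both function names are hypothetical helpers for illustration.
# $1 = ORACLE_BASE of the grid owner, $2 = node name
crs_output_dir() {
    printf '%s/crsdata/%s/output\n' "$1" "$2"
}

crs_trace_dir() {
    printf '%s/diag/crs/%s/crs/trace\n' "$1" "$2"
}
```

Usage, assuming the node name matches the short hostname on your system: ls -ltr "$(crs_trace_dir "$ORACLE_BASE" "$(hostname -s)")" | tail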
References :
1) Troubleshooting CRSD Start up Issue (Doc ID 1323698.1)
2) Red Hat 7/Oracle Linux 7 + ORA11g: ohasd fails to start (Doc ID 1959008.1)
3) Troubleshoot Grid Infrastructure Startup Issues (Doc ID 1050908.1)
4) How To Gather & Backup ASM/ACFS Metadata In A Formatted Manner version 10.1, 10.2, 11.1, 11.2, 12.1, 12.2, 18.x and 19.x (Doc ID 470211.1)
Case Study 1: OHASD does not start
Because ohasd.bin is responsible for starting all other clusterware processes, directly or indirectly, it must start properly for the rest of the stack to come up. If ohasd.bin is not up, checking its status reports CRS-4639 (Could not contact Oracle High Availability Services). If ohasd.bin is already up and another start is attempted, CRS-4640 is reported. If it fails to start, the following is reported:
CRS-4124: Oracle High Availability Services startup failed.
CRS-4000: Command Start failed, or completed with errors.
Automatic ohasd.bin start up depends on the following:
1. OS is at appropriate run level:
The OS needs to be at the specified run level before CRS will try to start up.
To find out at which run level the clusterware needs to come up:
cat /etc/inittab|grep init.ohasd
h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
Note: Oracle Linux 6 (OL6) and Red Hat Linux 6 (RHEL6) have deprecated inittab; init.ohasd is instead configured via upstart in /etc/init/oracle-ohasd.conf. The process "/etc/init.d/init.ohasd run" should still be up, however. Oracle Linux 7 (and Red Hat Linux 7) uses systemd to manage starting and stopping services (example: /etc/systemd/system/oracle-ohasd.service)
The example above shows that CRS is supposed to run at run levels 3 and 5; note that CRS comes up at different run levels depending on the platform.
To find out current run level:
who -r
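The comparison between the inittab entry and the current run level can be sketched as a small shell function. The name runlevel_in_inittab is hypothetical; the function only relies on the standard inittab layout (id:runlevels:action:process) shown in the example line above.

```shell
# Return 0 if run level $2 appears in the run-level field (field 2) of
# the inittab entry $1, otherwise return 1.
# runlevel_in_inittab is a hypothetical helper name.
runlevel_in_inittab() {
    [ -n "$2" ] || return 1                       # no run level given
    levels=$(printf '%s\n' "$1" | cut -d: -f2)    # e.g. "35"
    case "$levels" in
        *"$2"*) return 0 ;;
        *)      return 1 ;;
    esac
}
```

On Linux, where "who -r" prints the run level as its second field (an assumption; other platforms format this differently), it could be called as: runlevel_in_inittab "$(grep init.ohasd /etc/inittab)" "$(who -r | awk '{print $2}')" && echo "run level OK"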
OHASD.BIN will spawn four agents/monitors to start resources:
oraagent: responsible for ora.asm, ora.evmd, ora.gipcd, ora.gpnpd, ora.mdnsd etc
orarootagent: responsible for ora.crsd, ora.ctssd, ora.diskmon, ora.drivers.acfs etc
cssdagent / cssdmonitor: responsible for ora.cssd (for ocssd.bin) and ora.cssdmonitor (for cssdmonitor itself)
If ohasd.bin cannot start any of the above agents properly, the clusterware will not reach a healthy state.
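When an init resource is stuck OFFLINE, it helps to know which agent owns it so you open the right agent log. A minimal sketch of the mapping listed above; ohasd_agent_for is a hypothetical helper name, and it covers only the resources named in this post (the "etc" entries are not mapped).

```shell
# Print the ohasd agent responsible for an init resource, using only the
# resources listed in this post. ohasd_agent_for is a hypothetical name.
ohasd_agent_for() {
    case "$1" in
        ora.asm|ora.evmd|ora.gipcd|ora.gpnpd|ora.mdnsd)
            echo "oraagent" ;;
        ora.crsd|ora.ctssd|ora.diskmon|ora.drivers.acfs)
            echo "orarootagent" ;;
        ora.cssd|ora.cssdmonitor)
            echo "cssdagent/cssdmonitor" ;;
        *)
            echo "unknown"; return 1 ;;
    esac
}
```

For example, ohasd_agent_for ora.crsd prints orarootagent.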
Case Study 2: OLR is corrupted
In this case, the daemon log shows messages like the following (here, ora.ctssd fails to start):
2012-07-22 00:15:16.565: [ default][1]clsvactversion:4: Retrieving Active Version from local storage.
2012-07-22 00:15:16.575: [ CTSS][1]clsctss_r_av3: Invalid active version [] retrieved from OLR. Returns [19].
2012-07-22 00:15:16.585: [ CTSS][1](:ctss_init16:): Error [19] retrieving active version. Returns [19].
2012-07-22 00:15:16.585: [ CTSS][1]ctss_main: CTSS init failed [19]
2012-07-22 00:15:16.585: [ CTSS][1]ctss_main: CTSS daemon aborting [19].
2012-07-22 00:15:16.585: [ CTSS][1]CTSS daemon aborting
The solution is to restore a good copy of the OLR (see note 1193643.1).
Enable further tracing
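To check quickly whether a daemon log contains this signature, a grep on the message text is enough. The sketch below reads log text on stdin; olr_active_version_error is a hypothetical helper name, and the pattern is taken from the sample log lines above.

```shell
# Succeed (exit 0) if the log text on stdin contains the
# "Invalid active version ... retrieved from OLR" message quoted above.
# olr_active_version_error is a hypothetical helper name.
olr_active_version_error() {
    grep -q 'Invalid active version .* retrieved from OLR'
}
```

Usage: olr_active_version_error < /path/to/daemon.log && echo "possible OLR corruption" (substitute the path of the daemon log you are checking).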
Execute the following steps as root:
B1. List all crsd modules:
# $GRID_HOME/bin/crsctl lsmodules crs
B2. Find out the current trace level for all crsd modules. The output can be used to revert to the original trace levels once the issue is solved:
# $GRID_HOME/bin/crsctl get log crs all
B3. Set trace level to 5 for all modules:
# $GRID_HOME/bin/crsctl set log crs all:5
Alternatively, the trace level for each module can be set individually:
# $GRID_HOME/bin/crsctl set log crs CRSPE:5
Note: The module name is case sensitive
B4. Once the issue is solved, set the trace level back to its original value for each crs module, using the output from B2:
# $GRID_HOME/bin/crsctl set log crs AGENT:1
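Reverting many modules by hand is tedious, so the saved B2 output can be turned into the matching "set log" commands. This is a sketch under an assumption: it expects each saved line to look like "Get CRSD Module: AGENT  Log Level: 1", so check the format your version actually prints before relying on it. trace_restore_cmds is a hypothetical helper name, and it only prints the commands, it does not run them.

```shell
# Read saved "crsctl get log crs all" output on stdin and print one
# "crsctl set log crs MODULE:LEVEL" command per module.
# Assumes lines of the form: Get CRSD Module: AGENT  Log Level: 1
# trace_restore_cmds is a hypothetical helper name.
trace_restore_cmds() {
    awk '/Module:/ && /Log Level:/ {
        for (i = 1; i < NF; i++) {
            if ($i == "Module:") mod = $(i + 1)   # word after "Module:"
            if ($i == "Level:")  lvl = $(i + 1)   # word after "Level:"
        }
        print "crsctl set log crs " mod ":" lvl
    }'
}
```

Usage: save the B2 output to a file before raising the levels, then review the output of trace_restore_cmds < saved_levels.txt and run the printed commands as root.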
Take pstack
When crsd.bin dumps, it will store a thread stack in $GRID_HOME/log/<nodename>/crsd/crsdOUT.log
Most of the time there is no window in which to take a pstack, because crsd.bin aborts very quickly. If crsd.bin does stay up for a short while, take a few pstacks at 30-second intervals as root:
Find out pid of crsd.bin: ps -ef| grep crsd.bin
Take pstack:
AIX : /bin/procstack <pid-of-crsd.bin>
Linux : /usr/bin/pstack <pid-of-crsd.bin>
Solaris : /usr/bin/pstack <pid-of-crsd.bin>
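Taking several stacks at a fixed interval is easy to script. A minimal sketch follows; take_stacks is a hypothetical helper name, and the stack command defaults to /usr/bin/pstack (Linux/Solaris per the list above) but can be overridden, for example with /bin/procstack on AIX.

```shell
# take_stacks PID COUNT INTERVAL [STACK_CMD]
# Run STACK_CMD against PID, COUNT times, sleeping INTERVAL seconds
# between samples. take_stacks is a hypothetical helper name.
take_stacks() {
    pid=$1
    count=$2
    interval=$3
    cmd=${4:-/usr/bin/pstack}
    i=1
    while [ "$i" -le "$count" ]; do
        echo "=== sample $i of $count for pid $pid ==="
        "$cmd" "$pid"
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"       # no sleep after the last sample
        fi
        i=$((i + 1))
    done
}
```

Usage as root, with the pid found via ps -ef | grep crsd.bin: take_stacks <pid-of-crsd.bin> 3 30 > /tmp/crsd_pstack.txt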