Tuesday, September 27, 2022

Troubleshooting Oracle RAC CRS/Cluster Startup Issues


We need to check which components have started and which are still pending, and based on that decide the approach to take.




Known Issues / Issues Faced:

1) ora.crsd remains in INTERMEDIATE, commands like crsctl hang, and the logical corruption check failed (Doc ID 2075966.1)

2) CRSD failing to start or in INTERMEDIATE state due to OLR corruption (Doc ID 2809968.1)

3) CRS is not starting up on the cluster node (Doc ID 2445492.1) -- missing execute permission on the <GRID_HOME>/srvm/mesg directory

4) CRS activeversion is different across nodes after patching/upgrade

5) Permissions on the voting/OCR disks were changed

6) The ohasd process was not started due to a missing /etc/systemd/system/ohasd.service

7) ASM startup issues caused by the following (see the quick discovery check after this list):

--> ASM disks were not shared between the nodes
--> the asm_diskstring parameter was somehow removed

8) ASM not started due to an AFD filter issue: failed to access AFD-labeled disks after reboot.
   This is documented in another blog post of mine:

https://abdul-hafeez-kalsekar.blogspot.com/2021/10/how-to-install-and-configure-asm-filter.html
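
A quick discovery check when the ASM disks or asm_diskstring are suspect. This is a minimal sketch: the Grid home path and the +ASM1 SID are assumptions, and the SQL part needs the ASM instance to be at least started (NOMOUNT); kfod works even with ASM down.

export ORACLE_HOME=/u01/app/19.0.0/grid      # assumed Grid home
export ORACLE_SID=+ASM1                      # assumed ASM SID on this node
$ORACLE_HOME/bin/kfod disks=all              # disks the Grid home can discover
$ORACLE_HOME/bin/sqlplus / as sysasm <<'EOF'
show parameter asm_diskstring
select path, header_status, mount_status from v$asm_disk;
EOF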







Solutions we can try, wherever applicable:

1) If the OHAS process is already started, try to start ASM manually, which should bring up CRS.
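
For example, a minimal sketch (the Grid home path and the +ASM1 SID are assumptions; run as the Grid owner, or use the -init resource as root):

export ORACLE_HOME=/u01/app/19.0.0/grid      # assumed Grid home
export ORACLE_SID=+ASM1                      # assumed ASM SID on this node
$ORACLE_HOME/bin/sqlplus / as sysasm <<'EOF'
startup
EOF
# or, as root, via the lower-stack resource:
# $ORACLE_HOME/bin/crsctl start res ora.asm -init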


2) Start the crsd process alone:

crsctl check crs
crsctl check cluster -all
crsctl start res ora.crsd -init
or
Execute /etc/init.d/init.crs start



3) Start ohasd manually:

nohup sh /etc/init.d/init.ohasd run &
ps -ef | grep -i d.bin
ps -ef | grep -i ohasd

From RHEL 7 onwards, the OS uses systemd rather than init.d to start/restart processes and run them as services.
[root@pmyws01 ~]# cat /etc/systemd/system/ohasd.service
[Unit]
Description=Oracle High Availability Services
After=syslog.target

[Service]
ExecStart=/etc/init.d/init.ohasd run >/dev/null 2>&1
Type=simple
Restart=always

[Install]
WantedBy=multi-user.target
[root@pmyws01 ~]# systemctl daemon-reload
[root@pmyws01 ~]# systemctl enable ohasd.service
[root@pmyws01 ~]# systemctl start ohasd.service
[root@pmyws01 ~]# systemctl status ohasd.service



4) Try running /u01/app/19.0.0/grid/crs/install/rootcrs.sh -postpatch as root, for example when the stack fails to come up after patching.
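
A quick sanity check of the patch/version state before and after -postpatch (same Grid home path assumed):

/u01/app/19.0.0/grid/bin/crsctl query crs activeversion -f     # active version and patch level of the cluster
/u01/app/19.0.0/grid/bin/crsctl query crs softwarepatch        # patch level of the local Grid home
/u01/app/19.0.0/grid/bin/kfod op=patches                       # patches known to the cluster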





Handy commands:

crsctl get cluster name
crsctl get cluster configuration
crsctl enable crs
crsctl disable crs
crsctl config crs
crsctl query css votedisk
crsctl start cluster -all | -n nodename
crsctl stop cluster -all | -n nodename
crsctl start cluster
crsctl check cluster -all | -n nodename
crsctl query crs activeversion
crsctl query crs activeversion -f
crsctl query crs softwareversion
crsctl query crs softwarepatch
crsctl start crs [-excl [-nocrs] [-cssonly]] | [-wait | -waithas | -nowait] | [-noautostart]
crsctl start res ora.crsd -init
crsctl start crs -excl -nocrs
crsctl start crs -wait
crsctl stop rollingpatch
$GRID_HOME/bin/kfod op=patches

/u01/grid/19.0.0.0/bin/srvctl status cha
crsctl modify resource ora.chad -attr "ENABLED=1" -unsupported      (to enable CHA if ora.chad is disabled)

grep "OCR MASTER" $GRID_HOME/log/<nodename>/crsd/crsd.log





Must Read: Top 5 Grid Infrastructure Startup Issues (Doc ID 1368382.1)

Issue #1: CRS-4639: Could not contact Oracle High Availability Services, ohasd.bin not running or ohasd.bin is running but no init.ohasd or other processes
Issue #2: CRS-4530: Communications failure contacting Cluster Synchronization Services daemon, ocssd.bin is not running
Issue #3: CRS-4535: Cannot communicate with Cluster Ready Services, crsd.bin is not running
Issue #4: Agent or mdnsd.bin, gpnpd.bin, gipcd.bin not running
Issue #5: ASM instance does not start, ora.asm is OFFLINE
 



Logs to Check:

1) Cluster alert log
2) Daemon logs
3) OS messages log
4) oraagent log

Most clusterware daemon/process pid and output files are in <ORACLE_BASE>/crsdata/<node>/output

Most clusterware daemon/process trace files are in <ORACLE_BASE>/diag/crs/<node>/crs/trace
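
For example, on 12c/19c (ORACLE_BASE, the node name, and Linux are assumptions here):

tail -100 $ORACLE_BASE/diag/crs/$(hostname -s)/crs/trace/alert.log    # cluster alert log
ls -ltr $ORACLE_BASE/diag/crs/$(hostname -s)/crs/trace | tail         # recently updated daemon trace files
tail -100 /var/log/messages                                           # OS messages log (Linux)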




References:

1) Troubleshooting CRSD Start up Issue (Doc ID 1323698.1)
2) RedHat 7 / Oracle Linux 7 + ORA 11g: ohasd fails to start (Doc ID 1959008.1)
3) Troubleshoot Grid Infrastructure Startup Issues (Doc ID 1050908.1)
4) How To Gather & Backup ASM/ACFS Metadata In A Formatted Manner version 10.1, 10.2, 11.1, 11.2, 12.1, 12.2, 18.x and 19.x (Doc ID 470211.1)






Case Study 1: OHASD does not start
 
As ohasd.bin is responsible for starting up all other clusterware processes, directly or indirectly, it needs to come up properly for the rest of the stack to start. If ohasd.bin is not up, checking its status reports CRS-4639 (Could not contact Oracle High Availability Services); if ohasd.bin is already up and another startup attempt is made, CRS-4640 is reported; and if it fails to start, the following is reported:
 
 
CRS-4124: Oracle High Availability Services startup failed.
CRS-4000: Command Start failed, or completed with errors.
 
 
Automatic ohasd.bin start up depends on the following:
 
1. OS is at appropriate run level:
 
The OS needs to be at the specified run level before CRS will try to start up.
 
To find out at which run level the clusterware needs to come up:
 
cat /etc/inittab|grep init.ohasd
h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
Note: Oracle Linux 6 (OL6) and Red Hat Linux 6 (RHEL6) have deprecated inittab; instead, init.ohasd is configured via upstart in /etc/init/oracle-ohasd.conf, but the process "/etc/init.d/init.ohasd run" should still be up. Oracle Linux 7 (and Red Hat Linux 7) uses systemd to manage start/stop of services (for example: /etc/systemd/system/oracle-ohasd.service).
 
The above example shows that CRS is supposed to run at run levels 3 and 5; note that, depending on the platform, CRS comes up at different run levels.
 
To find out current run level:
 
who -r
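
On OL7/RHEL7, where systemd manages ohasd (see the note above), a quick check that the service and the init.ohasd process are up (the service name is an assumption; it may be oracle-ohasd or ohasd depending on how it was configured):

systemctl status oracle-ohasd            # or: systemctl status ohasd
ps -ef | grep init.ohasd | grep -v grep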


OHASD.BIN will spawn four agents/monitors to start resources:
 
  oraagent: responsible for ora.asm, ora.evmd, ora.gipcd, ora.gpnpd, ora.mdnsd etc
  orarootagent: responsible for ora.crsd, ora.ctssd, ora.diskmon, ora.drivers.acfs etc
  cssdagent / cssdmonitor: responsible for ora.cssd(for ocssd.bin) and ora.cssdmonitor(for cssdmonitor itself)
 
If ohasd.bin cannot start any of the above agents properly, the clusterware will not come up to a healthy state.
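
A quick way to see whether those agents and the lower-stack resources are up (GRID_HOME is assumed to be set):

ps -ef | egrep 'oraagent|orarootagent|cssdagent|cssdmonitor' | grep -v grep
$GRID_HOME/bin/crsctl stat res -init -t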





Case Study 2: OLR is corrupted
 
In this case, the daemon log will show messages like the following (this is a case where ora.ctssd fails to start):
 
2012-07-22 00:15:16.565: [ default][1]clsvactversion:4: Retrieving Active Version from local storage.
2012-07-22 00:15:16.575: [    CTSS][1]clsctss_r_av3: Invalid active version [] retrieved from OLR. Returns [19].
2012-07-22 00:15:16.585: [    CTSS][1](:ctss_init16:): Error [19] retrieving active version. Returns [19].
2012-07-22 00:15:16.585: [    CTSS][1]ctss_main: CTSS init failed [19]
2012-07-22 00:15:16.585: [    CTSS][1]ctss_main: CTSS daemon aborting [19].
2012-07-22 00:15:16.585: [    CTSS][1]CTSS daemon aborting
 
 
 
The solution is to restore a good copy of the OLR; see note 1193643.1.
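
A minimal sketch of the restore, assuming an automatic OLR backup exists under $GRID_HOME/cdata/<hostname>/ (the backup file name below is illustrative; see note 1193643.1 for the full procedure); run as root:

$GRID_HOME/bin/crsctl stop crs -f
$GRID_HOME/bin/ocrconfig -local -showbackup                      # list manual OLR backups, if any
$GRID_HOME/bin/ocrconfig -local -restore $GRID_HOME/cdata/<hostname>/backup_<date>_<time>.olr
$GRID_HOME/bin/crsctl start crs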
 






B. Enable further tracing

Execute the following steps as root:

B1. List all crsd modules:

# $GRID_HOME/bin/crsctl lsmodules crs


B2. Find out the current trace level for all crsd modules - the output can be used to revert back to the original trace level once the issue is solved:

# $GRID_HOME/bin/crsctl get log crs all


B3. Set trace level to 5 for all modules:

# $GRID_HOME/bin/crsctl set log crs all:5



Alternatively, the trace level for each module can be set individually:


# $GRID_HOME/bin/crsctl set log crs CRSPE:5

Note: The module name is case sensitive


B4. Once the issue is solved, set the trace level back to the original value for all crs modules using the output from B2, for example:

# $GRID_HOME/bin/crsctl set log crs AGENT:1




Take pstack

When crsd.bin dumps, it stores a thread stack in $GRID_HOME/log/<nodename>/crsd/crsdOUT.log

Most of the time there may not be a time window to take a pstack, as crsd.bin aborts very quickly; but in case crsd.bin stays up for a short while, take a few pstacks at 30-second intervals as root:

Find out pid of crsd.bin: ps -ef| grep crsd.bin

Take pstack:

  AIX     : /bin/procstack <pid-of-crsd.bin>
  Linux   : /usr/bin/pstack <pid-of-crsd.bin>
  Solaris : /usr/bin/pstack <pid-of-crsd.bin>
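
For example, a small loop to capture a few stacks 30 seconds apart (Linux paths assumed; swap in /bin/procstack on AIX):

PID=$(pgrep -x crsd.bin)                     # pid of crsd.bin, if it is currently up
for i in 1 2 3; do
  /usr/bin/pstack "$PID" > /tmp/crsd_pstack_${i}.txt
  sleep 30
done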





Sunday, September 4, 2022

Starting Oracle Cluster Health Advisor (CHA) and CRSD manually



It is just that we sometimes come across situations where some RAC processes are down. Below are two such processes that can be started manually.



Starting CHA manually

chad stands for the Cluster Health Advisor (CHA) daemon, which is part of the Oracle Autonomous Health Framework (AHF).
It continuously monitors cluster nodes and Oracle RAC databases for performance and availability issues.

Oracle Cluster Health Advisor runs as a highly available cluster resource, ochad, on each node in the cluster. Each Oracle Cluster Health Advisor daemon (ochad) monitors the operating system on the cluster node and optionally, each Oracle Real Application Clusters (Oracle RAC) database instance on the node.

 The ochad daemon receives operating system metric data from the Cluster Health Monitor and gets Oracle RAC database instance metrics from a memory-mapped file. The daemon does not require a connection to each database instance. This data, along with the selected model, is used in the Health Prognostics Engine of Oracle Cluster Health Advisor for both the node and each monitored database instance in order to analyze their health multiple times a minute.
 

It is sometimes found that the CHA process is down and we have to start it manually:

crsctl stat res -t  
crsctl status res ora.chad
crsctl stat res ora.chad -t 
srvctl start cha
chactl status 
srvctl status cha 




Starting CRSD Manually:
 

$GRID_HOME/bin/crsctl stat res -t -init
$GRID_HOME/bin/crsctl start res ora.crsd -init



If the pid file does not exist, $GRID_HOME/log/$HOST/agent/ohasd/orarootagent_root/orarootagent_root.log will have messages similar to the following:
 
 
2010-02-14 17:40:57.927: [ora.crsd][1243486528] [check] PID FILE doesn't exist.
..
2010-02-14 17:41:57.927: [  clsdmt][1092499776]Creating PID [30269] file for home /ocw/grid host racnode1 bin crs to /ocw/grid/crs/init/
2010-02-14 17:41:57.927: [  clsdmt][1092499776]Error3 -2 writing PID [30269] to the file []
2010-02-14 17:41:57.927: [  clsdmt][1092499776]Failed to record pid for CRSD
2010-02-14 17:41:57.927: [  clsdmt][1092499776]Terminating process
2010-02-14 17:41:57.927: [ default][1092499776] CRSD exiting on stop request from clsdms_thdmai
 
The solution is to create a dummy pid file ($GRID_HOME/crs/init/$HOST.pid) manually as the grid user with the "touch" command and restart the resource ora.crsd.
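
For example, a minimal sketch (the host name is assumed to match the .pid file name pattern above):

# as the grid user
touch $GRID_HOME/crs/init/$(hostname -s).pid
# then, as root, restart the resource (stop may report it is already offline)
$GRID_HOME/bin/crsctl stop res ora.crsd -init
$GRID_HOME/bin/crsctl start res ora.crsd -init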



Reference:

Troubleshooting CRSD Start up Issue (Doc ID 1323698.1)