We need to check which components are started and which are pending; based on that, we decide the approach.
Known Issues / Issues Faced:
1) ora.crsd remains in INTERMEDIATE and commands like crsctl hang and Logical corruption check failed (Doc ID 2075966.1)
2) CRSD Failing to Start or in INTERMEDIATE State Due to OLR corruption. (Doc ID 2809968.1)
3) CRS is not starting up on the cluster node (Doc ID 2445492.1) -- Missing execute permission on <GRID_HOME>/srvm/mesg directory
4) CRS activeversion is different on the nodes after patching/upgrade
5) Permissions on the Voting/OCR disks were changed
6) ohasd process was not started due to missing /etc/systemd/system/ohasd.service
7) ASM startup issues faced due to the below:
--> ASM disks were not shared between the nodes
--> asm_diskstring parameter was somehow removed
8) ASM not started due to an AFD filter issue: failed to access the AFD label disks after reboot.
This has been documented in another blog of mine:
https://abdul-hafeez-kalsekar.blogspot.com/2021/10/how-to-install-and-configure-asm-filter.html
Solutions we can try, wherever applicable:
1) If the ohasd process is already started, try to start ASM manually, which should bring up CRS.
2) Start the crsd process alone:
crsctl check crs
crsctl check cluster -all
crsctl start res ora.crsd -init
or
Execute /etc/init.d/init.crs start
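To script around this step, the resource state can be parsed out of `crsctl stat res ora.crsd -init` output, so a script can decide whether the restart actually worked. This is a minimal sketch, assuming the usual `STATE=<state> on <node>` line format:

```shell
# Sketch: extract the resource state from `crsctl stat res ora.crsd -init`
# output (lines like "STATE=INTERMEDIATE on node1"). The line format is an
# assumption based on typical crsctl output; adjust if your version differs.
res_state() {
    printf '%s\n' "$1" | awk -F'[= ]' '/^STATE=/ {print $2}'
}

# live usage:
# res_state "$(crsctl stat res ora.crsd -init)"
```

If the function returns INTERMEDIATE or OFFLINE after the start attempt, move on to the daemon trace files rather than retrying blindly.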
3) Start ohasd manually:
nohup sh /etc/init.d/init.ohasd run &
ps -ef | grep -i d.bin
ps -ef | grep -i ohasd
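The ps checks above can be wrapped into a small function that reports which clusterware daemons are present and which are missing. This is a sketch; the daemon list below is an assumption based on a typical 19c stack, so add or drop names for your environment:

```shell
# Sketch: report which clusterware daemons appear in the process list.
# Pass in the `ps -ef` output so the logic itself is testable.
check_daemons() {
    ps_output="$1"
    for d in ohasd.bin ocssd.bin crsd.bin evmd.bin gpnpd.bin gipcd.bin; do
        if printf '%s\n' "$ps_output" | grep -q "$d"; then
            echo "$d RUNNING"
        else
            echo "$d MISSING"
        fi
    done
}

# live usage:
# check_daemons "$(ps -ef)"
```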
From RHEL 7 onwards, the OS uses systemd rather than initd for starting and restarting processes, and runs them as a service.
[root@pmyws01 ~]# cat /etc/systemd/system/ohasd.service
[Unit]
Description=Oracle High Availability Services
After=syslog.target
[Service]
ExecStart=/etc/init.d/init.ohasd run >/dev/null 2>&1
Type=simple
Restart=always
[Install]
WantedBy=multi-user.target
[root@pmyws01 ~]# systemctl daemon-reload
[root@pmyws01 ~]# systemctl enable ohasd.service
[root@pmyws01 ~]# systemctl start ohasd.service
[root@pmyws01 ~]# systemctl status ohasd.service
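Rather than eyeballing the systemctl status output, the decision of whether the unit still needs enabling or starting can be derived from `systemctl show`. A minimal sketch, assuming the ohasd.service unit name created above:

```shell
# Sketch: decide what still needs doing from the output of
# `systemctl show -p UnitFileState -p ActiveState ohasd.service`.
# The property names are standard systemd; the unit name is the one
# created in the steps above.
needed_actions() {
    show_output="$1"
    case "$show_output" in
        *UnitFileState=enabled*) : ;;
        *) echo "run: systemctl enable ohasd.service" ;;
    esac
    case "$show_output" in
        *ActiveState=active*) : ;;
        *) echo "run: systemctl start ohasd.service" ;;
    esac
}

# live usage:
# needed_actions "$(systemctl show -p UnitFileState -p ActiveState ohasd.service)"
```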
4) Try running /u01/app/19.0.0/grid/crs/install/rootcrs.sh -postpatch
Handy commands
crsctl get cluster name
crsctl get cluster configuration
crsctl enable crs
crsctl disable crs
crsctl config crs
crsctl query css votedisk
crsctl start cluster -all | -n nodename
crsctl stop cluster -all | -n nodename
crsctl start cluster
crsctl check cluster -all | -n nodename
crsctl query crs activeversion
crsctl query crs activeversion -f
crsctl query crs softwareversion
crsctl query crs softwarepatch
crsctl start crs [-excl [-nocrs] [-cssonly]] | [-wait | -waithas | -nowait] | [-noautostart]
crsctl start res ora.crsd -init
crsctl start crs -excl -nocrs
crsctl start crs -wait
crsctl stop rollingpatch
$GRID_HOME/bin/kfod op=patches
/u01/grid/19.0.0.0/bin/srvctl status cha
crsctl modify resource ora.chad -attr "ENABLED=1" -unsupported (to enable CHA if ora.chad is disabled)
grep "OCR MASTER" $GRID_HOME/log/<nodename>/crsd/crsd.log
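As a quick triage on top of the handy commands above, the output of `crsctl check crs` can be reduced to a single number: how many of the stack's services report "is online". This is a sketch assuming the usual 19c message text (four services in a healthy stack):

```shell
# Sketch: count the "is online" lines in `crsctl check crs` output.
# 4 means a fully healthy stack on 19c (assumption: standard message text).
online_count() {
    printf '%s\n' "$1" | grep -c "is online"
}

# live usage:
# online_count "$(crsctl check crs)"
```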
Must Read : Top 5 Grid Infrastructure Startup Issues (Doc ID 1368382.1)
Issue #1: CRS-4639: Could not contact Oracle High Availability Services, ohasd.bin not running or ohasd.bin is running but no init.ohasd or other processes
Issue #2: CRS-4530: Communications failure contacting Cluster Synchronization Services daemon, ocssd.bin is not running
Issue #3: CRS-4535: Cannot communicate with Cluster Ready Services, crsd.bin is not running
Issue #4: Agent or mdnsd.bin, gpnpd.bin, gipcd.bin not running
Issue #5: ASM instance does not start, ora.asm is OFFLINE
Logs to Check:
1) Cluster alert log
2) Daemon logs
3) OS messages log
4) oraagent log
Most clusterware daemon/process pid and output files are in <ORACLE_BASE>/crsdata/<node>/output
Most clusterware daemon/process logs are in <ORACLE_BASE>/diag/crs/<node>/crs/trace
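During a startup failure, the trace files modified most recently are usually the ones that matter. A small sketch that lists traces touched in the last hour, assuming the default directory layout mentioned above:

```shell
# Sketch: list clusterware trace files modified within the last 60 minutes.
# Paths assume the <ORACLE_BASE>/diag/crs/<node>/crs/trace layout above.
recent_traces() {
    base="$1"   # e.g. /u01/app/grid
    node="$2"   # e.g. node1
    find "$base/diag/crs/$node/crs/trace" -name '*.trc' -mmin -60 2>/dev/null
}

# live usage:
# recent_traces /u01/app/grid "$(hostname -s)"
```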
References :
1) Troubleshooting CRSD Start up Issue (Doc ID 1323698.1)
2) RedHat 7/Oracle Linux 7 + ORA11g: ohasd fails to start (Doc ID 1959008.1)
3) Troubleshoot Grid Infrastructure Startup Issues (Doc ID 1050908.1)
4) How To Gather & Backup ASM/ACFS Metadata In A Formatted Manner version 10.1, 10.2, 11.1, 11.2, 12.1, 12.2, 18.x and 19.x (Doc ID 470211.1)
Case Study 1: OHASD does not start
Since ohasd.bin is responsible for starting all other clusterware processes, directly or indirectly, it must start properly for the rest of the stack to come up. If ohasd.bin is not up, checking its status reports CRS-4639 (Could not contact Oracle High Availability Services); if ohasd.bin is already up and another startup attempt is made, CRS-4640 is reported; and if it fails to start, the following is reported:
CRS-4124: Oracle High Availability Services startup failed.
CRS-4000: Command Start failed, or completed with errors.
Automatic ohasd.bin start up depends on the following:
1. OS is at the appropriate run level:
The OS needs to be at the specified run level before CRS will try to start up.
To find out at which run level the clusterware needs to come up:
cat /etc/inittab|grep init.ohasd
h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
Note: Oracle Linux 6 (OL6) and Red Hat Linux 6 (RHEL6) deprecated inittab; instead, init.ohasd is configured via upstart in /etc/init/oracle-ohasd.conf; however, the process "/etc/init.d/init.ohasd run" should still be up. Oracle Linux 7 (and Red Hat Linux 7) uses systemd to manage start/stop of services (example: /etc/systemd/system/oracle-ohasd.service).
The above example shows CRS is supposed to run at run levels 3 and 5; note that, depending on the platform, CRS comes up at different run levels.
To find out current run level:
who -r
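On inittab-based systems, the two checks above can be combined: compare the run-level field from the init.ohasd entry (the "35" in the example) against the current run level from `who -r`. A minimal sketch:

```shell
# Sketch: check whether the current run level is one of the levels
# init.ohasd is configured for (the "35" field in the inittab entry above).
runlevel_ok() {
    required="$1"   # e.g. "35" from the inittab entry
    current="$2"    # e.g. "3" from `who -r`
    case "$required" in
        *"$current"*) echo yes ;;
        *)            echo no ;;
    esac
}

# live usage (inittab systems only):
# runlevel_ok "$(awk -F: '/init.ohasd/ {print $2}' /etc/inittab)" \
#             "$(who -r | awk '{print $2}')"
```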
OHASD.BIN spawns four agents/monitors to start resources:
oraagent: responsible for ora.asm, ora.evmd, ora.gipcd, ora.gpnpd, ora.mdnsd etc
orarootagent: responsible for ora.crsd, ora.ctssd, ora.diskmon, ora.drivers.acfs etc
cssdagent / cssdmonitor: responsible for ora.cssd(for ocssd.bin) and ora.cssdmonitor(for cssdmonitor itself)
If ohasd.bin cannot start any of the above agents properly, the clusterware will not come up healthy.
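When a specific resource is stuck, the mapping above tells you which agent trace to read first. A sketch that encodes that mapping (taken directly from the list above):

```shell
# Sketch: map a resource name to the OHASD agent that starts it, so you
# know which agent trace file to read first. Mapping is the one listed above.
agent_for() {
    case "$1" in
        ora.asm|ora.evmd|ora.gipcd|ora.gpnpd|ora.mdnsd)  echo oraagent ;;
        ora.crsd|ora.ctssd|ora.diskmon|ora.drivers.acfs) echo orarootagent ;;
        ora.cssd|ora.cssdmonitor)                        echo cssdagent ;;
        *)                                               echo unknown ;;
    esac
}
```

For example, `agent_for ora.crsd` points you at the orarootagent trace.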
Case Study 2: OLR is corrupted
In this case, the daemon log will show messages like the following (here, ora.ctssd fails to start):
2012-07-22 00:15:16.565: [ default][1]clsvactversion:4: Retrieving Active Version from local storage.
2012-07-22 00:15:16.575: [ CTSS][1]clsctss_r_av3: Invalid active version [] retrieved from OLR. Returns [19].
2012-07-22 00:15:16.585: [ CTSS][1](:ctss_init16:): Error [19] retrieving active version. Returns [19].
2012-07-22 00:15:16.585: [ CTSS][1]ctss_main: CTSS init failed [19]
2012-07-22 00:15:16.585: [ CTSS][1]ctss_main: CTSS daemon aborting [19].
2012-07-22 00:15:16.585: [ CTSS][1]CTSS daemon aborting
The solution is to restore a good copy of the OLR from backup; see note 1193643.1.
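When scanning a daemon log like the excerpt above, the error code the daemon aborted with is the key detail. A small sketch that pulls it out, assuming the `aborting [nn]` pattern shown in the excerpt:

```shell
# Sketch: extract the error code from daemon-abort lines such as
# "ctss_main: CTSS daemon aborting [19]." (pattern from the excerpt above).
abort_code() {
    printf '%s\n' "$1" | sed -n 's/.*aborting \[\([0-9]*\)\].*/\1/p' | head -1
}
```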
Enable further tracing
Execute the following steps as root:
B1. List all crsd modules:
# $GRID_HOME/bin/crsctl lsmodules crs
B2. Find out the current trace level for all crsd modules - the output can be used to revert to the original trace level once the issue is solved:
# $GRID_HOME/bin/crsctl get log crs all
B3. Set trace level to 5 for all modules:
# $GRID_HOME/bin/crsctl set log crs all:5
Alternatively, the trace level for each module can be set individually:
# $GRID_HOME/bin/crsctl set log crs CRSPE:5
Note: The module name is case sensitive
B4. Once the issue is solved, set the trace level back to the original value for all crs modules using the output from B2, for example:
# $GRID_HOME/bin/crsctl set log crs AGENT:1
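Step B4 is easy to get wrong if many modules were raised to level 5. The saved B2 output can be turned back into the exact revert commands mechanically. This is a sketch; the `Module: NAME  Log Level: n` line format is an assumption, so adjust the awk fields to match your version's output:

```shell
# Sketch: turn saved `crsctl get log crs all` output (lines assumed to look
# like "Get CRSD Module: CRSPE  Log Level: 1") back into the matching
# `crsctl set log crs MODULE:level` revert commands.
revert_cmds() {
    printf '%s\n' "$1" | awk '/Log Level:/ {print "crsctl set log crs " $4 ":" $NF}'
}

# live usage: save B2 output first, then after debugging:
# revert_cmds "$saved_b2_output"
```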
Take pstack
When crsd.bin dumps, it stores a thread stack in $GRID_HOME/log/<nodename>/crsd/crsdOUT.log
Most of the time there is no window to take a pstack, as crsd.bin aborts very quickly; but if crsd.bin stays up for a short while, take a few pstacks at 30-second intervals as root:
Find out pid of crsd.bin: ps -ef| grep crsd.bin
Take pstack:
AIX : /bin/procstack <pid-of-crsd.bin>
Linux : /usr/bin/pstack <pid-of-crsd.bin>
Solaris : /usr/bin/pstack <pid-of-crsd.bin>
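The per-platform table above can be folded into one small helper so the same snapshot loop works everywhere. A sketch, using only the tool paths listed above:

```shell
# Sketch: pick the stack tool for the current platform (paths from the
# table above), then take a few snapshots 30 seconds apart.
stack_tool() {
    case "$1" in
        AIX)         echo /bin/procstack ;;
        Linux|SunOS) echo /usr/bin/pstack ;;
        *)           echo "" ;;
    esac
}

# live usage, as root, while crsd.bin is still up:
# tool=$(stack_tool "$(uname -s)")
# pid=$(pgrep -x crsd.bin)
# for i in 1 2 3; do "$tool" "$pid" > "/tmp/crsd_pstack_$i.out"; sleep 30; done
```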