We had an activity planned to change the Windows hostname and the SQL Server name.
DB Servers Setup Information:
- It was a 2-node Windows cluster (Node1 & Node2) with SQL Server Always On configured.
- 4 synchronous Always On availability groups (AG1, AG2, AG3, AG4) were set up, with Node1 acting as primary.
- Always On status was healthy and in sync, without any issues.
- Both nodes were on SQL Server 2017 RTM, version 14.0.1000.169.
As part of the activity, we needed to fail over all 4 AG groups from Node1 to Node2. This went smoothly for 3 of them (AG2, AG3, AG4). However, we were not able to fail over AG group AG1 from Node1 to Node2; multiple attempts failed. We could not find any errors in the SQL Server error logs or the event viewer. Due to the limited downtime window, we had to abandon the activity.
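For reference, a planned manual failover of an availability group is issued from the target secondary replica with T-SQL like the following (the AG name is taken from this setup; run while connected to Node2). This is roughly what the Failover Wizard executes, and it succeeds only when the secondary is synchronized:

ALTER AVAILABILITY GROUP [AG1] FAILOVER;
GO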
The focus then shifted to diagnosing the issue.
- We found the messages below in the cluster log. Note that XXX.XX.XX.XXX is the AG listener IP of AG group AG1:
[NETFTAPI] received NsiDeleteInstance for XXX.XX.XX.XXX
0000175c.00003b14::2021/04/14-03:07:50.565 WARN [NETFTAPI] Failed to query parameters for XXX.XX.XX.XXX (status 0x80070490)
0000175c.00003b14::2021/04/14-03:07:50.565 DBG [NETFTAPI] Signaled NetftLocalRemove event for XXX.XX.XX.XXX
0000175c.0000266c::2021/04/14-03:07:51.512 INFO [GEM] Node 2: Processing message as part of GemRepair message 1:37836 from node 1. Action: causal, Target: CAUS
0000175c.0000266c::2021/04/14-03:07:51.512 INFO [GUM] Node 2: Processing RequestLock 1:10588965
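To map a listener IP seen in the cluster log back to its availability group, a query along these lines can be used (a minimal sketch over the standard AG listener catalog views):

-- Listener DNS name and IP address(es) per availability group
SELECT ag.name AS ag_name, agl.dns_name, aglia.ip_address, aglia.state_desc
FROM sys.availability_groups ag
INNER JOIN sys.availability_group_listeners agl
    ON ag.group_id = agl.group_id
INNER JOIN sys.availability_group_listener_ip_addresses aglia
    ON agl.listener_id = aglia.listener_id
GO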
We involved the Windows and network teams as well; they could not find any issues either.
- We observed that AG group AG1 was part of a Distributed Availability Group (DAG). There was another 2-node Windows cluster (Node3 & Node4) with Always On groups configured; Node3 and Node4 hosted an AG group named AG5. The DAG was configured between AG5 (on Node3 & Node4) and AG1 (on Node1 & Node2), with AG1 acting as the secondary replica in the DAG. We explored a few articles, but none of them pointed to this setup being an obstacle to failing over AG1.
While exploring further, we found a Microsoft article indicating that this was a known bug and that we should apply the latest cumulative update, CU23, to both servers. After applying it, the issue was resolved.
Please refer to the link below:
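The patch level can be confirmed on both servers with SERVERPROPERTY; on cumulatively updated builds, ProductUpdateLevel reports the CU (e.g. CU23):

-- Verify build and update level after patching
SELECT SERVERPROPERTY('ProductVersion') AS [Build],
       SERVERPROPERTY('ProductLevel') AS [Level],
       SERVERPROPERTY('ProductUpdateLevel') AS [CU Level]
GO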
The queries below can be used to review an existing AG setup:
-- Replica-level health, role, and synchronization state for all availability replicas
SELECT r.replica_server_name, r.endpoint_url,
       rs.connected_state_desc, rs.role_desc, rs.operational_state_desc,
       rs.recovery_health_desc, rs.synchronization_health_desc,
       r.availability_mode_desc, r.failover_mode_desc
FROM sys.dm_hadr_availability_replica_states rs
INNER JOIN sys.availability_replicas r
    ON rs.replica_id = r.replica_id
ORDER BY r.replica_server_name
GO
-- Distributed availability groups and the state of their replicas
SELECT ag.name, ag.is_distributed, ar.replica_server_name, ar.availability_mode_desc,
       ars.connected_state_desc, ars.role_desc,
       ars.operational_state_desc, ars.synchronization_health_desc
FROM sys.availability_groups ag
JOIN sys.availability_replicas ar ON ag.group_id = ar.group_id
LEFT JOIN sys.dm_hadr_availability_replica_states ars
    ON ars.replica_id = ar.replica_id
WHERE ag.is_distributed = 1
GO
-- Database-level synchronization state per availability group
SELECT ag.name
     , drs.database_id
     , drs.group_id
     , drs.replica_id
     , drs.synchronization_state_desc
     , drs.end_of_log_lsn
FROM sys.dm_hadr_database_replica_states drs
INNER JOIN sys.availability_groups ag
    ON drs.group_id = ag.group_id
GO
-- All availability groups and their replicas, flagging distributed AGs
SELECT ag.[name] AS [AG Name], ag.is_distributed,
       ar.replica_server_name AS [Replica Name]
FROM sys.availability_groups AS ag
INNER JOIN sys.availability_replicas AS ar ON ag.group_id = ar.group_id
GO
-- Role and sync health of the underlying AGs within each distributed AG
SELECT ag.[name] AS [AG Name], ag.is_distributed, ar.replica_server_name AS [Underlying AG],
       ars.role_desc AS [Role], ars.synchronization_health_desc AS [SyncStatus]
FROM sys.availability_groups AS ag
INNER JOIN sys.availability_replicas AS ar ON ag.group_id = ar.group_id
INNER JOIN sys.dm_hadr_availability_replica_states AS ars ON ar.replica_id = ars.replica_id
WHERE ag.is_distributed = 1
GO
Happy Learning!