Issue
Application Team complained about latency in connecting to database through listener name. They also complained about application taking more time to connect to database after Always ON Group failover. Sometimes they observed application timeouts as well.
- Underlying database was part of Multi-Subnet Always ON SQL Server.
- Application was connecting to database using Listener IP. As it was Multi-Subnet database, Listener was having two IPs. One will remain online and another will remain offline depending upon which subnet was hosting the Always On Group at a particular time.
Observation and Analysis
We ran below PowerShell commands to have a look at all the Cluster resources. In this example :
Always ON group Name : AOAGTestCorpSQL Always ON Listener Name : TestcoreListen1 Assigned IPs to Listener : AAA.AA.AA.AAA (Currently Online ) , BBB.BB.BB.BBB (Currently Offline) Please note green highlighted Network Name, this would be of great use in fixing issue.
PS C:\Windows\system32> Import-Module FailoverClusters
PS C:\Windows\system32> Get-ClusterResource
Name State OwnerGroup ResourceType
---- ----- ---------- ------------
AOAGTestCorpSQL Online AOAGTestCorpSQL SQL Server Availability Group
AOAGTestCorpSQL_AAA.AA.AA.AAA Online AOAGTestCorpSQL IP Address
AOAGTestCorpSQL_BBB.BB.BB.BBB Offline AOAGTestCorpSQL IP Address
AOAGTestCorpSQL_TestcoreListen1 Online AOAGTestCorpSQL Network Name
Cluster IP Address Online Cluster Group IP Address
Cluster IP Address DDD.DD.DD.DD Offline Cluster Group IP Address
Cluster Name Online Cluster Group Network Name
File Share Witness Online Cluster Group File Share Witness
Everything looked good at database end and it seemed to be more of a configuration issue.
We ran below PowerShell command to see what DNS returns for Always ON Listener Name :
PS C:\Windows\system32> nslookup Testcorelisten1
Server: ActiveHostname.ec2.internal
Address: ActiveNodeIP
Name: Testcorelisten1.bhzTestcoreprod.local
Address: AAA.AA.AA.AAA
Address: BBB.BB.BB.BBB
We observed that both online and offline IPs were being retuned from DNS. We checked this from application server as well as DB server and got same result. While doing Test Connection from application server to database, sometime offline IP was being tried first and online IP later. This IP reporting looked completely random. This was resulting in latency in making connections.
Solution 1:
We asked for connection string from application team and observed that they were not using MultiSubnetFailover=true . This should be used for multi subnet setups. However it was not possible for them as it was a legacy application. Sometime making changes at application end is not desired by customer.
Solution 2:
We ran below PowerShell command to see values for two important highlighted settings:
PS C:\Windows\system32> Get-ClusterResource -Name AOAGTestCorpSQL_TestcoreListen1 | Get-ClusterParameter
Object Name Value Type
------ ---- ----- ----
AOAGTestCorpSQL_TestcoreListen1 Name TestcoreLISTEN1 String
AOAGTestCorpSQL_TestcoreListen1 DnsName TestcoreListen1 String
AOAGTestCorpSQL_TestcoreListen1 Aliases String
AOAGTestCorpSQL_TestcoreListen1 RemapPipeNames 1 UInt32
AOAGTestCorpSQL_TestcoreListen1 HostRecordTTL 1200 UInt32
AOAGTestCorpSQL_TestcoreListen1 RegisterAllProvidersIP 1 UInt32
AOAGTestCorpSQL_TestcoreListen1 PublishPTRRecords 0 UInt32
AOAGTestCorpSQL_TestcoreListen1 ADAware 1 UInt32
AOAGTestCorpSQL_TestcoreListen1 ResourceData {1, 0, 0, 0...} ByteArray
AOAGTestCorpSQL_TestcoreListen1 StatusNetBIOS 0 UInt32
AOAGTestCorpSQL_TestcoreListen1 StatusDNS 0 UInt32
AOAGTestCorpSQL_TestcoreListen1 StatusKerberos 0 UInt32
AOAGTestCorpSQL_TestcoreListen1 CreatingDC \\XXX-YYY.ZZZTestcoreprod.local String
AOAGTestCorpSQL_TestcoreListen1 LastDNSUpdateTime 5/13/2021 7:27:18 PM DateTime
AOAGTestCorpSQL_TestcoreListen1 ObjectGUID 4433a14ed328844cb715ef273ca4266d String
AOAGTestCorpSQL_TestcoreListen1 DnsSuffix ZZZTestcoreprod.local String
- RegisterAllProvidersIP : Default value is 1. To resolve DNS returning both online and offline IPs, we changed this value to 0. This helps in application only getting active listener IP from DNS in first instance itself.
- HostRecordTTL: Default value is 1200 (20 minutes). Various articles recommend 300 seconds (5 minutes), but this requires thorough testing to determine the right value for a particular environment.
Run below PowerShell to make above changes. These changes will come into effect after we take cluster resource(AOAGTestCorpSQL_TestcoreListen1) offline and online.
PS C:\Windows\system32> Get-ClusterResource "AOAGTestCorpSQL_TestcoreListen1"| Set-ClusterParameter RegisterAllProvidersIP 0
WARNING: The properties were stored, but not all changes will take effect until AOAGTestCorpSQL_TestcoreListen1 is taken offline and then online again.
PS C:\Windows\system32> Get-ClusterResource "AOAGTestCorpSQL_TestcoreListen1"| Set-ClusterParameter HostRecordTTL 300
WARNING: The properties were stored, but not all changes will take effect until AOAGTestCorpSQL_TestcoreListen1 is taken offline and then online again.
Once the changes are done, verify that required changes has been made. After doing these changes, DNS started returning only active IP. We tested it from both the nodes by doing failover as well.
Executed when Node1 is active
PS C:\Windows\system32> nslookup Testcorelisten1
Name: Testcorelisten1.bhzTestcoreprod.local
Address: AAA.AA.AA.AAA
Executed when Node2 is active
PS C:\Windows\system32> nslookup Testcorelisten1
Name: Testcorelisten1.bhzTestcoreprod.local
Address: BBB.BB.BB.BBB
This solved application latency and timeout issues as well.
Please comment if it helps you & if you see any other scenarios which are not considered and covered here.