WordsOnTech

Fixing Application Latency and Timeout issues in Multi-Subnet Always ON SQL Server Setup

Issue

Application Team complained about latency in connecting to database through listener name. They also complained about application taking more time to connect to database after Always ON Group failover. Sometimes they observed application timeouts as well.

  1. Underlying database was part of Multi-Subnet Always ON SQL Server.
  2. Application was connecting to database using Listener IP. As it was Multi-Subnet database, Listener was having two IPs. One will remain online and another will remain offline depending upon which subnet was hosting the Always On Group at a particular time.

Observation and Analysis

We ran below PowerShell commands to have a look at all the Cluster resources. In this example :

Always ON group Name : AOAGTestCorpSQL Always ON Listener Name : TestcoreListen1 Assigned IPs to Listener : AAA.AA.AA.AAA (Currently Online ) , BBB.BB.BB.BBB (Currently Offline) Please note green highlighted Network Name, this would be of great use in fixing issue.

PS C:\Windows\system32> Import-Module FailoverClusters

PS C:\Windows\system32> Get-ClusterResource

Name                            State   OwnerGroup      ResourceType
----                            -----   ----------      ------------
AOAGTestCorpSQL                 Online  AOAGTestCorpSQL SQL Server Availability Group
AOAGTestCorpSQL_AAA.AA.AA.AAA   Online  AOAGTestCorpSQL IP Address
AOAGTestCorpSQL_BBB.BB.BB.BBB   Offline AOAGTestCorpSQL IP Address
AOAGTestCorpSQL_TestcoreListen1 Online  AOAGTestCorpSQL Network Name
Cluster IP Address              Online  Cluster Group   IP Address
Cluster IP Address DDD.DD.DD.DD Offline Cluster Group   IP Address
Cluster Name                    Online  Cluster Group   Network Name
File Share Witness              Online  Cluster Group   File Share Witness

Everything looked good at database end and it seemed to be more of a configuration issue.

We ran below PowerShell command to see what DNS returns for Always ON Listener Name :

PS C:\Windows\system32> nslookup Testcorelisten1
Server:  ActiveHostname.ec2.internal
Address:  ActiveNodeIP

Name:    Testcorelisten1.bhzTestcoreprod.local
Address:  AAA.AA.AA.AAA
Address:  BBB.BB.BB.BBB

We observed that both online and offline IPs were being retuned from DNS. We checked this from application server as well as DB server and got same result. While doing Test Connection from application server to database, sometime offline IP was being tried first and online IP later. This IP reporting looked completely random. This was resulting in latency in making connections.

Solution 1:

We asked for connection string from application team and observed that they were not using MultiSubnetFailover=true . This should be used for multi subnet setups. However it was not possible for them as it was a legacy application. Sometime making changes at application end is not desired by customer.

Solution 2:

We ran below PowerShell command to see values for two important highlighted settings:

PS C:\Windows\system32> Get-ClusterResource -Name AOAGTestCorpSQL_TestcoreListen1 | Get-ClusterParameter

Object                          Name                   Value                                   Type
------                          ----                   -----                                   ----
AOAGTestCorpSQL_TestcoreListen1 Name                   TestcoreLISTEN1                         String
AOAGTestCorpSQL_TestcoreListen1 DnsName                TestcoreListen1                         String
AOAGTestCorpSQL_TestcoreListen1 Aliases                                                        String
AOAGTestCorpSQL_TestcoreListen1 RemapPipeNames         1                                       UInt32
AOAGTestCorpSQL_TestcoreListen1 HostRecordTTL          1200                                     UInt32
AOAGTestCorpSQL_TestcoreListen1 RegisterAllProvidersIP 1                                       UInt32
AOAGTestCorpSQL_TestcoreListen1 PublishPTRRecords      0                                       UInt32
AOAGTestCorpSQL_TestcoreListen1 ADAware                1                                       UInt32
AOAGTestCorpSQL_TestcoreListen1 ResourceData           {1, 0, 0, 0...}                         ByteArray
AOAGTestCorpSQL_TestcoreListen1 StatusNetBIOS          0                                       UInt32
AOAGTestCorpSQL_TestcoreListen1 StatusDNS              0                                       UInt32
AOAGTestCorpSQL_TestcoreListen1 StatusKerberos         0                                       UInt32
AOAGTestCorpSQL_TestcoreListen1 CreatingDC             \\XXX-YYY.ZZZTestcoreprod.local String
AOAGTestCorpSQL_TestcoreListen1 LastDNSUpdateTime      5/13/2021 7:27:18 PM                    DateTime
AOAGTestCorpSQL_TestcoreListen1 ObjectGUID             4433a14ed328844cb715ef273ca4266d        String
AOAGTestCorpSQL_TestcoreListen1 DnsSuffix              ZZZTestcoreprod.local                   String
  1. RegisterAllProvidersIP : Default value is 1. To resolve DNS returning both online and offline IPs, we changed this value to 0. This helps in application only getting active listener IP from DNS in first instance itself.
  2. HostRecordTTL: Default value is 1200 (20 minutes). Various articles recommend 300 seconds (5 minutes), but this requires thorough testing to determine the right value for a particular environment.

Run below PowerShell to make above changes. These changes will come into effect after we take cluster resource(AOAGTestCorpSQL_TestcoreListen1) offline and online.

PS C:\Windows\system32> Get-ClusterResource "AOAGTestCorpSQL_TestcoreListen1"| Set-ClusterParameter RegisterAllProvidersIP 0

WARNING: The properties were stored, but not all changes will take effect until AOAGTestCorpSQL_TestcoreListen1 is taken offline and then online again.

PS C:\Windows\system32> Get-ClusterResource "AOAGTestCorpSQL_TestcoreListen1"| Set-ClusterParameter HostRecordTTL 300

WARNING: The properties were stored, but not all changes will take effect until AOAGTestCorpSQL_TestcoreListen1 is taken offline and then online again.

Once the changes are done, verify that required changes has been made. After doing these changes, DNS started returning only active IP. We tested it from both the nodes by doing failover as well.

Executed when Node1 is active 
PS C:\Windows\system32> nslookup Testcorelisten1
Name: Testcorelisten1.bhzTestcoreprod.local
Address: AAA.AA.AA.AAA

Executed when Node2 is active 
PS C:\Windows\system32> nslookup Testcorelisten1
Name: Testcorelisten1.bhzTestcoreprod.local
Address: BBB.BB.BB.BBB

This solved application latency and timeout issues as well.

Please comment if it helps you & if you see any other scenarios which are not considered and covered here.

Exit mobile version