Avg. Disk Queue Length


  • Avg. Disk Queue Length

    Our Exchange 2003 server has been suffering performance issues the past few days, with users getting the "Outlook is retrieving data from the Microsoft Exchange Server" message.

    I've been getting an unusual number of alerts from IP Monitor:

    The threshold rate of the Disk Read Bytes per Second counter has been exceeded, indicating a high level of Input/Output (I/O) activity.; rtt: 5532ms
    Perfmon shows "Avg. Disk Queue Length" holding steadily near 100, and Pages/sec frequently spiking to 1.0.

    While researching possible causes, I came across another thread about "Avg. Disk Queue Length" at http://forums.petri.com/showthread.php?t=22855 , where Dumber wrote

    I've seen such thing before and as it turned out it was an empty battery of the controller.
    Since I had replaced the battery for the RAID controller six weeks ago, and had updated the firmware, I didn't think that was the problem. But I checked anyway.

    Sure enough, HP's Array Diagnostic Utility report indicated that the battery was dead, and the cache on the storage controller had been disabled because of that.

    Report for Smart Array 6i in slot 0
    -----------------------------------

    Smart Array 6i in slot 0 : Device Error Report

    Device Severity Error
    ------------------------ -------- ---------------------------------
    Smart Array 6i in slot 0 Critical The cache is temporarily disabled

    . . .

    Cache is temporarly [sic] disabled due to low battery voltage (0x0001)
    Hewlett-Packard tech support stated that the dead battery is probably the cause of the disk performance issues; they have seen this happen before.

    The good news is that I'm supposed to get a replacement battery later today, so I can replace it during the weekend. I'll update this post whether or not that resolves this issue.




    Attached Files
    Last edited by Robert R.; 3rd April 2009, 18:34.

  • #2
    Re: Avg. Disk Queue Length

    Thanks for the case study! Such a large disk queue length is startling, to say the least. I thought it was bad if it consistently showed a number of 2 or more. 100 is somewhere between "Get the fire extinguisher!" and "AAAAAHHHHHHHHHH!" Let us know what happens.
    Wesley David
    LinkedIn | Careers 2.0
    -------------------------------
    Microsoft Certifications: MCSE 2003 | MCSA:Messaging 2003 | MCITP:EA, SA, EST | MCTS: a'plenty | MCDST
    Vendor Neutral Certifications: CWNA
    Blog: www.TheNubbyAdmin.com || Twitter: @Nonapeptide || GTalk, Reader and Google+: [email protected] || Skype: Wesley.Nonapeptide
    Goofy kitten avatar photo from Troy Snow: flickr.com/photos/troysnow/

    Comment


    • #3
      Re: Avg. Disk Queue Length

      The rule of thumb for Avg. Disk Queue Length is 2 per spindle, so if you have a 3-drive array the Avg. Disk Queue Length counter should not exceed 6 for extended periods. An occasional spike is not indicative of a problem unless the spikes are continuous and consistent.
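
As a rough illustration of the 2-per-spindle rule of thumb above, here is a minimal Python sketch. The function names and the `min_fraction` cutoff for "sustained" are assumptions made for this example, not anything stated in the thread.

```python
def queue_length_ceiling(spindles: int, per_spindle: float = 2.0) -> float:
    """Upper bound for a healthy sustained Avg. Disk Queue Length,
    using the 2-per-spindle rule of thumb."""
    if spindles < 1:
        raise ValueError("an array needs at least one spindle")
    return spindles * per_spindle


def is_sustained_problem(samples, spindles, min_fraction=0.8):
    """True when most samples exceed the ceiling, i.e. the load is
    sustained rather than an occasional spike.
    min_fraction is an illustrative assumption."""
    if not samples:
        return False
    ceiling = queue_length_ceiling(spindles)
    over = sum(1 for value in samples if value > ceiling)
    return over / len(samples) >= min_fraction


print(queue_length_ceiling(3))                      # ceiling for a 3-drive array
print(is_sustained_problem([100] * 10, 3))          # steadily near 100
print(is_sustained_problem([1, 1, 50, 1, 1], 3))    # a single spike
```

Note that a later reply in this thread argues the rule itself is unreliable behind RAID controllers and SANs, so treat the result as a hint at best.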

      Comment


      • #4
        Re: Avg. Disk Queue Length

        The new battery was installed without any problem, and the HP Array Diagnostic Utility shows that it charged within a few hours. The cache on the storage controller is working now. I'll keep monitoring it to see if it fails within a few weeks, as the previous one did.

        So far, I haven't had any complaints about Outlook performance, except on my own computer. But I run Windows in a virtual machine on a Mac, which is often kind o' slow and has its own issues. Also, it's only 3 working hours into Monday...


        However, the Avg. Disk Queue length on Exchange looks the same as it did last week.

        Is there any way to monitor a specific disk with Perfmon? If not, is there another tool that will?

        I should explain our setup:
        Exchange 2003 Enterprise Edition
        Windows 2003 Standard Edition
        Server: HP ProLiant DL380 with 4 GB of RAM
        C (system): two local disks (Ultra 320 SCSI 10K in RAID 1)
        D (page file, and anti-virus quarantine store): two local disks (Ultra 320 SCSI 10K in RAID 1)
        E (mail store): SAN with 4 GB fiber connection
        F: CD drive
        G (log files): SAN with 4 GB fiber connection
        Obviously, I expect a lot of disk activity to and from the mail store, since we have approximately 1,000 users. However, I don't know what a normal disk queue length for the SAN is, because (1) I shamefully admit it's not something I've watched closely before, and (2) we just moved our mail store to a SAN a few weeks ago, from an old Hitachi disk array.

        Does Perfmon monitor external disks, such as our SAN, or just the local disks?


        FYI: The page file configuration for this server is:
        C: 300 MB to 300 MB
        D: 2,048 MB to 4,096 MB
        Total paging file size for all drives:
        Minimum allowed: 16 MB
        Recommended: 5,374 MB
        Currently allocated 2,348 MB
        Any help, insight, or suggestions would be greatly appreciated.
        Last edited by Robert R.; 6th April 2009, 18:45.

        Comment


        • #5
          Re: Avg. Disk Queue Length

          Originally posted by Robert R. View Post
          Is there any way to monitor a specific disk with Perfmon? If not, is there another tool that will?
          You'd add a counter "PhysicalDisk" and, of course, choose the proper instance for the physical disk that you're most interested in.
          Wesley David
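
For anyone who prefers the command line to the Perfmon GUI, the same per-disk counter can be sampled with `typeperf`. The Python sketch below only builds the counter paths and the command line; it does not run anything. The instance names `"0 C:"` and `"16 E:"` are hypothetical examples (the real names are whatever the "Select instances from list" window shows on your server), and the helper names are made up for this example.

```python
def physical_disk_counter(instance: str,
                          counter: str = "Avg. Disk Queue Length") -> str:
    """Build a Perfmon counter path for one PhysicalDisk instance.
    Instance names look like "0 C:" or "_Total"."""
    return f"\\PhysicalDisk({instance})\\{counter}"


def typeperf_command(instances, interval_s: int = 5, samples: int = 12):
    """Argument list for typeperf: one counter path per disk instance,
    -si for the sample interval in seconds, -sc for the sample count."""
    paths = [physical_disk_counter(name) for name in instances]
    return ["typeperf", *paths, "-si", str(interval_s), "-sc", str(samples)]


# Hypothetical instance names; check the real ones in Perfmon first.
for arg in typeperf_command(["0 C:", "16 E:"]):
    print(arg)
```

The resulting list can be handed to `subprocess.run` on the server itself. Sampling each disk separately avoids the `_Total` instance hiding which disk is actually queuing.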

          Comment


          • #6
            Re: Avg. Disk Queue Length

            You'd add a counter "PhysicalDisk" and, of course, choose the proper instance for the physical disk that you're most interested in.

            Thank you.

            I should have caught that the first time around. And if I had scrolled down a little bit in the "Select instances from list" window, I probably would have. Lesson learned.



            FYI: Disks 0 - 13 form the spanned H: disk, which is the old Hitachi disk array that's still attached.
            Attached Files
            Last edited by Robert R.; 6th April 2009, 18:41.

            Comment


            • #7
              Re: Avg. Disk Queue Length

              Here's the Disk Queue monitor for individual disks:


              C (system): two local disks (Ultra 320 SCSI 10K in RAID 1)
              D (page file, and anti-virus quarantine store): two local disks (Ultra 320 SCSI 10K in RAID 1)
              E (mail store): SAN with 4 GB fiber connection
              F: CD drive
              G (log files): SAN with 4 GB fiber connection
              Attached Files
              Last edited by Robert R.; 6th April 2009, 22:06.

              Comment


              • #8
                Re: Avg. Disk Queue Length

                Can you switch to report view and post another screenshot? It's a little tough to see what's happening from all the lines on the graph.

                As a side note, I always keep Perfmon in report view for two reasons:

                1. It's easier to read.
                2. There's much less performance overhead in Perfmon.

                Sometimes, if you have a lot of counters and you are viewing them in graph mode, Perfmon gets very sluggish when moving from one counter to another or when adding and removing counters.

                Comment


                • #9
                  Re: Avg. Disk Queue Length

                  Here's a screenshot of the report view:




                  Even though the current graph view shows the Disk Queue Length for E:, the SAN with the mail store, pegged at or near 100, the report view fluctuates between 1.x and 3.x. I don't understand this discrepancy.
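
One possible explanation for this discrepancy, not confirmed anywhere in this thread: Perfmon's graph view plots each counter multiplied by its per-counter scale factor and cannot draw above the chart's vertical maximum, whereas report view shows the raw value. If the scale for this counter were 100 and the chart maximum were 100 (both are assumptions here), a raw queue length between 1 and 3 would be drawn as a line pegged at the top. A minimal Python sketch of that arithmetic, with `plotted_value` as a hypothetical helper:

```python
def plotted_value(raw: float, scale: float, vertical_max: float = 100.0) -> float:
    """Height at which a graph draws a sample: the raw counter value
    times the counter's scale factor, clipped to the top of the chart."""
    return min(raw * scale, vertical_max)


# Raw values as seen in report view, drawn with an assumed scale of 100.
for raw in (1.2, 3.4):
    print(raw, "->", plotted_value(raw, scale=100.0))

# The same values with a scale of 1 stay far below the top of the chart.
for raw in (1.2, 3.4):
    print(raw, "->", plotted_value(raw, scale=1.0))
```

The scale actually in use can be checked on the Data tab of the graph's properties.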


                  Also, I was doing some monitoring with Sysinternals DiskMon.

                  One of the things I noticed last week was that read and write operations to the mail store disk were taking about 2 1/2 seconds. Today, the longest read operations are taking about 1/10 that. And I have not seen anything over 3/10 of a second today.


                  Thursday, April 2, 2009 (DiskMon columns: #, time, duration in seconds, disk, request, sector, length):

                  165 2:34:00.226 PM 0.05002975 16 Read 547158671 8
                  166 2:34:00.226 PM 2.52897263 16 Read 619274943 8
                  167 2:34:00.241 PM 0.04738808 16 Read 1274727207 8
                  168 2:34:00.241 PM 2.52897263 16 Write 760027447 8
                  169 2:34:00.241 PM 2.52897263 16 Write 611642559 8
                  170 2:34:00.241 PM 2.52897263 16 Write 636573167 8
                  171 2:34:00.241 PM 2.52897263 16 Write 50842399 8
                  172 2:34:00.241 PM 2.52897263 16 Write 45019767 8
                  173 2:34:00.241 PM 2.52897263 16 Write 367776239 8
                  174 2:34:00.241 PM 2.52897263 16 Write 367246607 8
                  175 2:34:00.241 PM 2.52897263 16 Write 365259295 8
                  176 2:34:00.241 PM 2.52897263 16 Write 367765655 8
                  177 2:34:00.241 PM 2.52897263 16 Write 362552023 8
                  178 2:34:00.163 PM 0.05002975 16 Read 519318591 8
                  179 2:34:00.163 PM 0.05002975 16 Read 519318631 8
                  180 2:34:00.194 PM 0.04342079 16 Read 772580079 8
                  181 2:34:00.194 PM 0.05002975 16 Read 547158727 8
                  182 2:34:00.210 PM 0.04443169 16 Read 633682647 8
                  183 2:34:00.210 PM 0.05002975 16 Read 547158687 8
                  184 2:34:00.226 PM 0.05002975 16 Read 547158679 8
                  185 2:34:00.226 PM 2.52897263 16 Read 785034183 8
                  186 2:34:00.226 PM 0.15061378 17 Write 4425715 2
                  187 2:34:00.226 PM 0.05002975 16 Read 547158663 8
                  188 2:34:00.226 PM 0.15061378 17 Write 4425718 1
                  189 2:34:00.226 PM 2.52897263 16 Read 751806871 8



                  Monday, April 6, 2009:
                  325 2:42:40.344 PM 0.26235580 16 Read 345693135 16
                  326 2:42:40.344 PM 0.26235580 16 Read 345694767 32
                  327 2:42:40.344 PM 0.26235580 16 Read 345694991 8
                  328 2:42:40.359 PM 0.00664711 16 Read 1281360311 8
                  329 2:42:40.594 PM 0.00402451 16 Read 248450175 8
                  330 2:42:40.609 PM 0.00402451 16 Read 215853431 8
                  331 2:42:40.672 PM 0.00407219 16 Read 305434767 8
                  332 2:42:40.687 PM 0.00654221 16 Read 519463415 8
                  333 2:42:40.719 PM 0.26235580 16 Read 243118887 8
                  334 2:42:40.797 PM 0.00654221 16 Read 544147943 8
                  335 2:42:40.812 PM 0.00654221 16 Read 558578423 8
                  336 2:42:40.828 PM 0.00654221 16 Read 506974247 8
                  337 2:42:40.828 PM 0.01669884 17 Write 25544774 1
                  338 2:42:40.875 PM 0.26235580 16 Read 507286335 8
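
Eyeballing these listings gets tedious, so here is a minimal Python sketch that pulls out the slowest request per disk. It assumes the columns are sequence number, clock time, AM/PM, duration in seconds, disk number, request type, sector, and length, which matches the lines above but is an interpretation of the output rather than something the post spells out. The function names are made up for this example.

```python
from collections import defaultdict


def parse_diskmon_line(line: str) -> dict:
    """Split one whitespace-separated DiskMon line into named fields."""
    seq, clock, ampm, duration, disk, request, sector, length = line.split()
    return {
        "seq": int(seq),
        "time": f"{clock} {ampm}",
        "duration_s": float(duration),
        "disk": int(disk),
        "request": request,
        "sector": int(sector),
        "length": int(length),
    }


def slowest_by_disk(lines) -> dict:
    """Map (disk, request type) to the longest duration seen, in seconds."""
    worst = defaultdict(float)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines between listings
        rec = parse_diskmon_line(line)
        key = (rec["disk"], rec["request"])
        worst[key] = max(worst[key], rec["duration_s"])
    return dict(worst)


# A few lines copied from the Thursday listing above.
thursday = """\
165 2:34:00.226 PM 0.05002975 16 Read 547158671 8
166 2:34:00.226 PM 2.52897263 16 Read 619274943 8
168 2:34:00.241 PM 2.52897263 16 Write 760027447 8
186 2:34:00.226 PM 0.15061378 17 Write 4425715 2
""".splitlines()

for (disk, request), seconds in sorted(slowest_by_disk(thursday).items()):
    print(f"disk {disk} {request}: {seconds:.3f} s")
```

Running the same function over the Monday listing makes the before and after comparison a one-line check per disk.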
                  Attached Files
                  Last edited by Robert R.; 7th April 2009, 12:34.

                  Comment


                  • #10
                    Re: Avg. Disk Queue Length

                    The battery for the storage controller died over the weekend.

                    Here's a sample from Sysinternals' Diskmon. Note that the read times for Disk 16 -- the S.A.N. with the mail store -- are nearly 2 seconds; up from 0.26 seconds last week. It's only two working hours into Monday. I have a feeling that tomorrow is going to be worse.
                    20 10:08:39.107 AM 0.03487587 16 Read 368767583 8
                    21 10:08:39.123 AM 0.00254631 14 Write 291128 15
                    22 10:08:39.138 AM 1.93865776 16 Read 368769975 8
                    23 10:08:39.138 AM 1.93865776 16 Read 368770543 8
                    24 10:08:39.154 AM 0.15095711 16 Read 88813503 16
                    25 10:08:39.154 AM 0.15095711 16 Read 419274343 16
                    26 10:08:39.154 AM 0.15095711 16 Read 438309111 16
                    27 10:08:39.154 AM 0.15095711 16 Read 438309143 8
                    28 10:08:39.154 AM 1.93865776 16 Read 368766623 8
                    29 10:08:39.154 AM 0.25546074 16 Read 335122431 8
                    30 10:08:39.170 AM 0.15095711 16 Read 88813479 16
                    31 10:08:39.170 AM 0.15095711 16 Read 419274343 16
                    32 10:08:39.170 AM 1.93865776 16 Read 358543199 8
                    33 10:08:39.170 AM 0.15095711 16 Read 438309111 16
                    34 10:08:39.170 AM 1.93865776 16 Read 306544031 8
                    35 10:08:39.185 AM 0.15095711 16 Read 438309143 8
                    36 10:08:39.185 AM 1.93865776 16 Read 306548527 8
                    37 10:08:39.185 AM 0.15095711 16 Read 419274311 8
                    38 10:08:39.201 AM 0.01739502 16 Read 419274359 32
                    39 10:08:39.201 AM 0.01739502 16 Read 419274391 32
                    40 10:08:39.217 AM 1.93865776 16 Read 349893551 8
                    41 10:08:39.217 AM 0.03488541 16 Read 349893575 8
                    42 10:08:39.232 AM 1.93865776 16 Read 1224713495 8
                    43 10:08:39.232 AM 1.93865776 16 Read 1224727799 8
                    44 10:08:39.232 AM 1.93865776 16 Read 1224800671 8
                    45 10:08:39.232 AM 1.93865776 16 Read 1224798047 8
                    46 10:08:39.279 AM 0.04795074 16 Read 271742351 8
                    47 10:08:39.279 AM 1.93865776 16 Read 1243230623 8
                    48 10:08:39.279 AM 1.93865776 16 Read 1243196479 16
                    49 10:08:39.295 AM 1.93865776 16 Read 1243194831 8


                    In Perfmon, the Disk Queue Length for system disk C: -- the thick yellow line -- hovers around 20, but spikes to 100.



                    Attached Files
                    Last edited by Robert R.; 13th April 2009, 17:23.

                    Comment


                    • #11
                      Re: Avg. Disk Queue Length

                      Folks, do NOT trust this counter. The guidance that queue length should not exceed 2 per spindle is from the old days when RAID controllers and SANs were scarce. This is a VERY unreliable counter to look at.
                      For better counters take a look at my post here: http://forums.petri.com/showthread.p...740#post158740

                      And if you have a Premier contract with MS, we actually teach these things at the Vital Signs workshop. Basically, there are around 20 counters you need to concentrate on when analyzing server performance. All the rest are more like helper counters to the core ones. Ping your TAM for more info.
                      Guy Teverovsky
                      "Smith & Wesson - the original point and click interface"

                      Comment


                      • #12
                        Re: Avg. Disk Queue Length

                        Maybe you could shed some light on these 20 counters.

                        Comment


                        • #13
                          Re: Avg. Disk Queue Length

                          The basics are covered here:
                          http://technet.microsoft.com/en-us/m...e.aspx?pr=blog

                          I also suggest reading the AskPerf blog on TechNet: http://blogs.technet.com/askperf/arc...e/default.aspx - it has very good coverage of some counters and general bottleneck troubleshooting techniques.
                          Guy Teverovsky
                          "Smith & Wesson - the original point and click interface"

                          Comment


                          • #14
                            Re: Avg. Disk Queue Length

                            Thanks much Guy.

                            Comment
