DELL MD3000i Issue with VMware ESX


  • DELL MD3000i Issue with VMware ESX

    Hi all,

    In my company we have installed an MD3000i iSCSI array.
    I've configured a VMware cluster with 2 HP DL380 G5 nodes, 2 Dell PowerConnect 5424 gigabit switches, and one MD3000i storage array with two controllers running the latest firmware and NVSRAM. Every link is properly redundant, and VMware sees 4 paths for every LUN (the storage does, too). Everything seems to work well until I create a LUN or perform some other standard management operation on the MD3000i: during those operations the VMware iSCSI driver faults because iSCSI timeouts occur, after that the VMware cluster fails, and the only thing I can do is reset the two servers from the power button.
    I've already involved DELL and VMware (and even HP) in the case. They have verified links and configurations, DELL has already replaced one storage controller, and we have checked the IRQs on the HP servers. We have verified that VMware can access the LUNs on different paths on different controllers... and after a month of many other checks, neither DELL nor VMware has been able to resolve the issue.

    In short, the VMware technicians have checked the logs from the cluster and have seen high response times from the storage (on the order of seconds). Inexorably, every time I create a LUN or do something similar, response times grow and the VMware iSCSI driver fails. The strange thing is that while the VMware cluster cannot see the storage even from the console, a physical Linux server with an Oracle DB on the same storage (on a different virtual disk) doesn't show any issue and keeps working normally.
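
    For reference, this is roughly how the failures can be watched from the ESX side while a LUN operation runs on the MD3000i (a minimal sketch, assuming the ESX 3.x service console; nothing here is specific to this setup):

    # Watch iSCSI aborts/timeouts appear in real time while a LUN is created or modified
    tail -f /var/log/vmkernel | grep -i iscsi
    # List the paths this host currently sees for each LUN (4 per LUN are expected here)
    esxcfg-mpath -l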


    Any help is appreciated.

    thank you

  • #2
    Re: DELL MD3000i Issue with VMware ESX

    Just a few questions, which Dell support might have already asked:
    1. have you tried connecting the MD directly to the servers, bypassing the switch?
    2. are you using the clustered version of NVRAM? is the LUN type set to Linux?
    3. if you set up a box as iSCSI target (linux/windows target software, or openfiler maybe), and hook it up to the HP servers running ESX, does that work?

    Basically, what you have is quite a few components:
    the HP servers, the VMware ESX software (with its built-in software initiator), the NICs in those servers, the LAN cables, the switch, and the MD box.
    Try to remove components in order to isolate the issue, because within the aforementioned (and probably not quite complete) list, any one of them might be at fault.

    I would call Dell again, ask to speak to someone who specializes in the MDs and VMware, have them clear the settings on the MD completely, and then have them walk me through the entire setup from scratch. After a month of troubleshooting, and knowing the guys at Dell, they would probably do that for you.
    Of course be ready to erase all data on the MD, and hook the MD up to the servers without the PCxxxx switch, so that you eliminate that switch from the equation.
    Real stupidity always beats Artificial Intelligence (c) Terry Pratchett

    BA (BM), RHCE, MCSE, DCSE, Linux+, Network+



    • #3
      Re: DELL MD3000i Issue with VMware ESX

      Let me preface this by saying that it is a production environment, so it is not easy to run tests that cause disruption. We don't have any backup production environment yet.
      1. have you tried connecting the MD directly to the servers, bypassing the switch?
      * No, because DELL's technicians told me that the switches are OK. I personally made some tests bypassing one of the two switches, but got the same problem.
      2. are you using the clustered version of NVRAM? is the LUN type set to Linux?
      * I'm using NVSRAM version N1532-735890-004, and the LUN type is set to Linux.
      3. if you set up a box as iSCSI target (linux/windows target software, or openfiler maybe), and hook it up to the HP servers running ESX, does that work?
      * I've not tried that yet, but the HPs have no problem connecting to the storage (obviously, when the storage doesn't have any work in progress).
      I can add some other information from VMware support. This is what a technician told me after some analysis:

      ********************************************************
      Regular vm-support logs:
      On both ESX hosts we can see multiple errors in vmkernel like:
      Mar 23 12:07:29 esx1 vmkernel: 1:23:28:36.484 cpu3:1079)iSCSI: session 0x6e2c190 sending mgmt 82507164 abort for itt 82505186 task 0x6e01ee0 cmnd 0x3f62c500 cdb 0x2a to (0 0 2 0) at 17091649
      Mar 23 12:07:29 esx1 vmkernel: 1:23:28:36.486 cpu3:1080)iSCSI: session 0x6e2c190 abort success for mgmt 82507164, itt 82505186, task 0x6e01ee0, cmnd 0x3f62c500, cdb 0x2a
      These are related to the performance problems described above and to the high latency in responses from the disk array. Since the disk array responds slowly, commands fail and are aborted.
      Based on the above, I do not believe the problem is software-related for now. To support this, we see the problem on two different servers at around the same time, and the errors are the same on both servers.

      I see the following as possible root causes of the problem for now, listed by likelihood:
      1. The SATA disks in RAID5 are too slow to respond, causing the delays you can see highlighted in the attached files.
      Recommendation:
      - engage Dell to measure disk and storage processor utilization during the working day.
      2. A possible IRQ sharing conflict influencing performance. Both ESX hosts use the same IRQs for different cards:
      vmknic4 and vmknic2 are used for iSCSI:
      011:00.0 8086:105e 103c:7044 Ethernet Intel 5/ 16/0x79 A V e1000 vmnic2
      014:00.0 8086:105e 103c:7044 Ethernet Intel 7/ 17/0x81 A V e1000 vmnic4
      the same IRQs (16, 17) are used by the Intel and Broadcom NICs:
      003:00.0 14e4:164c 103c:7038 Ethernet Broadcom 5/ 16/0x79 A V bnx2 vmnic0
      005:00.0 14e4:164c 103c:7038 Ethernet Broadcom 7/ 17/0x81 A V bnx2 vmnic1
      and USB is also using those IRQs:
      000:29.1 8086:2689 103c:31fe USB Intel 7/ 17/0x81 B C
      000:29.0 8086:2688 103c:31fe USB Intel 5/ 16/0x79 A C
      So you can see that IRQs 16 and 17 are shared between the Intel, Broadcom and USB devices.
      Recommendations:
      - disable USB in the server BIOS as a test
      - try using different IRQs for the Intel and Broadcom NICs
      3. There is an advanced setting for ESX:
      DiskMaxIOSize (Max Disk READ/WRITE I/O size before splitting (in KB)) [32-32767: default = 32767]: 32767
      So by default ESX will send up to 32 MB of data per I/O to the disk array. It could be that the disk array is not able to process that much data from multiple ESX servers.
      If you have tried the suggestions from point 2 and there was no improvement, try reducing this setting on all ESX hosts to 1024.
      If you see an improvement you can leave it; otherwise, please change it back to the original value.
      Maybe you can try to implement this outside peak hours, in case it does not work well for your environment.
      ********************************************************

      Actually, HP excluded any issue related to IRQs. Sharing IRQs is normal, and it is not possible to avoid it from the BIOS. From the VMware console I deactivated the USB service.
      I've not implemented point 3 yet.
      The ESX servers have gigabit NICs with the software initiator; jumbo frames are not configured.
      It seems that some iSCSI commands are not recognized by the storage, so it responds to those commands with timeouts.
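
      For completeness, point 3 can be applied from the service console roughly like this (a minimal sketch, assuming ESX 3.x, where esxcfg-advcfg is available; values as suggested by the VMware technician):

      # Show the current value of the advanced option (default 32767 KB)
      esxcfg-advcfg -g /Disk/DiskMaxIOSize
      # Cap the maximum I/O size at 1024 KB on each ESX host, as recommended in point 3
      esxcfg-advcfg -s 1024 /Disk/DiskMaxIOSize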



      • #4
        Re: DELL MD3000i Issue with VMware ESX

        how is iSCSI set up, by the way? any encoding/authentication/restrictions in place?

        I know for certain the MD3000i works perfectly well with ESX installed on Dell PE servers, since that is what we have tested the MD's with. So there might be some kind of issue with HP interoperability, although I really doubt it.

        Maybe, if possible, get a generic Intel e1000 server series NIC into those servers, to make sure the HP o/b NICs are not at fault here.

        Another point: SATA drives _are_ slow, mainly not because they are in RAID5, but because the controller in the MD is a SAS controller, so there is a constant ATA-to-SCSI conversion overhead.

        As for the IRQs: IRQ sharing is nice and fine, but try to disable as many devices as possible, to ensure the IRQs are not being fought over by different NICs. Disable all the Broadcoms in the BIOS, disable USB as well, and see what happens.


        PS: the NVSRAM is the correct one, but from my own experience, ask a Dell techie to send you a couple of older versions, just for testing. Remember to allow the MD some time to update both controllers: you only update one, and the new NVSRAM is then passed from the updated controller to the second one.
        Last edited by DYasny; 31st March 2009, 15:23.
        Real stupidity always beats Artificial Intelligence (c) Terry Pratchett

        BA (BM), RHCE, MCSE, DCSE, Linux+, Network+



        • #5
          Re: DELL MD3000i Issue with VMware ESX

          Thank you for the reply.
          how is iSCSI set up, by the way? any encoding/authentication/restrictions in place?
          * No, there is no CHAP or any other kind of authentication or encryption in place.
          I know for certain the MD3000i works perfectly well with ESX installed on Dell PE servers, since that is what we have tested the MD's with. So there might be some kind of issue with HP interoperability, although I really doubt it.
          * me too

          Maybe, if possible, get a generic Intel e1000 server series NIC into those servers, to make sure the HP o/b NICs are not at fault here.
          * Yes, that could be a test, like the switch one... but we have 4 dual-port HP NICs that are certified for those ProLiant servers.

          Another point: SATA drives _are_ slow, mainly not because they are in RAID5, but because the controller in the MD is a SAS controller, so there is a constant ATA-to-SCSI conversion overhead.
          * Yes, I know. That could be one cause of the delays, but why don't DELL's technicians say that the reason for this problem is that the disks are too slow, and why don't they suggest trying some 15,000 rpm SAS disks?

          As for the IRQs: IRQ sharing is nice and fine, but try to disable as many devices as possible, to ensure the IRQs are not being fought over by different NICs. Disable all the Broadcoms in the BIOS, disable USB as well, and see what happens.
          * I've already done what was possible, but with no success. I cannot disable...

          PS: the NVSRAM is the correct one, but from my own experience, ask a Dell techie to send you a couple of older versions, just for testing. Remember to allow the MD some time to update both controllers: you only update one, and the new NVSRAM is then passed from the updated controller to the second one.
          * DELL hasn't proposed any older version... but I haven't understood how I can allow the MD to update both controllers. I've used MDSM, and I thought that everything was automatic.



          • #6
            Re: DELL MD3000i Issue with VMware ESX

            Originally posted by SpssSYS View Post
            * Yes, that could be a test, like the switch one... but we have 4 dual-port HP NICs that are certified for those ProLiant servers.
            have you tried breaking the NIC bonds and using only a single NIC for iSCSI communication?

            * Yes, I know. That could be one cause of the delays, but why don't DELL's technicians say that the reason for this problem is that the disks are too slow, and why don't they suggest trying some 15,000 rpm SAS disks?
            Because the MD works fine when you use a simple Linux-based system as the initiator instead of ESX. It really looks like a VMware issue to me, not hardware-related at all.
            What I suggest is to remove Dell from the equation completely: take a simple test box, install Openfiler (or a Linux iSCSI target), set it up as the iSCSI target, and connect to it through another gigabit switch as well. Try the usual tests and see if that works better for you than the MD3000i. That way you will be able to remove at least one vendor from the equation.
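
            Something like this would do for the target side (a minimal sketch, assuming a spare Linux box with the iSCSI Enterprise Target (iscsitarget/ietd) package installed; the IQN, backing device and init script name below are placeholders and will differ per distribution):

            # /etc/ietd.conf -- export one test LUN over iSCSI
            Target iqn.2009-04.local.testbox:esxtest
                Lun 0 Path=/dev/sdb,Type=fileio

            # then start the target daemon and point the ESX software initiator at this box
            /etc/init.d/iscsi-target start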

            DELL hasn't proposed any older version... but I haven't understood how I can allow the MD to update both controllers. I've used MDSM, and I thought that everything was automatic.
            everything is automagical in MDSM, but you have to give it some time to work
            Real stupidity always beats Artificial Intelligence (c) Terry Pratchett

            BA (BM), RHCE, MCSE, DCSE, Linux+, Network+



            • #7
              Re: DELL MD3000i Issue with VMware ESX

              Originally posted by DYasny View Post
              have you tried breaking the NIC bonds and using only a single NIC for iSCSI communication?
              [...]
              Let me give you some more information: we bought the storage (single controller) about one year ago. We used it with a physical Red Hat Enterprise server running an Oracle DB (with only a single Intel gigabit NIC) and with a VMware ESX Standard server (one dual-port Intel gigabit NIC from DELL, with a single link/path). Between the servers and the storage we had only one PowerConnect 5424. On the storage there were only 2 VMs (the others were on the local VMware server). Everything worked fine, even during LUN creation.
              After that, we now have 2 controllers, 2 switches, 2 new VMware servers and about 15 VMs on the storage. In the past I could say the storage was less heavily used than it is today, but...

              Originally posted by DYasny View Post
              It really looks like a VMware issue to me, not hardware related at all.
              VMware's technicians are not of the same opinion.



              • #8
                Re: DELL MD3000i Issue with VMware ESX

                Well, if you want to be certain, do as I said and set up a test iSCSI target to compare performance.

                VMware techs, like all support techs everywhere, tend to push complicated issues away.
                Real stupidity always beats Artificial Intelligence (c) Terry Pratchett

                BA (BM), RHCE, MCSE, DCSE, Linux+, Network+



                • #9
                  Re: DELL MD3000i Issue with VMware ESX

                  Hi guys

                  I would like to weigh in here with my experiences. We are having this issue also. In our case, if we use ShadowProtect from within a guest Windows OS, we get these errors consistently: the entire VMware machine crumples to its knees for about 30 seconds, then slowly ramps back up, and the whole process happens again.

                  The scenario we have had this on is basically an HP DL380, using both the onboard Broadcom ToE NICs and HP dual-port add-on cards. Same outcome, although the add-in cards (non-ToE) only cause the same problems occasionally, in general operations.

                  The SANs we have tested are Infortrend Eonstore ES S12E-R1132-4 (dual controller) and the single controller model (can't remember the model number). The firmware is the latest FA36.P07.

                  We have the same problem, with VMware blaming the SAN vendor. However, I did some comprehensive testing which points to VMware.

                  1. When the ESX server is in its "dead" phase, another PC (my laptop) running Windows with its iSCSI initiator, using the same port on the SAN but a different LUN, is still transacting perfectly. Neither the server nor the PC ever drives the iSCSI channel anywhere near saturation; I would say about 500 Mb/s full duplex with both machines at full tilt. This shows that the switch, the SAN port and the SAN logic are not falling over (see the console sketch after point 2 for checking the same thing from the ESX side).

                  2. If I run the Windows iSCSI initiator from within the guest OS (Windows) directly to the SAN via the VMware virtual networking, I can go as hard and fast as I like and there are no issues whatsoever. This proves that the VMware virtual networking, the switch, etc. are working perfectly, and really shows that we have end-to-end network and SAN reliability.
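
                  During the "dead" phase the same thing can be checked from the ESX side (a minimal sketch, assuming the ESX 3.x service console; the iSCSI portal IP below is a placeholder):

                  # Check whether the VMkernel network path to the SAN still answers during a stall
                  vmkping 192.168.130.101
                  # And list the paths, and their state, as ESX currently sees them for the LUNs
                  esxcfg-mpath -l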

                  The problem is definitely in the VMware SCSI subsystem. I have seen a lot of similar issues posted on forums for internal RAID as well. Obviously the iSCSI errors are not in the log in those cases, but the same LINScsi errors are.

                  I am tearing my hair out at being bounced around between vendors. I believe it is a VMware issue, and I am sure there is a relatively simple tuning fix within VMware.

                  I just don't know how to find it, and we are supposed to be going live with a rollout next weekend. A site we have now is just hanging in there, but the problem shows its face from time to time, and I need to fix that one before the customer loses patience.

                  Regards

                  Mark Dutton
                  Datamerge

