Deployment problem on Non-Domain Windows 2008R2 x64 server


  • Deployment problem on Non-Domain Windows 2008R2 x64 server

    I have SCOM 2007 R2 monitoring most of the servers in my company. I had no major problems with it until I started trying to monitor workgroup computers rather than just domain ones. It took a lot of time and manipulation to get the certificates working so the agent would even communicate with the RMS, but I got that far on two servers yesterday. One (Windows Server 2003 R2 x86) seems to be working great, but the other, a Windows Server 2008 R2 x64 box, is behaving very strangely.
    It can't be the certificate, because if that were broken the agent could not have been approved in SCOM in the first place. There's only one other server with that OS being monitored in our environment and it's not having a problem, but it is also in the domain, which may or may not be relevant.
    I installed the 64-bit agent on the server to match the OS, and the MOMCertImport tool I ran was also from the AMD64 folder. The Operations Manager event log on the server fills up with this error:

    Log Name: Operations Manager
    Source: Health Service Modules
    Date: 2/24/2011 12:29:02 AM
    Event ID: 26008
    Task Category: None
    Level: Error
    Keywords: Classic
    User: N/A
    Computer: Corp-Webdev-01
    Description:
    The Operations Manager event log on computer 'Corp-Webdev-01' is still corrupt. The Event Log Provider will attempt to recover by skipping over a possible bad record. The Provider may skip up to two records.

    One or more workflows were affected by this.

    Workflow name: Microsoft.SystemCenter.PowerManagement.Windows.Server.2008.PowerPlan.EventTriggerDiscovery
    Instance name: Corp-Webdev-01
    Instance ID: {C6CD9BA5-161B-D57C-3E99-943846FCC713}
    Management group: MANAGEMENT_GROUP_NAME
    Event Xml:
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
    <System>
    <Provider Name="Health Service Modules" />
    <EventID Qualifiers="0">26008</EventID>
    <Level>2</Level>
    <Task>0</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2011-02-24T05:29:02.000000000Z" />
    <EventRecordID>290124157</EventRecordID>
    <Channel>Operations Manager</Channel>
    <Computer>Corp-Webdev-01</Computer>
    <Security />
    </System>
    <EventData>
    <Data>MANAGEMENT_GROUP_NAME</Data>
    <Data>Microsoft.SystemCenter.PowerManagement.Windows.Server.2008.PowerPlan.EventTriggerDiscovery</Data>
    <Data>Corp-Webdev-01</Data>
    <Data>{C6CD9BA5-161B-D57C-3E99-943846FCC713}</Data>
    <Data>Operations Manager</Data>
    <Data>0</Data>
    <Data>Illegal operation attempted on a registry key that has been marked for deletion.
    </Data>
    <Data>Corp-Webdev-01</Data>
    <Data>
    </Data>
    </EventData>
    </Event>

    and the same event ID with slightly different details:
    <EventData>
    <Data>MANAGEMENT_GROUP_NAME</Data>
    <Data>many</Data>
    <Data>many</Data>
    <Data>many</Data>
    I've tried clearing the log (since it says it's corrupt), repairing the agent, and repairing then immediately clearing the log. Repairing the agent or restarting the health service makes that error stop, but then the following appear:

    Log Name: Operations Manager
    Source: Health Service Modules
    Date: 2/24/2011 8:23:44 AM
    Event ID: 26017
    Task Category: None
    Level: Warning
    Keywords: Classic
    User: N/A
    Computer: Corp-Webdev-01
    Description:
    The Windows Event Log Provider monitoring the Operations Manager Event Log is 474 minutes behind in processing events. This can occur when the provider is restarted after being offline for some time, or there are too many events to be handled by the workflow.

    One or more workflows were affected by this.

    Workflow name: many
    Instance name: many
    Instance ID: many
    Management group: MANAGEMENT_GROUP_NAME
    Event Xml:
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
    <System>
    <Provider Name="Health Service Modules" />
    <EventID Qualifiers="0">26017</EventID>
    <Level>3</Level>
    <Task>0</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2011-02-24T13:23:44.000000000Z" />
    <EventRecordID>290124303</EventRecordID>
    <Channel>Operations Manager</Channel>
    <Computer>Corp-Webdev-01</Computer>
    <Security />
    </System>
    <EventData>
    <Data>MANAGEMENT_GROUP_NAME</Data>
    <Data>many</Data>
    <Data>many</Data>
    <Data>many</Data>
    <Data>Operations Manager</Data>
    <Data>0</Data>
    <Data>Unspecified error
    </Data>
    <Data>Corp-Webdev-01</Data>
    <Data>474</Data>
    </EventData>
    </Event>

    and

    Log Name: Operations Manager
    Source: Health Service Modules
    Date: 2/24/2011 8:23:45 AM
    Event ID: 26013
    Task Category: None
    Level: Warning
    Keywords: Classic
    User: N/A
    Computer: Corp-Webdev-01
    Description:
    The Operations Manager Event Log on computer 'Corp-Webdev-01' appears to have "wrapped" or been cleared while the Windows Event Log Provider was not active or behind in processing events. This error occurs when the provider is inactive for a period of time in which more events are logged than the event log can contain or the log is cleared. Some events were likely lost. To avoid this error in the future, make your event log larger or ensure that the agent service is not stopped for long periods.

    One or more workflows were affected by this.

    Workflow name: many
    Instance name: many
    Instance ID: many
    Management group: MANAGEMENT_GROUP_NAME
    Event Xml:
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
    <System>
    <Provider Name="Health Service Modules" />
    <EventID Qualifiers="0">26013</EventID>
    <Level>3</Level>
    <Task>0</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2011-02-24T13:23:45.000000000Z" />
    <EventRecordID>290124307</EventRecordID>
    <Channel>Operations Manager</Channel>
    <Computer>Corp-Webdev-01</Computer>
    <Security />
    </System>
    <EventData>
    <Data>MANAGEMENT_GROUP_NAME</Data>
    <Data>many</Data>
    <Data>many</Data>
    <Data>many</Data>
    <Data>Operations Manager</Data>
    <Data>30</Data>
    <Data>
    </Data>
    <Data>Corp-Webdev-01</Data>
    <Data>
    </Data>
    </EventData>
    </Event>

    The flood of 26008 errors locks up the server and makes it unresponsive. The events are written fast enough to trigger SCOM's disk performance alert. The last alert I've seen is this:
    Log Name: Operations Manager
    Source: HealthService
    Date: 2/24/2011 8:25:45 AM
    Event ID: 5401
    Task Category: Health Service
    Level: Warning
    Keywords: Classic
    User: N/A
    Computer: Corp-Webdev-01
    Description:
    Failed to replace parameter while creating the alert for monitor state change.

    Workflow: Microsoft.SystemCenter.HealthServiceModules.WindowsEventLog.CorruptOrUnreadableEvents
    Instance: Corp-Webdev-01
    Instance ID: 07906A27-2DCD-70AF-B9D8-F2E14C5A3C58
    Management Group: Apprise

    Failing replacement: $Data/Context/EventDescription$
    Event Xml:
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
    <System>
    <Provider Name="HealthService" />
    <EventID Qualifiers="32768">5401</EventID>
    <Level>3</Level>
    <Task>1</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2011-02-24T13:25:45.000000000Z" />
    <EventRecordID>290124329</EventRecordID>
    <Channel>Operations Manager</Channel>
    <Computer>Corp-Webdev-01</Computer>
    <Security />
    </System>
    <EventData>
    <Data>{3A626B80-20EA-45A7-2C22-CB97CF1A32E6}</Data>
    <Data>E5AAD154-6650-6499-AD47-7E32A81423B4</Data>
    <Data>Microsoft.SystemCenter.HealthServiceModules.WindowsEventLog.CorruptOrUnreadableEvents</Data>
    <Data>$Data/Context/EventDescription$</Data>
    <Data>07906A27-2DCD-70AF-B9D8-F2E14C5A3C58</Data>
    <Data>Apprise</Data>
    <Data>Microsoft.SystemCenter.HealthServiceModules.WindowsEventLog.CorruptOrUnreadableEvents</Data>
    <Data>Corp-Webdev-01</Data>
    </EventData>
    </Event>

    A search online reveals only one or two other people with this problem, and no resolutions. I thought there might be a bug in the Microsoft.SystemCenter.PowerManagement.Windows.Server.2008.PowerPlan.EventTriggerDiscovery workflow, but if that were the case, why isn't the other Windows Server 2008R2 x64 machine experiencing the same thing? To try to stop it for now I disabled the monitors and rules for the Windows Server 2008 R2 Power Plan on that server only, but that had no effect (I may have disabled the wrong part of it; I'm unsure). I also tried deleting the corrupted log file so it could be recreated. That worked only for a short while before the log became corrupt again.
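For anyone wanting to sift through a large export of these events rather than read them one by one, here is a rough Python sketch that pulls the event ID, level, timestamp and `<Data>` values out of a single event's XML. The sample is an abbreviated copy of the 26008 event above. The function name `summarize_event` is made up for illustration, and the namespace string in `NS` must match whatever appears in the `xmlns` attribute of the XML you actually export.

```python
import xml.etree.ElementTree as ET

# Namespace used by Windows event XML; adjust if your export shows a different xmlns value
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Abbreviated copy of the 26008 event from the post (some fields trimmed for brevity)
SAMPLE = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Health Service Modules" />
<EventID Qualifiers="0">26008</EventID>
<Level>2</Level>
<TimeCreated SystemTime="2011-02-24T05:29:02.000000000Z" />
<Computer>Corp-Webdev-01</Computer>
</System>
<EventData>
<Data>MANAGEMENT_GROUP_NAME</Data>
<Data>Microsoft.SystemCenter.PowerManagement.Windows.Server.2008.PowerPlan.EventTriggerDiscovery</Data>
</EventData>
</Event>"""


def summarize_event(xml_text):
    """Return the key fields of one event as a plain dict (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    system = root.find("e:System", NS)
    return {
        "event_id": int(system.find("e:EventID", NS).text),
        "level": int(system.find("e:Level", NS).text),
        "time": system.find("e:TimeCreated", NS).get("SystemTime"),
        # Empty <Data> elements have text None, so fall back to ""
        "data": [(d.text or "").strip() for d in root.findall("e:EventData/e:Data", NS)],
    }


summary = summarize_event(SAMPLE)
print(summary["event_id"], summary["time"], summary["data"][1])
```

Running the same function over every event in an export and counting the `event_id` values would show how much of the log is 26008 versus anything else.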


    Any help would be appreciated. This is happening on an important system at my company that we really need to be able to monitor with SCOM.


    Thanks,
    Bryan

  • #2
    Re: Deployment problem on Non-Domain Windows 2008R2 x64 server

    Check the NTLM authentication level for the workgroup Windows 2008 R2 machine. It will be NTLMv2; compare this setting with the server hosting SCOM. Also compare it with the Windows 2008 R2 server that doesn't have the issue. Is this server a member of the domain?

    Also, check for errors related to certificates and ensure that the workgroup server trusts the CA.
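As a rough aid for the comparison suggested above: the "Network security: LAN Manager authentication level" policy is stored in the `LmCompatibilityLevel` registry value under `HKLM\SYSTEM\CurrentControlSet\Control\Lsa`. The Python sketch below only maps the numeric levels to their policy names and flags hosts that differ from a reference host. The host names and values in `levels` are invented for illustration, and `describe_lm_level` and `mismatched_hosts` are hypothetical helper names. A missing value ("Not defined") means the OS default applies, which I believe is level 3 on Windows Server 2008 R2, but treat that as an assumption to verify.

```python
# Meanings of the LmCompatibilityLevel registry value, which backs the
# "Network security: LAN Manager authentication level" policy.
LM_LEVELS = {
    0: "Send LM & NTLM responses",
    1: "Send LM & NTLM - use NTLMv2 session security if negotiated",
    2: "Send NTLM response only",
    3: "Send NTLMv2 response only",
    4: "Send NTLMv2 response only. Refuse LM",
    5: "Send NTLMv2 response only. Refuse LM & NTLM",
}


def describe_lm_level(value):
    """Translate a registry value (or None when the value is absent) to text."""
    if value is None:
        return "Not defined (the OS default applies)"
    return LM_LEVELS.get(value, f"Unknown level {value}")


def mismatched_hosts(levels, reference):
    """Return the hosts whose configured level differs from the reference host's."""
    expected = levels[reference]
    return sorted(host for host, value in levels.items() if value != expected)


# Invented example values, NOT taken from the machines in this thread
levels = {"rms": 2, "workgroup-2008r2": None, "domain-2008r2": None}

for host, value in levels.items():
    print(f"{host}: {describe_lm_level(value)}")
print("Differs from RMS:", mismatched_hosts(levels, "rms"))
```

Note that "Not defined" on two machines only means neither has an explicit setting; the effective level can still differ if the OS defaults differ.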



    • #3
      Re: Deployment problem on Non-Domain Windows 2008R2 x64 server

      I checked the settings on both machines and attached screenshots of them. The 2003R2 machine is the SCOM RMS. The 2008R2 one is the workgroup machine having issues. I noticed its setting was blank, but couldn't find what the default value is. The third screenshot is identical to the non-working 2008R2 server's setting, "Not defined". The only difference between the two 2008R2 machines, other than hardware and domain membership, is that the non-working one is the Standard edition and the working one is the Datacenter edition, but I don't think that is the cause here.

      As for certificate errors, I know there wouldn't be any problems with those on the RMS or the CA server, as there is another workgroup server (a 2003R2 machine) being monitored with none of these issues.

      The only errors I can find that would be at all related and aren't in the SCOM log would be the following:
      Windows successfully diagnosed a low virtual memory condition. The following programs consumed the most virtual memory: MonitoringHost.exe (6152) consumed 15807692800 bytes, java.exe (1560) consumed 118063104 bytes, and McShield.exe (1752) consumed 74829824 bytes.

      That occurred a number of times on one of the occasions when SCOM started spamming errors, but it seems that was only because it was left in that state for longer than the other times (overnight; it wasn't discovered until early the next day). When I enable SCOM it doesn't start logging the errors above immediately. For a few minutes it logs a few events about the log being cleared or the provider being behind in processing, then it starts checking in regularly, updating the status and settings, running the periodic checking scripts, etc., as it normally would. Then at some point something happens and the errors above start. They're written far too quickly and fill up the log instantly, so I haven't been able to see the very beginning of it yet. At the current maximum log size of 15MB, the time stamps from the first entry to the last span only 9 seconds. I've set the log to not overwrite events, with the same size limit, but I can't turn SCOM back on on the server today because it's too close to the end of the day: the problem might not start until too late to be caught today, and performance on that server suffers when it happens, as the memory error demonstrates. I will try it early tomorrow and see if anything helpful appears.
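To put the figures in this post into more readable units, here is a small Python sketch of the arithmetic. The byte count comes from the low virtual memory message and the 15MB / 9 second figures from the paragraph above; treating "MB" as a plain ratio (no binary/decimal distinction) is a simplification.

```python
GIB = 2**30  # bytes per gibibyte

# Figure quoted in the low virtual memory event
monitoring_host_bytes = 15_807_692_800

# Log size limit and the time span it covered, as described in the post
log_size_mb = 15
fill_seconds = 9

monitoring_host_gib = monitoring_host_bytes / GIB
fill_rate_mb_per_s = log_size_mb / fill_seconds

print(f"MonitoringHost.exe: {monitoring_host_gib:.1f} GiB")  # 14.7 GiB
print(f"Log fill rate: {fill_rate_mb_per_s:.2f} MB/s")       # 1.67 MB/s
```

So MonitoringHost.exe was holding roughly 14.7 GiB of virtual memory, and the event log was being written at about 1.67 MB per second while the errors were flooding in.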


      • #4
        Re: Deployment problem on Non-Domain Windows 2008R2 x64 server

        I attached the log from when I re-enabled SCOM on Friday. It is heavily truncated, of course, but the missing part is filled entirely with more of the 26008 and 26007 errors and nothing else. All events indicate the agent was working well until the errors start with 26007.

        It's too close to be a coincidence and must be related: the whole time I was running it I was RDPed into the machine with a local administrator account, which also happens to be the action account for the SCOM agent. The event marking the logoff has the same time stamp, to the second, as the first 26007 event. This baffles me even more. The working Server 2003 R2 workgroup server is managed the same way (I know, different OS), so logging on as the action account doesn't break SCOM in every scenario.

        I'm going to try logging on to the one other SCOM-monitored 2008R2 server as its action account to see if the same thing happens when I log off. I'll also try starting the SCOM service on this problem server from an account that's not the action account, and will post back. I thought I'd put the log up here with some more explanation in case this new direction is a waste of time or someone else has another idea.
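The kind of correlation described above (which events share a timestamp with the first 26007) can be checked mechanically once events from several logs are exported into one list. The Python sketch below assumes rows of (timestamp, log name, event ID). The rows are invented sample data, not the real log. Event ID 4647 is the Security log ID for a user-initiated logoff, and whether that was the logoff event seen here is an assumption. `coincident_events` is a hypothetical helper name.

```python
from datetime import datetime

# Invented sample rows for illustration only: (timestamp, log name, event ID)
events = [
    ("2011-02-25 09:14:03", "Operations Manager", 1210),
    ("2011-02-25 09:41:57", "Security", 4647),
    ("2011-02-25 09:41:57", "Operations Manager", 26007),
    ("2011-02-25 09:41:58", "Operations Manager", 26008),
]


def coincident_events(rows, target_id):
    """Return (log, event ID) pairs logged in the same second as the first target_id event."""
    parsed = [
        (datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), log, eid)
        for ts, log, eid in rows
    ]
    # Earliest occurrence of the event we are interested in
    first = min(ts for ts, _, eid in parsed if eid == target_id)
    return [(log, eid) for ts, log, eid in parsed if ts == first and eid != target_id]


print(coincident_events(events, 26007))
```

With the sample rows this reports the Security log entry as the only event in the same second as the first 26007. A match like that shows timing, not cause, so it is only a pointer for where to look next.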

        Thanks for the help so far.


        • #5
          Re: Deployment problem on Non-Domain Windows 2008R2 x64 server

          I tested the logon/logoff as the action account on the other 2008R2 server and it had no effect. The agent on the problem server is still working without any problems so far; however, I'm still baffled as to why that causes such a problem.
