Archive for the 'Troubleshooting' Category

WinSvr 2012 R2 hanging due to event 129 from vsmraid: solved (for me)

I’ve been plagued by a problem where, after running for 3-4 days (sometimes longer, sometimes shorter), the performance of my Windows Server 2012 R2 system would tank until I rebooted. The System event log would fill with event 129 entries from the vsmraid driver.

Lots of ineffective ideas and proposed solutions are on the web, so I’ll point you to what worked for me: disabling AHCI link power management. The first thing to try is setting “AHCI Link Power Management – HIPM/DIPM” to “Active”, which turns link power management off.

The problem is apparently that some devices, e.g., certain SSDs, don’t respond properly (or at all?) to Link Power Management (LPM) commands, yet the Intel RAID drivers (or firmware?) insist on sending them anyway.

To solve this, you first change the registry so that the Power Settings applet shows the AHCI Link Power Management options, then you set the option to “Active”, which disables it (it means: let the device/link stay active and don’t try to send link power commands to it). If that works, you win. If not, more drastic surgery is required: you set the registry to totally disable Link Power Management (aka “LPM”) for all devices. I needed to do that.

Go to this excellent post by Sebastian Foss and follow steps 1 and 2.  Reboot and await results.  If that doesn’t solve your problem then follow step 3, which did it for me. (I didn’t do step 4.)

Here’s some more information: a question with discussion on TechNet, a tutorial with screenshots on how to enable the AHCI LPM power options in the Power applet, and a SuperUser (StackExchange) discussion of it. Also an excellent post from the NT Debugging blog explaining storage timeouts and event 129. It’s off on only one key point, in the summary, where he says “I have never seen software cause an Event ID 129 error.” Obviously, that post from 2011 predates this Intel LPM problem.

Hasn’t happened for two weeks now, so I’m declaring success.

P.S., here’s the information from Sebastian Foss’ post (linked above) just in case that post disappears:

I had several system freezes in Windows 10 Technical Preview (build 9926 – but I also had those freezes on earlier builds) on my Macbook Air 2013.
System Event-Log shows a warning for ID 129, storahci, Reset to device, \Device\RaidPort0, was issued.

Seems to be some problem related to the SATA-Controller and the SSD (In my case Apple/Samsung SM0128F)

I was able to fix the problem by editing several registry entries:

1. HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Power\PowerSettings\0012ee47-9041-4b5d-9b77-535fba8b1442\0b2d69d7-a2a1-449c-9680-f91c70521c60 and change the “Attributes” key value from 1 (default; hidden) to 2 (exposed). [This will expose “AHCI Link Power Management – HIPM/DIPM” under Hard Disk power settings]

2. HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Power\PowerSettings\0012ee47-9041-4b5d-9b77-535fba8b1442\dab60367-53fe-4fbc-825e-521d069d2456 and change the “Attributes” key value from 1 (default; hidden) to 2 (exposed). [This will expose “AHCI Link Power Management – Adaptive” under Hard Disk power settings]

Now you can edit AHCI Link Power Management options in your power profiles. You can either set them to “active” – or in my case I set them to HIPM. (Host-initiated) (While DIPM would be a device initiated sata bus power down).
Those settings control the behavior of the sata bus power state – they do not power down the device.

3. HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\storahci\Parameters\Device
Set NOLPM to * – those keys contain several hardware ID’s (vendor and device) for storage devices. Setting NOLPM to * disables LPM control messages to any storage device.

4. I also set SingleIO to * – never had any freezes or storahci warnings again.

I hope this helps those who have also been looking for a solution for a long time.

BR – Sebastian Foss
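
For what it's worth, steps 1 and 2 above can be sketched in a few lines of Python. This is only a minimal sketch: it builds the text of a .reg file that sets “Attributes” to 2 for the two GUIDs quoted in the post, and it does not touch the registry itself. The function name build_reg_file is my own invention for illustration, and I'm assuming “Attributes” is a DWORD value. Read the generated text before importing it.

```python
# Registry path and GUIDs are copied from steps 1 and 2 of the quoted post.
POWER_SETTINGS = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Power\PowerSettings"
)
DISK_SUBGROUP = "0012ee47-9041-4b5d-9b77-535fba8b1442"
LPM_SETTINGS = {
    "0b2d69d7-a2a1-449c-9680-f91c70521c60": "AHCI Link Power Management - HIPM/DIPM",
    "dab60367-53fe-4fbc-825e-521d069d2456": "AHCI Link Power Management - Adaptive",
}


def build_reg_file(attributes=2):
    """Return .reg text that sets Attributes on both LPM power settings.

    Hypothetical helper: 2 exposes the setting in the power applet,
    1 hides it again (the default, per the post). Assumes a DWORD value.
    """
    lines = ["Windows Registry Editor Version 5.00", ""]
    for guid, label in LPM_SETTINGS.items():
        lines.append(f"; {label}")  # ';' starts a comment in .reg files
        lines.append(f"[{POWER_SETTINGS}\\{DISK_SUBGROUP}\\{guid}]")
        lines.append(f'"Attributes"=dword:{attributes:08x}')
        lines.append("")
    return "\r\n".join(lines)
```

Calling print(build_reg_file()) shows the text. Saved to a file with a .reg extension, it can be reviewed and then imported by double-clicking it, after which the two AHCI options should appear under the Hard Disk power settings as the post describes.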

Failed EFI boot with 0xC000000F and missing winload.efi, running native

Did you just get a Windows boot failure

  • on an EFI boot machine
  • missing file \Windows\System32\winload.efi
  • error code 0xC000000F
  • when you are running native (boot from VHD, VHDX)
  • and just deleted a differencing disk
  • but did not first delete the BCD entry that referred to the differencing disk?

If so … boot from a Windows setup USB stick/DVD/whatever, and use BCDEDIT to delete the boot entry that still refers to the differencing disk. Then you’re good to go.

Apparently with a sufficiently bad entry in the BCD store you get a nasty catastrophic failure and don’t even get a choice to boot from one of the other installed operating systems. But don’t succumb to a heart attack. Correct it by booting from a different device (setting the BIOS boot order if necessary) and deleting the bad entry in the BCD.
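
If the BCD store has a lot of entries, finding the stale one by eye is tedious. Here is a rough Python sketch of the idea: parse the text that bcdedit /enum prints, pick out entries whose device or osdevice mentions the deleted VHD, and print the bcdedit /delete command for each. The SAMPLE text is hand-written in the general shape of that output, not captured from a real machine. The identifiers, paths, and function names are all made up for illustration, and the parser assumes entries are separated by blank lines.

```python
# Hand-written sample in the general shape of `bcdedit /enum` output.
# NOT captured from a real machine; identifiers and paths are placeholders.
SAMPLE = r"""
Windows Boot Loader
-------------------
identifier              {11111111-1111-1111-1111-111111111111}
device                  vhd=[C:]\vhd\base.vhdx
osdevice                vhd=[C:]\vhd\base.vhdx
description             Base install

Windows Boot Loader
-------------------
identifier              {22222222-2222-2222-2222-222222222222}
device                  vhd=[C:]\vhd\diff.vhdx
osdevice                vhd=[C:]\vhd\diff.vhdx
description             Differencing disk
"""


def parse_bcd_entries(text):
    """Split the listing into one dict per entry (assumes blank-line separators)."""
    entries = []
    for block in text.replace("\r\n", "\n").strip().split("\n\n"):
        lines = [ln for ln in block.splitlines() if ln.strip()]
        if len(lines) < 2:
            continue
        entry = {"type": lines[0].strip()}
        for ln in lines[2:]:  # skip the title and its dashed underline
            key, _, value = ln.partition(" ")
            entry[key] = value.strip()
        entries.append(entry)
    return entries


def entries_referencing(entries, vhd_name):
    """Return entries whose device or osdevice mentions the given VHD file."""
    needle = vhd_name.lower()
    return [
        e for e in entries
        if needle in e.get("device", "").lower()
        or needle in e.get("osdevice", "").lower()
    ]


def delete_command(entry):
    """Build (but do not run) the bcdedit command that removes this entry."""
    return f"bcdedit /delete {entry['identifier']}"


for stale in entries_referencing(parse_bcd_entries(SAMPLE), "diff.vhdx"):
    print(delete_command(stale))
```

The sketch deliberately only prints the command rather than running it, so you can check the identifier against the real listing first.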

(By the way, this superuser/stackoverflow page would have been a real help fixing more “normal” EFI boot problems if I hadn’t borked my machine in a particularly stupid way.)

Bad cables can masquerade as other errors

This is just a reminder:  A bad cable can masquerade as other errors.

In this case I was building a new server system – new motherboard, new drives, etc., keeping only the splendid case. I had one particular new 4TB HDD that would drop out of a Windows Storage Spaces pool (“Retired”, as if it had done some hard work and was now taking a well-earned rest), sometimes within 10 minutes after booting, sometimes after an hour.

The event log before one of these dropouts would show a few bad commands, followed by a “bad block” error.

A S.M.A.R.T. diagnostic utility showed excessive errors in a couple of categories, none of them bad blocks.  One such category was “command timeouts” which was a clue, but I didn’t know how to interpret it.

Anyway, it was a brand new disk.  And I buy high-capacity HDDs all the time (I have nearly 50) yet I’ve never had any die of infant mortality.  So I tried moving the disk to a different port on the same brand-new motherboard controller (bad socket?), moving it to an add-in card controller (of the same type, Marvell) (bad motherboard chip?), moving it to a different port on a different motherboard controller (Intel chipset this time, bad driver?).  Failed each time!  So I ordered a new drive (same-day delivery!) and tried it…and it failed too in the same way!

Racked my brain…and finally…swapped cables with an adjacent hard drive in the same enclosure. Now the other drive failed!

So it was the cable. Replaced the cable and all works fine. I’ve copied 8TB of data onto the new Storage Spaces array with no issues now.

I’ve never had a bad (internal case) cable before either.  And these were new cables.

Update 2014-09-09:  And bad network cables can make your PC connect at 100Mbps instead of 1Gbps!  This just happened to a colleague at work.

ReFS on Windows 8.1/Server 2012 R2 and “ERROR 665” “The requested operation could not be completed due to a file system limitation”

ReFS on Windows Server 2012 R2/Windows 8.1 newly allows named streams (aka ADS) but only up to a limit of 128Kbytes.  If you copy, to an ReFS volume, a file with a named stream over this size limit you will get ERROR 665 (0x00000299): The requested operation could not be completed due to a file system limitation.

I discovered this copying 2.3M+ files from backups to a new ReFS volume on a system running Server 2012 R2. All but 5 files copied without error. The five that failed, with error 665, were all IE “favorites” (i.e., dinky files with extension “.url” formatted like an INI file). Nothing funky-looking in their names (like odd Unicode characters, not that that should have mattered) or filepath length (not that that should matter either, for ReFS). Took me a while to figure it out: as of this writing there are no useful Google hits for this error number or message with the string “ReFS”, and I also believed that ReFS didn’t support named streams. (But that limitation was lifted in Windows 8.1.)

Anyway, it turns out IE puts favicons in named streams and some of them are over 128Kbytes in length! In my case, 5 out of thousands.

Since Windows’ CopyFileEx and similar APIs copy named streams transparently, the error message you receive from applications will include the file name but not a stream name.

So this is one way to get a mysterious Error 665 “The requested operation could not be completed due to a file system limitation” when copying files to/restoring from backup to an ReFS file system.
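
To find the offending files on the source volume before copying, something like the following Python sketch could help. It enumerates a file's streams with the Win32 FindFirstStreamW/FindNextStreamW calls via ctypes and flags named streams above the limit. The function names are mine, and treating the limit as exactly 128 * 1024 bytes, with anything larger failing, is my assumption about how the 128Kbytes figure applies. The enumeration part runs only on Windows.

```python
import ctypes
import sys

REFS_STREAM_LIMIT = 128 * 1024  # assumed byte value of the 128Kbytes limit


class WIN32_FIND_STREAM_DATA(ctypes.Structure):
    # Layout of the Win32 structure: LARGE_INTEGER size, then WCHAR[MAX_PATH + 36]
    _fields_ = [
        ("StreamSize", ctypes.c_longlong),
        ("cStreamName", ctypes.c_wchar * (260 + 36)),
    ]


def list_streams(path):
    """Return [(stream_name, size_in_bytes)] for one file (Windows only)."""
    if sys.platform != "win32":
        raise OSError("named streams can only be enumerated on Windows")
    k32 = ctypes.WinDLL("kernel32", use_last_error=True)
    k32.FindFirstStreamW.restype = ctypes.c_void_p
    k32.FindFirstStreamW.argtypes = [
        ctypes.c_wchar_p, ctypes.c_int, ctypes.c_void_p, ctypes.c_uint32,
    ]
    k32.FindNextStreamW.restype = ctypes.c_int
    k32.FindNextStreamW.argtypes = [ctypes.c_void_p, ctypes.c_void_p]
    k32.FindClose.argtypes = [ctypes.c_void_p]

    data = WIN32_FIND_STREAM_DATA()
    # Info level 0 is FindStreamInfoStandard; the flags argument must be 0.
    handle = k32.FindFirstStreamW(path, 0, ctypes.byref(data), 0)
    if handle is None or handle == ctypes.c_void_p(-1).value:
        raise ctypes.WinError(ctypes.get_last_error())
    streams = []
    try:
        while True:
            streams.append((data.cStreamName, data.StreamSize))
            if not k32.FindNextStreamW(handle, ctypes.byref(data)):
                break
    finally:
        k32.FindClose(handle)
    return streams


def oversized_named_streams(streams, limit=REFS_STREAM_LIMIT):
    """Keep named streams above the limit; '::$DATA' is the main file content."""
    return [(name, size) for name, size in streams
            if name != "::$DATA" and size > limit]
```

Running oversized_named_streams(list_streams(path)) over each file in the backup set (with os.walk, say) would give a list of the files to fix or skip before the copy, with the stream names that the copy error leaves out.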

(P.S.: Microsoft TechNet documentation on named streams on ReFS in Windows Server 2012 R2/Windows 8.1.)