Archive for the 'Windows Server usage/configuration' Category

WinSvr 2012 R2 hanging due to event 129 from vsmraid: solved (for me)

I’ve been plagued by a problem where, after running for 3-4 days (sometimes a longer interval, sometimes shorter), the performance of my Windows Server 2012 R2 system would just tank until rebooted.  The event log (System) would fill with event 129 warnings from driver vsmraid (“Reset to device, \Device\RaidPortN, was issued”).

Lots of ineffective ideas and proposed solutions are on the web, so I’ll point you to what worked for me: set AHCI Link Power Management – HIPM/DIPM to “Active”, which disables AHCI link power management.

The problem is apparently that some devices, e.g., certain SSDs, don’t respond properly (or at all?) to Link Power Management (LPM) commands, yet the Intel RAID drivers (or firmware?) insist on sending them anyway.

To solve this you first change the registry so that the Power Options applet shows the AHCI Link Power Management settings, then you set the option to “Active”, which disables it (it means: let the device/link stay active and don’t try to send link power commands to it).  If that works, you win; if not, more drastic surgery is required: you set the registry to disable Link Power Management (aka “LPM”) entirely, for all devices.  I needed to do that.

Go to this excellent post by Sebastian Foss and follow steps 1 and 2.  Reboot and await results.  If that doesn’t solve your problem then follow step 3, which did it for me. (I didn’t do step 4.)

Here’s some more information: a question with discussion on TechNet, a tutorial with screenshots on how to enable the AHCI LPM power options in the Power Options applet, and a SuperUser (StackExchange) discussion of it.  Also an excellent post from the NT Debugging blog explaining storage timeouts and event 129.  It’s off on only one key point: the summing-up, where he says “I have never seen software cause an Event ID 129 error.”  That post is from 2011 and obviously predates this Intel LPM problem.

Hasn’t happened for two weeks now, so I’m declaring success.

P.S., here’s the information from Sebastian Foss’ post (linked above) just in case that post disappears:

I had several system freezes in Windows 10 Technical Preview (build 9926 – but I also had those freezes on earlier builds) on my Macbook Air 2013.
The System Event Log shows a warning with ID 129 from storahci: “Reset to device, \Device\RaidPort0, was issued.”

Seems to be some problem related to the SATA controller and the SSD (in my case an Apple/Samsung SM0128F).

I was able to fix the problem by editing several registry entries:

1. HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Power\PowerSettings\0012ee47-9041-4b5d-9b77-535fba8b1442\0b2d69d7-a2a1-449c-9680-f91c70521c60 and change the “Attributes” value from 1 (default; hidden) to 2 (exposed). [This will expose “AHCI Link Power Management – HIPM/DIPM” under Hard Disk power settings]

2. HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Power\PowerSettings\0012ee47-9041-4b5d-9b77-535fba8b1442\dab60367-53fe-4fbc-825e-521d069d2456 and change the “Attributes” value from 1 (default; hidden) to 2 (exposed). [This will expose “AHCI Link Power Management – Adaptive” under Hard Disk power settings]

Now you can edit the AHCI Link Power Management options in your power profiles. You can either set them to “Active” – or, as in my case, to HIPM (host-initiated; DIPM would be a device-initiated SATA bus power-down).
Those settings control the behavior of the SATA bus power state – they do not power down the device.

3. HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\storahci\Parameters\Device
Set NOLPM to * – this value contains several hardware IDs (vendor and device) for storage devices. Setting NOLPM to * disables LPM control messages to all storage devices.

4. I also set SingleIO to * – never had any freezes or storahci warnings again.

I hope this helps those who have also been looking for a solution for a long time.

BR – Sebastian Foss
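For convenience, here are steps 1–3 above expressed as PowerShell (my transcription, not Sebastian’s – run elevated, double-check the GUIDs against your own registry, and reboot afterwards):

    # Step 1: expose "AHCI Link Power Management - HIPM/DIPM" in Power Options
    Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\Power\PowerSettings\0012ee47-9041-4b5d-9b77-535fba8b1442\0b2d69d7-a2a1-449c-9680-f91c70521c60' -Name Attributes -Value 2

    # Step 2: expose "AHCI Link Power Management - Adaptive"
    Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\Power\PowerSettings\0012ee47-9041-4b5d-9b77-535fba8b1442\dab60367-53fe-4fbc-825e-521d069d2456' -Name Attributes -Value 2

    # Step 3 (only if setting the option to "Active" doesn't cure it):
    # disable LPM messages to all storage devices handled by storahci
    Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\storahci\Parameters\Device' -Name NOLPM -Value @('*') -Type MultiString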

ReFS disk scrubbing doesn’t play nice with other work on the same disks

[Update: found the way to schedule ReFS disk scrubbing, see the end of the post.]

ReFS has great data integrity features, especially when running it on top of a Windows Storage Spaces resilient volume (e.g., when mirrored).  You can set it to do full file content integrity, which it does by keeping checksums of everything that’s written and then periodically scrubbing the disk and comparing the actual contents read to the expected checksum.  If one mirror reads bad data and the other mirror is correct then ReFS will fix up the bad copy.  This is all great stuff!  Except when it isn’t, of course … (why can’t I have my cake and eat it too?)
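By the way, you can inspect or set the per-file integrity setting from PowerShell with the Get-FileIntegrity/Set-FileIntegrity cmdlets that ship with Server 2012 R2.  A minimal sketch (the path is just an example):

    # Is checksumming enabled for this file on the ReFS volume?
    Get-FileIntegrity -FileName 'R:\VMs\test.vhdx'

    # Turn it on (new files inherit the setting of their parent directory)
    Set-FileIntegrity -FileName 'R:\VMs\test.vhdx' -Enable $true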

Today I was experiencing extremely sucky disk performance and couldn’t figure out why.  (If you must know, μTorrent kept reporting “disk overloaded” even when download/upload speeds were fairly low.)  I remembered that in the past I’d occasionally have a day or two of the same sucky disk performance, but it would go away on its own.  This time, it was annoying enough that I wanted to find out what was wrong.

TL;DR version: Periodic disk scrubbing had kicked in and was running full throttle on the disk.

I investigated this way:  First I looked at the Resource Monitor for Disk usage.  It showed that a volume I wasn’t using was having continuous high traffic.  In fact, it showed that System (PID 4) was reading a 70GB tar file in a directory of 400GB of tar files that I never ever touch.  (It is an enormous repository of Java/C++ sources I acquired for a project I started and haven’t actually worked on in a long time.)  I then checked Windows Defender:  It wasn’t scanning, and MsMpEng.exe wasn’t using any CPU either.  I don’t know what sparked my thought process, but I finally googled for “ReFS disk integrity” and found a suggestion to check the Event Log Microsoft/Windows/DataIntegrityScan (under Applications and Services Logs).
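Incidentally, you can pull that log from PowerShell too.  A quick sketch – the log name is how it appears on my box, so verify with the -ListLog line first:

    # Find the DataIntegrityScan logs and see how many records they hold
    Get-WinEvent -ListLog *DataIntegrityScan* | Format-Table LogName, RecordCount

    # Dump the most recent scrubber events
    Get-WinEvent -LogName 'Microsoft-Windows-DataIntegrityScan/Admin' -MaxEvents 50 |
        Format-Table TimeCreated, Id, Message -Wrap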

Sure enough, it showed a scan of my ReFS volume had commenced in the early afternoon and was still going at 9PM.  Looking back in the log just a short way I found the last such scan ran 3 weeks ago and took 40 hours to complete!  (It’s a 4.3TB volume, striped as well as mirrored; I typically get sustained read speeds of ~170-180MB/s, and Resource Monitor was showing System reading this tar file at around 110MB/s.) (I also discovered, in the logs, events showing that if the scrub is interrupted by rebooting it continues after boot.  I did reboot a couple of times today in order to fix an issue with my Logitech mouse device driver.  (Don’t ask.) I don’t know if it restarts the scan from scratch or continues from where it was interrupted; I presume the latter.)

To be perfectly precise here, my problem may be that I have two volumes running on the same underlying Windows Storage pool, that is, on the same disks: the ReFS volume and an NTFS volume.  My μTorrent traffic is directed at the NTFS volume (so I can use smaller 4KB disk clusters, which play better with μTorrent).  It is possible that the scrubber would behave better if the ReFS volume were the only user of the underlying Storage Pool.  (But if that’s the issue, it is rather lame for an otherwise very well implemented feature.)

I can’t find any documentation or blog posts anywhere on the net that explain how to schedule these scrubs or how to make the scrubber throttle itself.

Update: The ReFS disk scrubber runs on a task schedule – see the Task Scheduler under Task Scheduler Library/Microsoft/Windows/Data Integrity Scan. Change the schedule to something that won’t impact you, or disable it altogether … but remember to run it manually before you get into trouble! I haven’t found a way to throttle it so that it can run slowly and steadily without impacting other work on the box.
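Here’s the PowerShell equivalent, roughly (the task path is as it appears on my machine; list it first to check yours):

    # Inspect the scrubber task and its last/next run times
    Get-ScheduledTask -TaskPath '\Microsoft\Windows\Data Integrity Scan\' | Get-ScheduledTaskInfo

    # Disable the periodic scrub (leave the crash-recovery task alone) ...
    Disable-ScheduledTask -TaskPath '\Microsoft\Windows\Data Integrity Scan\' -TaskName 'Data Integrity Scan'

    # ... and remember to run it by hand now and then
    Start-ScheduledTask -TaskPath '\Microsoft\Windows\Data Integrity Scan\' -TaskName 'Data Integrity Scan'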

And after I did this I still had problems with unresponsiveness … and much more often! I tracked that down to “Regular Maintenance”. I can’t tell you everything that goes on during “Regular Maintenance”, but at least part of it is the defragger, which, on a large volume, is terribly slow. I had to go to Task Scheduler Library/Microsoft/Windows/TaskScheduler and disable “Idle Maintenance”—because even though I configured it to stop when the computer was no longer idle, it just kept going and going and going. Also, I changed the schedule on “Regular Maintenance” so it happens a lot less often (like, every other weekend). And finally, I disabled “Maintenance Configurator”, because if you let that run it automagically resets your changes to the other maintenance tasks. (I forget where I read about that necessary fix; I wish I remembered so I could thank the guy.) I wish I knew whether any of the “maintenance” I’ve turned off is something I’ll miss later …
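Same surgery in PowerShell, if you prefer (task names as they appear on my Server 2012 R2 box – I can’t promise every build names them identically):

    # See what lives in the TaskScheduler folder
    Get-ScheduledTask -TaskPath '\Microsoft\Windows\TaskScheduler\'

    # Disable Idle Maintenance and the Maintenance Configurator
    # (the latter silently resets your changes to the others if left running)
    Disable-ScheduledTask -TaskPath '\Microsoft\Windows\TaskScheduler\' -TaskName 'Idle Maintenance'
    Disable-ScheduledTask -TaskPath '\Microsoft\Windows\TaskScheduler\' -TaskName 'Maintenance Configurator'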

Failed EFI boot with 0xC000000F and missing winload.efi, running native

Did you just get a Windows boot failure

  • on an EFI boot machine
  • missing file \Windows\System32\winload.efi
  • error code 0xC000000F
  • when you are running native (boot from VHD, VHDX)
  • and just deleted a differencing disk
  • but did not first delete the BCD entry that referred to the differencing disk?

If so … boot from a Windows setup USB stick/DVD/whatever, and use BCDEDIT to delete the boot entry that still refers to the differencing disk. Then you’re good to go.
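In outline, from the command prompt you get off the setup media (Shift+F10, or Repair your computer → Command Prompt) – {GUID} stands for whatever identifier bcdedit displays for the stale entry:

    rem List all entries; find the one whose device/osdevice names the deleted VHD(X)
    bcdedit /enum all

    rem Delete that entry by its identifier (braces included)
    bcdedit /delete {GUID}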

Apparently, with a sufficiently bad entry in the BCD store, you get a nasty catastrophic failure and don’t even get the choice to boot one of the other installed operating systems. But don’t succumb to a heart attack: correct it by booting from a different device (setting the BIOS boot order if necessary) and deleting the bad entry in the BCD.

(By the way, this superuser/stackoverflow page would have been a real help fixing more “normal” EFI boot problems if I hadn’t borked my machine in a particularly stupid way.)

ReFS on Windows 8.1/Server 2012 R2 and “ERROR 665” “The requested operation could not be completed due to a file system limitation”

ReFS on Windows Server 2012 R2/Windows 8.1 newly allows named streams (aka alternate data streams, ADS), but only up to a limit of 128KB.  If you copy a file with a named stream over this size limit to an ReFS volume, you will get ERROR 665 (0x00000299): The requested operation could not be completed due to a file system limitation.

I discovered this copying 2.3M+ files from backups to a new ReFS volume on a system running Server 2012 R2.  All but 5 files copied without error.  The five that failed, with error 665, were all IE “favorites” (i.e., dinky files with extension “.url” formatted like an INI file).  Nothing funky-looking in their names (like odd Unicode characters, not that that should have mattered) or file path length (not that that should matter either, for ReFS).  It took me a while to figure out—as of this writing there are no useful Google hits for this error number or message together with the string “ReFS”—and, besides, I believed that ReFS didn’t support named streams.  (But that limitation was lifted in Windows 8.1.)

Anyway, it turns out IE puts favicons in named streams, and some of them are over 128KB in length!  In my case, 5 out of thousands.

Since Windows’ CopyFileEx and similar APIs copy named streams transparently, the error message you receive from applications will have the file name, but not the stream name.
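If you’d rather find the offenders up front, PowerShell can enumerate named streams.  A sketch (the path is just an example; :$DATA is the unnamed main stream, which we skip):

    # List every named stream bigger than ReFS's 128KB ADS limit
    Get-ChildItem -Path 'C:\Users\Me\Favorites' -Recurse -File | ForEach-Object {
        Get-Item -LiteralPath $_.FullName -Stream * -ErrorAction SilentlyContinue
    } | Where-Object { $_.Stream -ne ':$DATA' -and $_.Length -gt 128KB } |
        Select-Object FileName, Stream, Length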

So this is one way to get a mysterious Error 665 “The requested operation could not be completed due to a file system limitation” when copying files to/restoring from backup to an ReFS file system.

(P.S.: Microsoft TechNet documentation on named streams on ReFS in Windows Server 2012 R2/Windows 8.1.)

Windows Server Storage Spaces: striped and mirrored with tiering requires 4 SSDs

I am building a new server and moving to Windows Storage Spaces with tiering (a nice Windows Server 2012 R2 feature).  The documentation is unclear (to me), and various web pages—tutorials on how to set up tiering—said that you needed the same number of SSDs as “columns” in your virtual disk configuration.  Other documentation/web pages referred to “columns” as the stripe set size (for example, here), and even the PowerShell cmdlet argument names lined up with that.

But it turns out that for tiering you need columns × data copies SSDs.  So if you want a RAID 10 (though of course Microsoft doesn’t call it that), where you’re striped (columns = 2) and mirrored (number of data copies = 2), you need 4 SSDs, not 2.
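In PowerShell terms, the kind of virtual disk I’m talking about looks roughly like this (pool and tier names are made up; sizes are placeholders).  With a mirror (2 data copies) and 2 columns, each SSD holds one column of one copy – hence 2 × 2 = 4 SSDs:

    # Define the two tiers in the pool
    $ssd = New-StorageTier -StoragePoolFriendlyName 'Pool' -FriendlyName 'SSDTier' -MediaType SSD
    $hdd = New-StorageTier -StoragePoolFriendlyName 'Pool' -FriendlyName 'HDDTier' -MediaType HDD

    # Striped (2 columns) + mirrored (2 data copies) tiered virtual disk
    New-VirtualDisk -StoragePoolFriendlyName 'Pool' -FriendlyName 'Tiered' `
        -StorageTiers $ssd,$hdd -StorageTierSizes 100GB,3TB `
        -ResiliencySettingName Mirror -NumberOfDataCopies 2 -NumberOfColumns 2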

Oh, and by the way: when I was unable to create this kind of virtual disk (via PowerShell; you can’t do it at all via the New Virtual Disk wizard), the error message was somewhat unhelpful.  It certainly told me I didn’t have enough physical disks to complete the operation, but it forgot to tell me which tier (SSD or HDD) was the source of the problem!

(So … I immediately ordered another 2 SSDs and, of necessity, an add-in SATA controller card (because I had already run out of motherboard SATA ports). I think I’m just going to use a tip I found somewhere on the web (don’t remember where or I’d provide a link) and just velcro my 4 SSDs together and lay them on the bottom of the case.)

(Also, FYI, I’m using 4x 64GB SSDs and 4x 4TB HDDs, and the ReFS file system.  Without going to the trouble of measuring performance, I’m just going to go ahead and specify a write-back cache size of 20GB, overriding the default 1GB, because I’m going to be copying a lot of large VHDs around and I’d rather have the copy complete quickly and then trickle down to the HDDs in its own good time than wait for it.  So I hope this works.)

Update: I did get this working.  And, as I finally configured it, performance is fine: great read performance and good write performance.  I suppose I could get better performance (5%? 15%?) from a hardware RAID controller, but I don’t need that last bit of oomph and I don’t want to be tied to a particular hardware RAID manufacturer.  So I’m happy with the way Windows Storage Spaces is working here.  However – n.b.: Write performance totally sucked (*) until I figured out that I needed to set the virtual disk interleave so that a full stripe (columns × interleave) matches the expected file system cluster size.  Thus, with 2 columns, I set the interleave to 32KB on the virtual disk I created to hold an ReFS file system (which always has a 64KB cluster size: 2 × 32KB = 64KB), and to 16KB on the virtual disk I created to hold an NTFS file system created with a 32KB cluster size.
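Concretely, a sketch of that final configuration (same made-up pool/tier names as above; I believe -Interleave is accepted alongside -StorageTiers, but verify on your build):

    # ReFS disk: 2 columns x 32KB interleave = 64KB full stripe = ReFS's fixed cluster size
    New-VirtualDisk -StoragePoolFriendlyName 'Pool' -FriendlyName 'RefsDisk' `
        -StorageTiers $ssd,$hdd -StorageTierSizes 50GB,2TB `
        -ResiliencySettingName Mirror -NumberOfColumns 2 -Interleave 32KB `
        -WriteCacheSize 20GB
    # ... then initialize/partition and Format-Volume -FileSystem ReFS

    # NTFS disk: 2 columns x 16KB interleave = 32KB full stripe;
    # format with Format-Volume -FileSystem NTFS -AllocationUnitSize 32KB to match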

(*) Factor of 8 to 10!!