Archive for the 'Windows Programming' Category

Breaking a Windows command line into separate arguments, respecting quotes and backslashes

I went on a side track recently and discovered the strangely intricate world of breaking a Windows command line into arguments.  That is, how do you do Windows command line lexing?  (By established convention, command line parsing refers to interpreting arguments as options to programs: interpreting flags, collecting file names, handling missing required arguments, etc.)

TL;DR: I wrote a fully tested C# library to do this and it is on github and nuget for your public domain amusement.  (It’s also on symbolsource.org, for your source debugging needs, but I can’t get it to work in my VS2013 environment … let me know if it works for you.)

Most of the time there’s no need to worry about breaking up a command line into arguments.  Your C/C++ program gets them pre-lexed as arguments to main(): the well known argv and argc, handled by your compiler’s runtimes. And your C# program gets a string[] args array, handled by the .NET assembly launcher. And for most occasions, that’s sufficient.

But maybe it isn’t. For example, I was trying to use Clang’s libclang to process some C++ source code. An excellent resource if you want your C++ lexed, parsed, and indexed. But to get it going you’ve got to pass compiler command line arguments to the function which parses a translation unit. Those arguments must include all the include directories, preprocessor symbol definitions, and everything else that you’d ordinarily pass to your compiler (in clang’s case, these are normally gcc’s options). A lot of times these are build into makefile macros or even more difficult to reach locations—like inside of Visual Studio’s project files.

For my purposes I wanted to grab them from MSBuild logfiles so I could get the actual command lines as seen by Visual C++. And that meant, I needed to lex a command line into arguments.

So that turns out to be intricate, as I said above. The key issue is caused by a…unfortunate design choice?…mistake?…that dates back to MS-DOS/PC-DOS 2.0: The use of the backslash as the directory separator character in a path string. Since in C and C-derived languages (and many other languages) the backslash is used as an escape character in a double-quoted string literal, and since paths containing backslashes are often passed as arguments to programs, and since those paths are frequently in double-quoted arguments (to protect blanks and other special characters) there’s a conflict that leads to confusing interactions between quoted arguments and escaped characters and path strings.

In this article on MSDN, Parsing C++ Command Line Arguments, Microsoft describes the rules: note the special cases for even or odd sets of backslashes immediately followed by a double quote character, versus a set of backslashes not so followed. But it’s more complex than that. There is a special rule for backslashes at the end of the string. There is special handling of the first (“zeroth”) argument on the command line: The executable path. The rules changed slightly in 2008. And some programs don’t use Visual C++’s runtime to lex arguments, they use the Windows API CommandLineToArgvW to do it—and wouldn’t you know, it handles things slightly differently.

I ended up writing a C# library that lexed arguments, letting you choose between the Visual C++ way of doing it or the CommandLineToArgvW way of doing it. There are also routines for “requoting” arguments properly so that you can form them back up into a command line. (I haven’t done globbing yet, but that’s coming.) I’ve put it on github (with a public domain license, so party on) and it’s on nuget as well. Bug reports, discussion, praise is all cheerfully accepted at the github site (or as comments here).

Naturally, I didn’t figure out the crafty little details myself. I relied on a reports written by a bunch of people who got there first. And, here are links to that work, which were quite useful to me:

 

 

Does enumerating files with FSCTL_ENUM_USN_DATA ever miss any files?

Short answer: No and yes.

I’ve been playing with enumerating all the files on a volume using the FSCTL_ENUM_USN_DATA IOCTL, which reads the MFT.

It is supposed to be very fast, especially compared with other methods such as FindFirst/FindNext.  (There’s a great forum posting on it here, “Reading MFT, which also contains links to posted source code here, and several posts by the same author here, all of which together make a great starting point.)

However, a question is quickly raised when you’re trying this out:  Does the enumerating files through the MFT this way miss any?  If you try a different traversal technique, e.g., using the .NET Framework APIs Directory.GetDirectories() and Directory.GetFiles() you’ll get a different number of results.

In fact, on my system (running Windows Server 2008 R2) I found 1419 directories and files with the MFT enumeration that weren’t listed with the .NET API traversal, and an astonishing (to me) 19199 extra directories and files with the .NET API traversal than with the MFT traversal.  What’s going on with all those “missed” files?  Did I have a bug in my MFT enumeration code?

No.  The answer is: Hard links.  (And symbolic links.)

When scanning the MFT with FSCTL_ENUM_USN_DATA you see each directory and file once and only once, no matter how many directory entries point to it.  For example, on my system, traversing C: with the .NET APIs returns 6 files named “write.exe”, but the MFT enumeration has only 2.

In fact, by using the command “fsutil hardlink list c:Windowswrite.exe” I see that that single file has four names:

(The other two instances of “write.exe” are a single separate file that has two links to it.)

I had no idea that in a standard installation there were so many hard links used. In fact, it even seems that some application installers create multiple hard links to the same file (e.g., MiKTeK).

And that explains nearly all of the files “missed” by the MFT enumeration.

Depending on the reason that you’re enumerating directories and files on a volume, this may or may not be an issue for you.  Actually, likely, it is an issue for you and you may need to resolve it by traversing the directory using FindFirst/FindNext or the .NET APIs and reconciling the two collections.  (Given a filename you can use the Find{First/Next}FileName functions to get all the names of a given file, i.e., all of the names of the hard links to the file. But it may be expensive to use this on every file just to find the ones that have multiple links.  Reconciling with the other kind of traversal may be the better bet – I have yet to measure this.)

On the other hand MFT enumeration does find files that the .NET traversal does not. There are a large number of files under WindowsSystem32 that are returned on the MFT enumeration but not the .NET traversal.  I’m not sure why; it doesn’t appear to be security related.  The .NET enumeration won’t descend past reparse points like “Documents and Settings”, but MFT enumeration won’t descend into directory mount points (FindFirst/FindNext will go through mount points, and I haven’t tried .NET enumeration on that yet.) The MFT enumeration also returns the directory “System Volume Information” and some files under it.  And it also returns directories and files related to the Transactional Resource Manager, namely the directory “$RmMetadata” and its contents.

(The latter is the cause of some minor coding confusion:  The MFT metadata files, e.g., $Bitmap and $Quota, are not returned by the MFT enumeration—and that includes the directory $Extend, which is the parent of $RmMetadata.  So when you’re assembling path names from the entries returned in your MFT enumeration you’ll have to account for the fact that $RmMetadata’s parent isn’t going to be in your collection of directories.)

Is the PE Attribute Certificate Table octoword-aligned or octobyte-aligned?

Looking at the Microsoft Portable Executable and Commmon Object File Specification rev 8.2 (Sept 21, 2010)which is the definition of the PE file formatI’m confused about the alignment of the entries in the Attribute Certificate Table.

  • Page 58, section 4.7: “The attribute certificate table is composed of a set of contiguous, octaword-aligned attribute certificate entries.” 
  • Page 59, first paragraph: “Subsequent entries are accessed by advancing that entry’s dwLength bytes, rounded up to an 8-byte multiple, …”
  • Page 59, algorithm step 2: “Round the value from step 1 up to the nearest 8-byte multiple …”
  • Page 59, algorithm step 3: “… and round up to the nearest 8-byte multiple …”
  • Page 60, last paragraph before the bullets: “If the bCertificate does not end on an octaword boundary, the attribute certificate table is padded with zeros, from the end of the bCertificate to the octaword boundary.”

So the documentation is confused.  But, it also clearly says, on page 59, “If the sum of the rounded dwLength values does not equal the Size value, then either the attribute certificate table or the Size field is corrupted.”  And on my sample, signed, executable, the Size field is a multiple of 8 but not 16, and WinVerifyTrust() says that the executable is authentic (and of course the loader will load and execute it).

So on the basis of this experimental evidence (one sample) I think we can conclude that the Attribute Certificate Table is octobyte aligned, not octoword aligned.