Breaking a Windows command line into separate arguments, respecting quotes and backslashes

I went on a side track recently and discovered the strangely intricate world of breaking a Windows command line into arguments.  That is, how do you do Windows command line lexing?  (By established convention, command line parsing refers to interpreting arguments as options to programs: interpreting flags, collecting file names, handling missing required arguments, etc.)

TL;DR: I wrote a fully tested C# library to do this and it is on github and nuget for your public domain amusement.  (It’s also on symbolsource.org, for your source debugging needs, but I can’t get it to work in my VS2013 environment … let me know if it works for you.)

Most of the time there’s no need to worry about breaking up a command line into arguments.  Your C/C++ program gets them pre-lexed as arguments to main(): the well known argv and argc, handled by your compiler’s runtimes. And your C# program gets a string[] args array, handled by the .NET assembly launcher. And for most occasions, that’s sufficient.

But maybe it isn’t. For example, I was trying to use Clang’s libclang to process some C++ source code. An excellent resource if you want your C++ lexed, parsed, and indexed. But to get it going you’ve got to pass compiler command line arguments to the function which parses a translation unit. Those arguments must include all the include directories, preprocessor symbol definitions, and everything else that you’d ordinarily pass to your compiler (in clang’s case, these are normally gcc’s options). A lot of the time these are built into makefile macros or even more difficult to reach locations—like inside of Visual Studio’s project files.

For my purposes I wanted to grab them from MSBuild logfiles so I could get the actual command lines as seen by Visual C++. And that meant I needed to lex a command line into arguments.

So that turns out to be intricate, as I said above. The key issue is caused by a…unfortunate design choice?…mistake?…that dates back to MS-DOS/PC-DOS 2.0: The use of the backslash as the directory separator character in a path string. Since in C and C-derived languages (and many other languages) the backslash is used as an escape character in a double-quoted string literal, and since paths containing backslashes are often passed as arguments to programs, and since those paths are frequently in double-quoted arguments (to protect blanks and other special characters) there’s a conflict that leads to confusing interactions between quoted arguments and escaped characters and path strings.

In this article on MSDN, Parsing C++ Command Line Arguments, Microsoft describes the rules: note the special cases for even or odd sets of backslashes immediately followed by a double quote character, versus a set of backslashes not so followed. But it’s more complex than that. There is a special rule for backslashes at the end of the string. There is special handling of the first (“zeroth”) argument on the command line: The executable path. The rules changed slightly in 2008. And some programs don’t use Visual C++’s runtime to lex arguments, they use the Windows API CommandLineToArgvW to do it—and wouldn’t you know, it handles things slightly differently.
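To make the even/odd backslash rule concrete, here is a minimal sketch of a lexer that applies only the basic documented rules: a run of 2n backslashes followed by a double quote yields n backslashes and the quote acts as a delimiter, a run of 2n+1 backslashes followed by a double quote yields n backslashes and a literal quote, and a run not followed by a quote is taken literally. The class and method names are hypothetical, and this is not the library's code. It deliberately ignores the special cases mentioned above (the zeroth argument, the 2008 change, and the CommandLineToArgvW differences).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical illustration only; not the API of the library described in this post.
public static class CommandLineSketch
{
    public static List<string> Split(string commandLine)
    {
        var args = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false; // currently inside a double-quoted region
        bool inArg = false;    // an argument has been started (it may be empty, e.g. "")
        int i = 0;

        while (i < commandLine.Length)
        {
            char c = commandLine[i];
            if (c == '\\')
            {
                // Measure the whole run of backslashes first.
                int start = i;
                while (i < commandLine.Length && commandLine[i] == '\\') i++;
                int count = i - start;

                if (i < commandLine.Length && commandLine[i] == '"')
                {
                    // Followed by a quote: every pair becomes one backslash.
                    current.Append('\\', count / 2);
                    if (count % 2 == 1)
                        current.Append('"');   // odd run: the quote is literal
                    else
                        inQuotes = !inQuotes;  // even run: the quote is a delimiter
                    i++; // consume the quote
                }
                else
                {
                    // Not followed by a quote: the backslashes are literal.
                    current.Append('\\', count);
                }
                inArg = true;
            }
            else if (c == '"')
            {
                inQuotes = !inQuotes;
                inArg = true;
                i++;
            }
            else if ((c == ' ' || c == '\t') && !inQuotes)
            {
                // Unquoted whitespace ends the current argument, if any.
                if (inArg)
                {
                    args.Add(current.ToString());
                    current.Clear();
                    inArg = false;
                }
                i++;
            }
            else
            {
                current.Append(c);
                inArg = true;
                i++;
            }
        }

        if (inArg) args.Add(current.ToString());
        return args;
    }
}
```

Under these simplified rules, the input a\\\"b c d splits into three arguments: a\"b, c, and d. The input "ab\"c" "\\" d splits into ab"c, a single backslash, and d.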

I ended up writing a C# library that lexed arguments, letting you choose between the Visual C++ way of doing it or the CommandLineToArgvW way of doing it. There are also routines for “requoting” arguments properly so that you can form them back up into a command line. (I haven’t done globbing yet, but that’s coming.) I’ve put it on github (with a public domain license, so party on) and it’s on nuget as well. Bug reports, discussion, and praise are all cheerfully accepted at the github site (or as comments here).
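As a rough illustration of what “requoting” involves, the sketch below quotes a single argument so that a lexer following the basic Visual C++ backslash rules would read it back unchanged. The names are hypothetical and this is not the library's implementation. It assumes only spaces, tabs, and double quotes force quoting, and it does not attempt to handle the zeroth argument or cmd.exe metacharacters.

```csharp
using System;
using System.Text;

// Hypothetical illustration only; not the API of the library described in this post.
public static class RequoteSketch
{
    public static string Quote(string arg)
    {
        // Non-empty arguments without whitespace or quotes pass through unchanged.
        if (arg.Length > 0 && arg.IndexOfAny(new[] { ' ', '\t', '"' }) < 0)
            return arg;

        var sb = new StringBuilder();
        sb.Append('"');
        int backslashes = 0; // length of the pending run of backslashes

        foreach (char c in arg)
        {
            if (c == '\\')
            {
                backslashes++;
            }
            else if (c == '"')
            {
                // Double the pending run, then add one more backslash to escape the quote.
                sb.Append('\\', backslashes * 2 + 1);
                sb.Append('"');
                backslashes = 0;
            }
            else
            {
                // A run not followed by a quote is emitted unchanged.
                sb.Append('\\', backslashes);
                sb.Append(c);
                backslashes = 0;
            }
        }

        // A trailing run sits right before the closing quote, so it must be doubled.
        sb.Append('\\', backslashes * 2);
        sb.Append('"');
        return sb.ToString();
    }
}
```

For example, the argument C:\my dir\ becomes "C:\my dir\\" (the trailing backslash is doubled so it does not escape the closing quote), and say "hi" becomes "say \"hi\"".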

Naturally, I didn’t figure out the crafty little details myself. I relied on reports written by a bunch of people who got there first. And here are links to that work, which were quite useful to me:

 

 

Biggest mistake in C#: That strings can be null

I really like C#.  It could be better by adding lots of my favorite things … but as it stands it is very useable, very expressive, very readable.  And it has only one major mistake (IMO):  Strings (variables, parameters, fields, etc.) can be null.

Oh my, how many coding errors have been made by forgetting strings could be null?  How many crashes have users suffered?  Oh well.

Anyway, here’s a brief proposal on how to correct the problem.  It isn’t carefully thought through … just off the cuff as it were.  But:

Let there be a unary operator that, when applied to a statically typed, possibly null value (something that isn’t dynamic), acts like this:  if the value is not null then the operator returns that value unchanged; if the value is null then the operator returns the result of calling the no-parameter constructor for the type of the value.  (Where the type of the value is whatever the compiler thinks it is using standard type inference, where it’s an error if the type doesn’t have a no-parameter constructor, etc.)

(For the sake of argument, assume the operator symbol is a postfixed exclamation point.)

Then you could easily (single character!) coerce any null value to a default constructed value of the proper type.  It would be easy to insert the operator at return statements, or after a method call where you weren’t sure if a null value might be returned, or on the use of a parameter.
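The proposed operator does not exist in C#, but its behavior can be approximated today with an extension method. The sketch below uses hypothetical names (OrNew, OrEmpty). One assumption is worth flagging: System.String has no public no-parameter constructor, so the generic version cannot be applied to strings, and the sketch assumes the empty string is the intended default for them.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for the proposed postfix operator.
public static class NullCoercion
{
    // Returns the value unchanged if it is not null;
    // otherwise returns a default-constructed instance of its static type.
    public static T OrNew<T>(this T value) where T : class, new()
    {
        return value ?? new T();
    }

    // Assumption: for strings, the "default constructed value" is the empty string,
    // because string has no public no-parameter constructor.
    public static string OrEmpty(this string value)
    {
        return value ?? string.Empty;
    }
}
```

Because extension methods are ordinary static calls, they can be invoked on a null reference: with List<int> xs = null, the expression xs.OrNew() yields a new empty list, and with string s = null, s.OrEmpty() yields the empty string.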

And then, the next step is to allow that operator to be used in three more places: After the declared return type of a method, after the declared parameter type of any method parameter, and after the declared type of a property.  (It would work with generic parameter and return types too, if the generic type had the new() constraint.)  This annotation would mean that the compiler would automatically apply the operator at each return statement, and on each annotated parameter on method entry.  The annotation would also be a simple and easily understood way to communicate to the programmer the guarantee that the method never returned null and that, inside the method, the parameters would never be null.
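To show what the annotation might expand to, here is a sketch of a method written the way a compiler could plausibly rewrite an annotated declaration. The annotated syntax in the comment is the hypothetical form from the proposal and is not valid C#. The method name is made up, and the sketch again assumes that the empty string stands in for a default-constructed string.

```csharp
using System;
using System.Collections.Generic;

public static class AnnotationSketch
{
    // Hypothetical annotated declaration (not valid C# today):
    //     static List<string>! Tokens(string! text) { ... }
    // Below is one possible hand-written equivalent of what the compiler would emit.
    public static List<string> Tokens(string text)
    {
        // Operator applied to the annotated parameter on method entry.
        text = text ?? string.Empty;

        // A body with a code path that carelessly produces null.
        List<string> result = text.Length == 0
            ? null
            : new List<string>(text.Split(' '));

        // Operator applied at the return statement.
        return result ?? new List<string>();
    }
}
```

With this expansion, Tokens(null) returns an empty list instead of throwing or returning null, and callers never need their own null checks on the result.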

And, with those annotations in place, if you went ahead and modified the IL to incorporate the annotations (rather than just having it as a C# compiler implemented feature) then the JITter could perform flow analysis (of whatever complexity) and probably eliminate a bunch of explicit invocations of the operator.

The final step:  Annotate the .NET framework with the operator where appropriate, which would be practically everywhere.

Well … your input?  Good idea, I’ve neglected a major flaw, or what?