Saturday, Oct 10 2009

Using a networked drive for Time Machine backups (on a Mac)

You'll find similar information to this around the web, but I find it fiddly enough to piece together reliably, and I need it often enough, that I thought I'd blog about it. That way it at least gives me a single place to look. Maybe it will help others too. Much of the specifics, especially the hdiutil command line and the ifconfig trick, I sourced from this thread in the ReadyNAS forums. Note that the advice is by no means specific to ReadyNAS drives (I have a Thecus NAS myself). Many thanks to btaroli in that thread for the insight.

Time Machine

Time Machine is Apple's easy-to-use backup system, baked into OS X (as of Leopard). Unfortunately it doesn't allow you to back up to a networked drive out of the box. Enabling this ability is pretty easy. Early on there were some reliability issues - largely due to the fact that Time Machine creates a disk image (more specifically, a sparse bundle) on the network drive, and this was prone to corruption if the network connection was disrupted during a backup. I don't know if all the issues here have been entirely resolved, but it does seem more reliable now. Apple's own Time Capsule, which has been specifically designed to work with Time Machine, uses this same method, so it is no longer an entirely unsupported technique.

Enabling Time Machine for network drives

So how do you enable backing up to network drives? Open a terminal window and paste the following in (then hit return, of course):
defaults write com.apple.systempreferences TMShowUnsupportedNetworkVolumes 1
Mounted network drives will then show up in the list of destinations available for storing backups.

Getting a working disk image

Unfortunately this is not always enough. Often, after doing this, Time Machine will appear to start preparing a backup, then fail with a cryptic error code. The error I have seen is:
Time Machine could not complete the backup.
The backup disk image "/Volumes/backups-1/Wall-E.sparsebundle" could not be created (error 45).
"Error 45"? What's that. If I try to create a sparse image myself in the same location I'm told, "the operation could not be completed". This is not much more helpful. If you google there are many references around to these errors - mostly in forums. Many of them are not terrible helpful, or require a lot of knowledge and/ or patience. I still don't really know what the problem is, although I suspect it's something to do with permissions and/ or attributes. Either way the solution generally seems to be to create the sparse image manually using a command called hdiutil. If you get this right then Time Machine will think it created it and just start using it. Simple eh? Well, it's not rocket science - but it does involve piecing a few things together. The name of the sparse bundle has to be something very specific which is made up from a few pieces of information unique to your set-up. I'll now take you through how to find those pieces of information.

Finding the Computer Name

We'll start with the easy one: the computer name. Specifically, this is whatever the computer is named in the Sharing preferences. So open System Preferences, select "Sharing", and copy the name from the "Computer Name" section at the top.
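If you prefer to stay in the terminal, scutil can report the same name (this is just a convenience - the Sharing pane is still where the name is actually set):

scutil --get ComputerName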

Finding the MAC Address

This is the physical address of your network card (not your IP address, which is a logical address). The term "MAC" here has nothing to do with your Mac as a computer - it stands for Media Access Control. Now you have to be careful here. Most Macs these days have at least two network cards! You will probably have an ethernet port (for a network cable connection) as well as wifi. You may also have a USB-based device, such as a mobile broadband device. Regardless of which one you use to connect to the network drive you'll be backing up to, the address we need is that of the first network card (usually the ethernet port). If this seems a bit odd at first, consider the case where you usually connect over wifi, but connect by cable to do the initial backup. If the backup name were dependent on the network connection used, this wouldn't work. The address is only used to identify your computer. Anyway, it turns out there is an easy way to obtain it. Back in the terminal window, type the following:
ifconfig en0 | grep ether | awk '{print $2}' | sed 's/://g'
What's that doing? The short answer is "don't worry, it works". The slightly longer answer is that ifconfig dumps all the information it has about all of its network interfaces. The first interface is called en0, so the command ifconfig en0 dumps information about just that one. The pipe character, |, is the unix instruction for sending the output of one command to the input of the next. So we send the information from en0 to "grep ether", which filters out just the lines that contain the word "ether" - which in this case happen to be where the MAC address is shown. To get that line into the form we need for our filename we pipe it to another command, awk, which picks out the second part of the string, then finally to sed, which removes the colons. Phew. Like I said, it just works. Trust me.

Creating the sparsebundle

Now we have the information we need to create the name of the sparsebundle. The following is the instruction you need to issue to create it. Replace the <computer_name> and <mac_address> placeholders with the information we obtained above. You may need to change the size parameter (320g here) if you have a large drive to back up. The disk image doesn't take up that space to start with, but will grow up to the size you specify here, so use it to set an upper limit. Also, you will be prompted for your admin password (sudo runs the command as the superuser):
sudo hdiutil create -size 320g -type SPARSEBUNDLE -nospotlight -volname "Backup of <computer_name>" -fs "Case-sensitive Journaled HFS+" -verbose ~/Desktop/<computer_name>_<mac_address>.sparsebundle
Note that this will create the sparsebundle on your desktop. Once it's there you can copy it to the desired location on your network drive (then delete it from your desktop). This seems to be more reliable than creating it in place. Once you've done that you can start Time Machine, point it at the drive where the sparsebundle resides, and it will find it and start using it. If it still fails, check that the name is exactly right and that you followed all the steps above carefully. Now sit back and relax, knowing that all your hard work is being backed up.
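For reference, the copy-and-delete step described above might look something like this (a sketch only - /Volumes/backups-1 is the mount point from the error message earlier, so substitute whatever your own network share mounts as):

cp -R ~/Desktop/<computer_name>_<mac_address>.sparsebundle /Volumes/backups-1/
rm -rf ~/Desktop/<computer_name>_<mac_address>.sparsebundle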
Monday, Oct 5 2009

Code formatting in C++ Part Two

In the mid nineties I worked at Dr. Solomon's on their Anti-Virus Toolkit. I spent some time in the virus labs working with live viruses (which I am told is the correct pluralisation). In those days viruses were mostly DOS based and attached themselves either to an exe image or to a disk boot sector. Dr. Solomon's had their own scripting language for describing how a particular virus was identified, and then how it should be removed. This was significant as viruses already had plenty of sophistication, with encryption and polymorphism (each infection looked different). So the guys in the lab would write code in this scripting language on a regular basis.

One thing I noticed as I did my stint in the labs was that the guys who worked there all the time didn't use any indentation. While the script was not really procedural, it did have sections and block scopes, and yet these were never highlighted in the textual layout of the code. There was nothing in the language that prevented it, so when I wrote my own scripts I indented as I thought best and happily showed my code to the head of the lab. His appraisal?

We don't use indentation here

I was shocked! Why would you deliberately hide the structure of the code when there was virtually no overhead in bringing it out?
Of course I knew that different people have different ideas about code formatting, but I hadn't come across such an extreme case before.

As my career progressed I learned more and more that the subject of code formatting was very delicate. Developers may grudgingly adopt a "house style" for the sake of consistency (or, increasingly commonly, just adopt the style of the source file they are editing at the time). But ask them to change what they think is best and you'll be lucky to walk away with all your teeth!

Despite this I did pay attention to my own formatting style. Rather than stick to what I'd always done, if I saw a new style I questioned myself on whether there was anything about it that gave it an advantage. If so I adopted the style. For example, when I started out I used the common style of placing spaces on the outside of parentheses, like so:

if (condition==expected)
{
    doSomething (argument);
}

Note the space before the opening (.

Then I saw someone who put the spaces on the inside:

if( condition==expected )
{
    doSomething( argument );
}

This looked really strange and I wondered why he was so keen to depart from the norm. But after a while I realised that, for me at least, the second version is easier to read. The difference was only slight, of course (or so I thought), but I found that if I was looking at a screenful of code and needed to home in on the interesting bits, having the spaces on the inside of the parentheses helped those parts of the code come out of the screen at me. Logically, the parentheses belong to the function or keyword preceding them, whereas the arguments or expressions passed in were external and varied independently - so the use of whitespace captured that relationship.
At least that's how I see it.

After a few more years I began to think about whether any sort of objective metrics could be extracted on which aspects of code formatting style enhance readability - independently of an individual's "preferred" (i.e. existing, often deeply ingrained) style.

If I could find any such metrics or recommendations they would, I surmised, need to satisfy the following requirements:

They would derive from objective sources that are ideally not connected with software development
They wouldn't necessarily follow my own existing style (ok this isn't a requirement - but if it diverges from my own style it's a good hint)
Other people, picked at random and asked to give the style a try, would come to appreciate it - even if they objected at first

The first requirement is investigated in more detail in the first part of this series - the Speed Reading perspective.

In this article we'll look at the other two.

The best laid code of keyboards and men

Armed with the ideas I'd derived from Speed Reading I decided to tackle the issue of objectively good code formatting styles. This is not to say that the result is perfect, or that it is truly objective in an absolute sense. But I do believe it has some value - not least for solving the problem of how to format function signatures consistently.

Before we look at the specifics, I'll address the second and third requirements from the previous section.

They wouldn't necessarily follow my own existing style

This is the case. Although I continue to prefer my spaces-inside-the-parentheses style (which is compatible), and I already had a preference for alignment and columns, the realisation of my ideas took some of that further, as well as in unexpected directions that took some getting used to.

Other people, picked at random and asked to give the style a try, would come to appreciate it - even if they objected at first

As stated earlier, the numbers may not be statistically significant, but I have asked a number of developers with different backgrounds to give the style a fair try. An immediate problem here is that they may have given this style a fairer try than other contenders would have got. A proper study would have introduced control styles too. Nonetheless I found the results illuminating.

Pretty much without exception (at time of writing) everyone who tried it followed the same pattern:

  1. Immediate reaction: "Ugh! That's horrible! Ok, I'll try it, but then I'm going straight back to my old style"
  2. Day 1: Much the same reaction, some regressions, but generally following the style fairly easily, despite personal feelings.
  3. Day 2: "Actually I'm starting to like it!"
  4. Day 2-3: "This is awesome, I'm going to use this style in all my code now"
  5. Day 3+: "I can't stop myself reformatting all my old code to this new style!"
  6. ...
  7. Year 3+: "meh"

We'll come back to the Year 3 effect later. Other than that the general progression is promising, to say the least. However it's by no means conclusive. In addition to the weaknesses already outlined it doesn't really tell us how effective it is (i.e. whether it has a net positive impact on productivity, beyond the initial "feel good" phase). For this I don't have any hard numbers. What I do have is my own feeling, and that of those that tried it, that code readability and navigability improved greatly.

By this point you're probably wondering if I'll ever get round to describing the style itself. In that case you'll be pleased to know that the next thing I'll cover is just that.

In the next article. See you then!


Thursday, Sep 24 2009

Code formatting in C++ Part One

Code formatting, or layout, is one of the most religious areas of software development. With so many bloody battles fought and lost, most developers learned long ago to avoid the matter altogether. They tend to do this by making consistency the only rule that matters: when in Rome, do as the Romans do. After all, everyone knows that it's all subjective and doesn't actually matter. Or does it? I'm going to present a couple of views that I hope will lead you to look at the matter once again. Some of them are a little clichéd. Some a little more novel. First, one of the clichés:
"code is written for humans to read - otherwise we'd all be writing assembler."
Of course high level languages are not only about human-readability. They are also about portability and economy of expression, among other things. Nonetheless human-readability is certainly a large part of it. So if we use high level languages in order to make our code more readable, should the layout of that code be irrelevant? Looked at from this perspective we may say, "it should be relevant, but code is also for other humans and it's the differences in styles that create the problems - it all balances out to zero". In the context of software development is it worth fighting a religious war over a zero-sum game? What if there was a way to come to a consensus? Would there now be some advantage to looking at layout? If so, how much advantage? These are questions that I hope to answer shortly. In the meantime, here's another cliché:
"code formatting is just personal preference. It has no intrinsic value"
Again - is it worth fighting over something that has no value? But what if it did? Sometimes it can be useful to look at the extremes. Consider the following code:
int main(int argc,char*argv[]){printf("hello world\n")}
It's not too difficult to spot the familiar "first C++ program" example, even if it's in a less familiar layout. But are you sure it's correct? It shouldn't take too long to spot the bug (and even if you don't, the compiler will point it out to you), but now look at the same code expanded into a more canonical form:
int main(int argc, char* argv[])
{
	printf("hello world\n")
}
I would bet that, for the majority of C/C++ developers, spotting the missing ; in the second example was faster, and perhaps even more "automatic", than in the first. Even if you agree with that you may wonder, still, how much it matters. We're talking seconds, or perhaps milliseconds, of difference. And that leads us to another cliché:
"The majority of development time is spent debugging code, rather than writing it"
This is often wheeled out when encouraging the use of longer, more descriptive identifier names, or of a more verbose but explicit way of doing something. Actually those areas are somewhat related to our topic, but here the point is: if actual coding time is a small part of overall development time, do a few milliseconds here, a second or two there, make any difference (and remember that was a fairly extreme case)? These are all good questions. I'm now going to explore some possible answers to them. What I'm going to present here is my own view, based on my own experience and research, as well as the experience of a number of others who have tried my techniques. However the "number of others" is not statistically significant enough (nor were the conditions controlled) to call it a study, so this remains merely a theory. I encourage you to consider this material and let me know how you get on with it.

How the eyes and brain read

A few years ago I studied (or, more accurately, started studying) speed reading. A significant portion of what you learn is understanding how the eye moves across the page, takes information in, and works with the brain to turn this into the experience we know as reading. By understanding these principles we can adjust our reading style to take advantage of their strengths (and play down their weaknesses).

As I learnt more I wondered whether the same ideas could be applied to writing as well as reading. It seemed logical that if you write in a way that more closely matches how the eyes and brain read best then reading will be easier, and potentially faster. As I continued studying I found that this is the case. One commonplace example is newspaper and magazine text, which is arranged in columns because columnar material is easier for the eyes and brain to digest when reading. From here I began to wonder if this knowledge should affect the way we write code.

Before we look at my conclusions I'll summarise the key speed reading insights I thought would be relevant. Perhaps most important is that the eyes don't move across the page in a smooth, flowing manner. Instead they jump in discrete fixations. At each fixation the eyes transmit a chunk of information to the brain. The size and content of each chunk varies, and this is one of the areas that a speed reader will exercise, attempting to take in more information at each fixation to avoid wasting too much "seek time". An average, untrained (in the speed reading sense) reader will take in about 2-4 words per fixation. For longer words this will decrease, and if the words are unfamiliar it may drop below the one-word-per-fixation level.

Following from these is the insight that, as well as several words along the same line being taken in in a single fixation, multiple lines may be taken in. But that's crazy talk! When you're reading you don't read the lines above and below, do you? (Unless you're regressing, which is a bad habit that speed readers try to overcome as soon as possible.) Well, speed readers will push the envelope here, but even the rest of us will still be taking in more information than we are consciously reading. The smooth, flowing, word-by-word reading experience that we perceive is an illusion. There is actually a disconnect between the information being captured by the eyes, transmitted to the brain, assembled and interpreted, and the perception of words flowing through our conscious minds.

We're in danger of getting too deep here. Let's stick with the knowledge that we can take in several words, horizontally and vertically, at each fixation. We can add to that another counter-intuitive nugget from the speed reading world. Good speed readers don't necessarily scan left-to-right then down the page (assuming a western reading context), but may scan in different directions according to different strategies - e.g. scanning down the page, then back to the top for the next column - even if they then have to mentally reassemble the material back into the original word order (I never got to this stage).

None of this addresses the question of whether this is even worth looking at. Are we trying to solve a problem that doesn't exist? We're constantly warned against premature optimisation, but can that apply to our approach to code layout too? Actually we have touched on one relevant principle already. I'll highlight it again here:
The smooth, flowing, word-by-word reading experience that we perceive is an illusion
Why is this relevant? Well, for one it tells us that what we think is happening is not necessarily what is actually happening. A lot of processing is occurring before the material we are reading is even presented to us consciously. With practice we can get better at reading unoptimised material - so much so that we are unaware of it. That doesn't mean the processing isn't happening. Processing has a cost associated with it. It tires us - in particular it tires parts of the brain that we tend to use for other programming-related activities too, such as solving certain types of problems, or extracting relevant details from a sea of information. It's almost like offloading some heavy number-crunching to a GPU, then finding that your refresh rates are suffering.

Another factor is the concept of flow. When we are thinking about one problem and are focused on it we are in a certain flow. The more we are then distracted by the mechanics of the task - consciously or subconsciously - the more it can knock us out of that flow. Again, we may be so used to this that we hardly notice. Pay attention next time you are stuck on a problem and find yourself losing track the more you have to hunt around through the code.

In summary, we need to get out of the trap of just looking at the numbers (a second here, a few milliseconds there). They may bear little relation to the real factors at play. There's more we can learn from the world of speed reading and eye-brain coordination, but we now have some things to go on that can lead us to conclusions about code formatting.

Code fixation

First, the way eye fixations are able to take in multiple words both horizontally and vertically suggests that islands of related code should be readily assimilated in one or two fixations. Such islands can be created through logical grouping, and effective use of whitespace both to separate them from other islands and for alignment purposes.

Alignment touches on another eye-brain insight that I've not mentioned yet. Briefly, reading speeds can be enhanced by the use of a guide. This may be a moving object (such as a finger or pencil), but just having a hard edge can be helpful too. However, too much of the same thing can make it harder to keep track of where we are, so if the hard edge is too long it loses some of its effectiveness. It follows, therefore, that small blocks with hard edges achieved through alignment should help the eye to more readily distinguish the associations between sections of code.

A lot of these are things we already do to some extent. Our use of code blocks and indentation helps visually organise code to make it easier to take in - but can we take it further? One good example is blocks of variable declarations. I'll be using C++ as an example here, as that has been my focus in this, but most of what I'm talking about applies to most, if not all, programming languages. I'd argue that you'll notice the difference in C++ more than most. So, here is a typical stretch of variable declarations:
char* txt="hello";
int i = 7;
std::string txt2 = "world";
std::vector<std::string> v;
Already this is organised into a little island. If the declarations were scattered around we would lose that aspect. Whether that makes sense for your application is another matter. I'm not suggesting you lose the benefit of declaring variables closer to where they are used. What I really want to illustrate is what happens if you add a bit of whitespace for alignment purposes:
char*                     txt  = "hello";
int                       i    = 7;
std::string               txt2 = "world";
std::vector<std::string>  v;
Hopefully the use of a monospaced font here was enough to preserve the alignment for you. Do let me know if it wasn't and I'll try a different way. Now, most of us have probably seen code like this. Maybe we already prefer such a style. But quite a number of developers seem to be against it - either actively (they really dislike it), or at least with concerns over the extra overhead of writing and maintaining in this style. Well, if you're one of that number, please bear with me. There is more to get out of this yet. Also, as we'll see, I believe the big wins are actually in other areas that we're building up to.

So let's analyse the properties of this format for a moment. We have three columns here. The first column contains the types, the second the names, and the third the initially assigned values, if any. One potential problem is that, as each column is as wide as the longest field in that column, the more columns we have the more horizontal space we'll end up using. In C++, between templates and namespaces, this can get out of control quickly. As we'll come onto a bit more later, if you have to scroll to take in a line you'll undermine any efforts to make things more readable. Another problem is that it amplifies the objection over the writing and maintenance overhead of such a style: in this simple example we have seven points at which whitespace must be maintained for alignment purposes! A compromise is to use just two columns:
char*                     txt = "hello";
int                       i = 7;
std::string               txt2 = "world";
std::vector<std::string>  v;
This still allows the identifier names to be easily scanned, but at some loss of clarity with respect to the initialised values. Arguably the identifier names are the most important element here (from the perspective of fast lookup), and are the most obfuscated in the original example, so this is still a big improvement. How big? We're getting to that. At this point I just wanted to present some options and examples, with a little rationale. We'll build on these shortly.

For the sake of arguments

Blocks of variable declarations are common, but spreading them out through a function is perhaps even more common - making the above examples less relevant. However, there are a couple of places where we do still regularly see groups of related variable declarations in one place. One is in class definitions, where member variables are usually grouped together. Immediate readability of the names is probably even more important here, as we tend to flip back to a class definition in a header file (in the case of C++) just for a moment to get the names.

But the other place, and where I'd like to focus further, is function parameter lists. Parameter lists are obviously important. They define the interface between the function (or method - I'll use "function" here to mean both) and its caller. When looking at a function signature (usually at the prototype in a header file) a caller can see what arguments need to be passed in. When looking at the function body, the parameter list shows you what has been passed in. If these two statements sound obvious, then why do we so often neglect how these things are presented - as if they were second class citizens in our code?

How often have you seen (or written) a function that takes some number of parameters, where the parameter list is all on one line and spans more than a typical screen-width - sometimes several screen-widths? Often an attempt is made to rectify the situation by splitting the list over one or more lines, sometimes even one parameter per line (but by no means always). Even then the tendency is to place the first parameter on the same line as the function name, then try to line up the subsequent parameters with the first. Something like this:
void SomeNamespace::SomeDescriptiveClassName::LongishMethodName( int firstParameter,
                                                                 const std::string& secondParameter,
                                                                 const std::vector& thirdParameter,
                                                                 Widget fourthParameter );
Does this look familiar? Although this might look like an extreme example, I don't know about you but I see this sort of code all the time. And remember - this is where someone has made an attempt to split across lines and use alignment! Just today, on my current project, I saw a function signature that was 239 characters wide - and that was by no means the worst case in the project. And we haven't even added namespaces yet (except for std)! I think you'll agree that this is a big readability issue.

The issue is made worse by the lack of consistency which often accompanies such attempts. Sometimes multiple parameters are on one line, sometimes split across several. Sometimes aligned, sometimes not. Even if the developer is following other conventions consistently, this one seems to slip through the cracks - probably because it is often not defined, and it can be difficult to know exactly what to do in a consistent way.

I have a theory about this state of affairs. It's an important theory (even if it turns out not to hold in this case) because it touches on a principle of why people get so religious about code formatting in the first place. I'm going to expand on that a bit later, but for now the theory is this:
Developers can't find a consistent style because the most obvious consistent style seems somehow "wrong"
The underlying principle, which I'll come back to, is:
Regardless of what is objectively good, what is familiar always wins out
So what is the one true way when it comes to formatting function signatures? Well, apart from "one true way" being an overstatement, I'm going to leave my specific recommendation for a subsequent article (how else could I get you to come back?)

End of scope

Before I finish for now I want to come back to the issue of how important all this is in the first place. We touched on three areas that I believe are worthy of consideration:
The overhead of reading code that is "unoptimised" for reading may be more significant than we realise, due to the subconscious processes at play - not just in time, but in energy and focus.
Even relatively small interruptions to our state of flow can have a noticeable impact on our productivity. Sometimes it can even become a serious bottleneck - think "butterfly effect".
Having a consistent style may actually speed up code writing time, but optimising that style for the way our eye-brain connection works can lead to significant improvements in code reading and comprehension.
In the next article I will present my recommendation for function signature formatting and return once again to the question of how much difference it makes, including some anecdotal evidence.
Thursday, Sep 24 2009

Elegant XML parsing with Objective-C

[This article originally appeared on my old metatechnology blog, back in April 2009]

If you write for the Mac you get two Objective-C XML APIs: a tree-based, DOM-like interface, and a SAX-like, event-driven interface.

If you write for the iPhone you only get the SAX interface. For many purposes this should be all you need.

I was a little disappointed, though, when I first looked at the NSXMLParser class, that it didn't really take advantage of what Objective-C could offer. Probably this was a performance trade-off, as we'll discuss later, but as it is there are only two benefits I can see to using it over a lower level C API: (1) string conversions are done for you, and (2) attributes are already collected into dictionaries.

I feel you can do so much more, though. So I did!

In this post I'm going to walk through my own wrapper for NSXMLParser, which I developed while writing my iPhone game, vConqr, and at the end give you access to my complete source, which is not terribly large or complex but due to some simplifying assumptions may need some tweaking to meet your needs. I'll bring those assumptions out as we go through.

What's in a name?

To start with, what's wrong with NSXMLParser? Well nothing as far as it goes. It looks like most SAX-like APIs. You provide a delegate object with methods like:

  -(void)    parser: (NSXMLParser*) parser
    didStartElement: (NSString*) elementName
       namespaceURI: (NSString*) namespaceURI
      qualifiedName: (NSString*) qName
         attributes: (NSDictionary*) attributeDict

This is probably the most important method in the interface. It will be called every time the opening tag for a new element is found. As you can see, you get passed the parser object itself (which seems a waste - if you wanted it you'd surely just hold a reference to it at the start), the name of the element, a couple of namespace strings, and all the attributes as a dictionary.

Aside from the redundant parser object, I also usually find that for my own small scale, app-specific XML formats I don't bother with namespaces. If we're wrapping this API we'll probably drop all of those (this is where the first simplifying assumption comes in - if you do need the namespace data it's trivial to extend my code to add it back in, and even make it optional, as we'll see).

That leaves the element name and attributes. It doesn't get much simpler than that, does it? Well think about this for a moment. What's the first thing you're going to do in your handler method? I suspect there's not much you can do that's not element specific (actually there are a few things, but we'll even pull those out shortly). So you're probably going to need to switch on the element name.

Of course, in Objective-C you can't write a switch statement on a string, so you'll end up with something like:

  if( [elementName isEqualToString: @"elephant"] )
  {
    // Do something with elephant elements
  }
  else if( [elementName isEqualToString: @"giraffe"] )
  {
    // Do something with giraffe elements
  }
  else if ( /* next check */ )
  {
    // ...

Did you remember not to compare strings with ==? (This catches out a lot of newer, and sometimes not so new, Objective-C programmers.) Using == would compile but silently fail at runtime (it compares the pointers, not the string contents).

So we have needless, repetitive boilerplate with at least one thing that's easy to get wrong. Furthermore, if this is any more than a couple of checks and a small amount of code in each if block, you'll almost certainly want to forward on to more specialised methods anyway, such as handleElephantElement.

Can we do better? It seems that what we want here is a way to do dynamic dispatch of methods based on names we don't know until runtime. If that's not what a Dynamic Programming Language gives us then I don't know what it is.

Objective-C, or at least NSObject, has a method called performSelector: that we can use for dynamic dispatch. But performSelector: takes a selector (of type SEL) as its argument. Can we get a SEL from a string? Yes we can - there's a function called NSSelectorFromString() which does just that! Now, if we build our selector dynamically we can call it - but what happens if we don't implement a handler for every element? We'll get a runtime error, which will usually result in the application terminating. That's a bit harsh. Fortunately the common idiom of calling respondsToSelector: first serves us well here. Putting this all together we get something like:

SEL sel = NSSelectorFromString( [NSString stringWithFormat:@"handleElement_%@:", elementName] );
if( [delegate respondsToSelector:sel] )
{
  [delegate performSelector:sel withObject: attributeDict];
}

Don't forget to include the : at the end of the selector name as you build it (so we can pass the attributes as the, currently, sole argument).

So now, with all that boilerplate pushed to the generic wrapper, we'll be able to write handlers like this:

-(void) handleElement_elephant: (NSDictionary*) attributes;
-(void) handleElement_giraffe: (NSDictionary*) attributes;

which I think you'll agree is much cleaner.

We can do the same for end tags, except that the attributes are not needed. I called this method handleElementEnd_{name}: (where {name} is the name of the element) and used the same techniques.

So, that's elements and attributes handled, but there's one more entity that we need before we can parse any useful documents: Text nodes.

Don't text me, I'll text you

There are a number of aspects of text nodes that make them tricky to deal with in a SAX-like interface.
The first problem is what to do with whitespace. Often, within a text node, whitespace should be preserved - but at the beginning and end you usually just want it stripped.

Text nodes can occur at almost any point in a document - inside the root element but outside of element tags. That means that even the newlines and indentation spaces you probably have between adjacent element tags will be represented as text nodes.

A common solution to the first issue, which may solve the second too, is to trim whitespace at the extremities (beginning and end), or better still, make it an option. If you do trim, then you need to make sure you don't raise an event for an empty text node. With text nodes trimmed, and empty text nodes suppressed, no events will be raised for formatting whitespace between tags.
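As a rough sketch of how that option might look (incomingText and trimTextWhitespace are hypothetical names of mine - an incoming text chunk and an option flag on the wrapper - not part of NSXMLParser):

// incomingText is the text reported for the current node;
// trimTextWhitespace is a hypothetical option flag on the wrapper
NSString* text = incomingText;
if( trimTextWhitespace )
{
  text = [text stringByTrimmingCharactersInSet:
                 [NSCharacterSet whitespaceAndNewlineCharacterSet]];
}
if( [text length] > 0 )
{
  // non-empty after trimming, so it's safe to raise a text event
}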

The second problem (which you also need to allow for before trimming), is that a single element may contain any number of text nodes. According to most SAX-like API specs (and NSXMLParser is no exception here), there is no guarantee how the text that belongs to an element will be split up, if at all. In practice you can usually count on getting a separate event for text before any child elements, each gap between child elements, and any more text after child elements. If any of those blocks of text are large (for some value of large) it's likely they will be broken down further for memory constraint reasons (imagine that the input processor probably has a buffer it's writing text nodes into. I've certainly implemented a parser exactly that way before).

Attaching any meaning to how text is broken up is probably misguided at best, so before you process your text nodes you will almost certainly want to collate the text nodes into a single string (unless you're expecting very large blocks of text. We'll make the simplifying assumption here that that's not the case).

So, to collect our text nodes we'll need to maintain some state as to the current aggregated text. This problem is complicated if you have a mixture of text and child elements, and the text may appear before or after (or between) child elements. To manage this you'll have to maintain some sort of stack of text blocks, mapped to elements - with unbounded memory requirements.

In practice, mixing child elements and non-whitespace text is uncommon, so one way to simplify this is to ignore the whole problem. If we just take the first, or last, text nodes (ie before or after any potential child elements) then at worst the text will be truncated. For my purposes, where I was completely in control of the schema this was the route I took and is reflected in the code here (any time we see a child node we reset our text buffer). If you want to implement the more general case you'll have to look at the stack-based idea (perhaps a future blog article).

Tracking text nodes this way becomes quite simple. I declare an instance variable to hold the current text value:

NSMutableString* currentTextString;

Now to collect the text we need to implement the method parser:foundCharacters: on our delegate. This will simply append the incoming string to our current string state, or create a new mutable string copy as necessary:

-(void)     parser: (NSXMLParser*) parser 
   foundCharacters: (NSString*) string 
{    
  if( string && [string length] > 0 )
  {
    if( !currentTextString )
    {
      currentTextString = [[NSMutableString alloc] initWithCapacity:4];
    }
    [currentTextString appendString:string];
  }
}

The choice of 4 characters to preallocate was entirely arbitrary and could be tweaked to your needs. Also, you might want to check whether this is the first text node being appended and it is entirely whitespace (and whitespace at the ends is being trimmed); if so, don't even bother creating a new mutable string just to throw it away. My version is simple and works for my needs.

In order to implement my simplification of throwing away any text seen before a child node, we need to add a bit to parser:didStartElement:namespaceURI:qualifiedName:attributes: to release the string object if one exists:

if( currentTextString )
{
  [currentTextString release];
  currentTextString = nil;
}

Now how do we get the text we've accumulated to the delegate? One way would be to create a new event method - handleText:, for example. But you'll almost always want to tie the text up to the current element, so you'd have to track that and pass it too. I decided that this already looks so much like our end element handler that I just made it look like an extended version:

 -(void) handleElementEnd_tiger: (NSDictionary*) attributes 
                       withText: (NSString*) text;

Note that if an end tag is found and there is non-empty text, both events will be fired (if implementations exist), so you can get both handleElementEnd_tiger: and handleElementEnd_tiger:withText:.
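To make that concrete, the end-tag dispatch in the wrapper might look roughly like this (a sketch under the assumptions above - in particular it passes nil for the attributes argument, since NSXMLParser doesn't supply attributes on end tags; a fuller wrapper might stash the start-tag attributes instead):

-(void)   parser: (NSXMLParser*) parser
   didEndElement: (NSString*) elementName
    namespaceURI: (NSString*) namespaceURI
   qualifiedName: (NSString*) qName
{
  // fire the plain end-element handler, if the delegate implements it
  SEL endSel = NSSelectorFromString( [NSString stringWithFormat:@"handleElementEnd_%@:", elementName] );
  if( [delegate respondsToSelector:endSel] )
  {
    [delegate performSelector:endSel withObject:nil];
  }

  // if any text was accumulated, also fire the withText: variant
  if( currentTextString && [currentTextString length] > 0 )
  {
    SEL textSel = NSSelectorFromString( [NSString stringWithFormat:@"handleElementEnd_%@:withText:", elementName] );
    if( [delegate respondsToSelector:textSel] )
    {
      [delegate performSelector:textSel withObject:nil withObject:currentTextString];
    }
  }

  // reset the text buffer ready for the next element
  [currentTextString release];
  currentTextString = nil;
}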

You could use the same technique of looking for two versions of a method, one with more arguments, to optionally pass namespace data, if you like.

So now, with elements, attributes and text nodes sorted we can start doing some basic parsing. However we soon run up against another issue.

In my element

When we write an element (start or end) handler, with or without text nodes, we know our immediate context. We at least know the current element name.

However XML is a hierarchy. While it is certainly possible to write XML such that any given element name can occur at exactly one level of the hierarchy, this is a bit limiting to enforce all the time. For example, in vConqr I have an element called path that contains the vector coordinates of part of the border of a territory. But my borders are split up into three types - internal, external and continent (where continent means an internal border that separates two continents). There are two ways to represent that relationship in XML. One is to make the border type a property of the border element (type="internal" etc). The other is to build it into the name (internalBorder, externalBorder, etc).

I chose to go with the latter because maintaining a stack of element names is easier than maintaining a stack of elements and attributes. To properly implement the element and attributes stack I'd be going down the road of building partial DOM objects in memory - which in the general case is not a direction I wanted to go in.

But just keeping a stack of element names is much simpler, and so I added this to the wrapper class.

I just have an instance variable for the stack:

NSMutableArray* elementStack;

and I push and pop the element names on and off the stack in parser:didStartElement:... and parser:didEndElement:... respectively.
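Roughly, those two fragments amount to no more than this (just the lines that touch the stack, everything else elided):

// in parser:didStartElement:namespaceURI:qualifiedName:attributes:
[elementStack addObject:elementName];

// ...

// in parser:didEndElement:namespaceURI:qualifiedName:
[elementStack removeLastObject];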

Access is by a simple method:

-(NSString*) ancestorElementAtIndex:(int) index
{
  return [elementStack objectAtIndex: elementStack.count-(index+1)];
}

Now, in my handleElement_path: method, I can call [parser ancestorElementAtIndex:1], get back the border element that the path belongs to, and act appropriately.
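For illustration, such a handler might look something like this (hypothetical code - the border element names follow the scheme described above, parser is a reference the handler's class keeps to the wrapper, and the bodies are just placeholders):

-(void) handleElement_path: (NSDictionary*) attributes
{
  NSString* borderType = [parser ancestorElementAtIndex:1];
  if( [borderType isEqualToString:@"internalBorder"] )
  {
    // treat this path as part of an internal border
  }
  else if( [borderType isEqualToString:@"externalBorder"] )
  {
    // treat this path as part of an external border
  }
  // ... and similarly for the continent border type
}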

Thanks a bundle

With the handler code nicely simplified all that remains is to kick the parser off. This also has a fair bit of boilerplate associated with it. Again, I've made a few assumptions that are appropriate to the way I tend to use this. In particular I'm assuming that I'm targeting the iPhone, and that the XML files in question are in my application bundle. In my next version I have to deal with XML that I download elsewhere too, so I have an alternative version of the parser launching method that handles that.

For now, though, I have loadFromBundle:, which starts with the following code that generates the filename, looks it up in the application bundle and initialises the NSXMLParser with it:

NSString* filename = [NSString stringWithFormat:@"%@.xml", name];
NSString* path = [[[NSBundle mainBundle] bundlePath] stringByAppendingPathComponent:filename];

NSURL* url = [NSURL fileURLWithPath: path];
NSXMLParser* parser = [[NSXMLParser alloc] initWithContentsOfURL:url];
Then some more simplifying assumptions:
[parser setShouldProcessNamespaces:NO];
[parser setShouldReportNamespacePrefixes:NO];
[parser setShouldResolveExternalEntities:NO];

Remember, we're not dealing with namespaces. We're also not interested in referencing external entities.

The wrapper class subclasses NSXMLParser, so the delegate is self (the wrapper maintains its own delegate).
And we can start the parsing:

parser.delegate = self;
[parser parse];

The parser will call back the low level handlers on the subclass, which will translate those to the dynamic handler names we discussed earlier, maintaining the text nodes and element path stack, and keeping the application code simple, elegant and expressive.


Thursday, Sep 24 2009

ACCU 2009 Conference

[This article originally appeared on my old metatechnology blog, back in April 2009]

First I'd like to say that the #accu_conf hashtag for the conference was a great success, yielding 11 pages of tweets at the current count. I strongly believe that twitter is becoming more important every day, and that its power to change the way we look at communication has yet to be fully assessed.

As for the conference programme itself - I can reaffirm my statement that this is the premier developers' conference in the world! What really sets it apart is that the content is almost exclusively about programming, rather than about specific tools and libraries, and the delegates genuinely do subscribe to the "professionalism in programming" ethic of the ACCU.

To give you a flavour, here's a tweet from "Uncle Bob" (Robert Martin - of Agile Alliance and Object Mentor fame):

This is probably the geekiest conference I've been to. Lots of coding, lots of interesting discussions. Wow
And later,
And now for the long trek home. I wish I could stay, this is a really fun conference.

Sadly, for a combination of reasons, I missed about half the sessions this year - but what I did see were at least as high quality as I have come to expect.

It was especially revealing hearing about the threading support being added to C++0x (already starting to be referred to as C++1x) - then shortly after hearing about the concurrency support being added to D 2.

In the former case the focus was on catching up in terms of the memory model and library primitives. Welcome additions, to be sure. However D continues to move forward in ways that only a language not constrained by its own legacy can. Its contributions this year expanded on last year's functional support (with transitive immutable and const modifiers) to add keywords that mark variables as shared - thus making explicit the communication paths whose implicitness has been the bane of just about every imperative language before it.

D's concurrency was presented to us by Walter Bright (who first designed the language). However, I found it amusing how Andrei Alexandrescu, in both his presentations, appeared to be talking about C++, but held D up as being the true answer to just about every tricky point he raised. More subtle than Russell Winder's continuing "C++ serves no useful purpose" theme, to be sure, but no less damning!

As usual, half of the real content of the conference took place after hours in the hotel bar (or restaurants in Oxford). John Lakos' absence from this ritual was, therefore, all the more noticeable. He arrived on Friday, apparently jet-lagged, and had to rush through his 400-odd slides at such a rapidly increasing rate that it seemed his head would explode before he finished! I can report that his head did survive to see another day, but it wasn't seen in the bar.

Perhaps he felt he wasn't required to put a new puzzle out this year, as there was a cryptography contest going on already in aid of raising funds for Bletchley Park Museum.

In all the message is clear. If you weren't at the conference you missed out on something rather special. Make every preparation now to be there next year.
