Entries in style (3)

Wednesday
Nov 11, 2009

Code formatting in C++ Part Three

In this article I am going to present my recommendation for a C++ code formatting style (although it applies to most free-formatted languages, especially those that are C/C++ like).

I have covered the background to most of my choices in some detail (some would say too much - but I invoke Blaise Pascal here) in the first two articles of this series, the rather consistently named:

Code formatting in C++ Part One
Code formatting in C++ Part Two
Since the style I am about to present is a little unusual in places, and arbitrary in others, I encourage you to take a look at the previous articles if you have not already done so. Also, where arbitrary looking numbers are used, follow the spirit of the rule rather than the letter (or, in this case, number).

Page width

The proposals below refer often to page width, and by this I mean the number of characters that you would normally expect to be visible while reading and writing code in an editor. For example, it used to be common to keep within 80 characters (or fewer) due to text mode screen sizes. These days windows can easily be sized to much greater character widths, but I would still recommend adopting a page width of between 80 and 100 characters. It is not a hard limit, although it is more important in some areas than others. Personally I still try to stick to 80.
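
As a rough illustration of treating the page width as a soft limit, the sketch below reports which lines of a source text run past a chosen width. The function name `linesOverPageWidth` and the default of 80 are assumptions made for this example only; a tab is counted as a single character, which understates its on-screen width.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

////////////////////////////////////////////////////////////////////////////////
std::vector<std::size_t> linesOverPageWidth
(
    const std::string&  source,
    std::size_t         pageWidth = 80
)
{
    std::vector<std::size_t>  offenders;
    std::istringstream        in( source );

    std::string  line;
    std::size_t  lineNumber = 0;

    while( std::getline( in, line ) )
    {
        ++lineNumber;
        if( line.size() > pageWidth )
            offenders.push_back( lineNumber );   // 1-based line number
    }
    return offenders;
}
```

Since the limit is a guideline, a check like this is probably best used to flag lines for a second look rather than to reject them outright.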

Proposal 1: Formatting variable declaration blocks

char*          txt = "hello";
int            i = 7;
std::string    txt2 = "world";
std::vector<std::string>            v;
std::map<std::string, std::string>  m;

Variable declarations should be grouped together where possible (without violating the principle of locality - i.e., keeping them close to first use) in "islands" of no more than 16 lines at a time. If there are more than 16 variable declarations in a group then use single lines of whitespace to break them up. Try to keep variables of similar length in the same block.

Within each block, align the variable names as much as possible. Where there is a large variation in type name length, sub-group longer names and shorter names together and align variable names in sub-group blocks instead (note how the vector and map, above, are separated out this way).

This proposal applies both in function body code and within class declarations (and at global scope, if you must).
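
To show how this might look inside a class declaration, here is a small sketch. The `Session` struct and its members are invented purely for illustration; the short type names form one island, and the longer container types are sub-grouped and aligned separately.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical example type: the names are illustrative only.
struct Session
{
    // Short type names share one alignment column...
    int          id = 0;
    bool         active = false;
    std::string  user;

    // ...while longer type names are sub-grouped and aligned on their own.
    std::vector<std::string>            history;
    std::map<std::string, std::string>  settings;
};
```

The blank line between the two sub-groups does double duty: it separates the islands and keeps each alignment column short.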

Proposal 2: Formatting function signatures

Function signatures come in two forms, and we make a distinction here. The first form is the prototype, usually found in a header file, if at all. The second form is part of the definition and is followed by the function body (if applying this to a language without the separate prototype stage, the first form does not exist, of course). We shall start with the second form:

Function definition signatures

////////////////////////////////////////////////////////////////////////////////
void ClassName::MethodName
(
	char*          txt,
	int            i,
	std::string    txt2,
	std::vector<std::string>            v,
	std::map<std::string, std::string>  m
)
const
{
	// ... method body
}

The example here is for a method of a class, but the formatting would be the same for a free function.

The return value and function or method name (along with any modifier prefixes - e.g. static, or namespace prefixes) appear on their own line.
Next are the parentheses - both of which appear on their own line - indented to the same level as the preceding line.
Within the parentheses, each on their own lines, are the arguments - formatted according to Proposal 1. Any post-fix modifiers (just const, in this case) appear on their own line, followed by the function or method body.

This is almost certainly the most controversial proposal and I will give my additional rationale in the next article.

If a comment block does not already precede the signature, use a line of forward slashes for about a page width (e.g. I run them up to the 80th column).

An additional point worth mentioning here is that this style lends itself well to being used with the Doxygen "inline comment" method of documenting function and method arguments.
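
As a sketch of how that might look, the example below attaches a trailing `/**< ... */` comment to each argument, which is the form Doxygen documents for inline parameter descriptions. The function `joinWords` and its parameters are hypothetical and exist only to show the layout.

```cpp
#include <cstddef>
#include <string>
#include <vector>

////////////////////////////////////////////////////////////////////////////////
/// Joins the given words into one string (hypothetical example function).
std::string joinWords
(
    const std::vector<std::string>&  words,     /**< [in] words to join */
    const std::string&               separator  /**< [in] put between words */
)
{
    std::string  result;

    for( std::size_t i = 0; i < words.size(); ++i )
    {
        if( i > 0 )
            result += separator;
        result += words[i];
    }
    return result;
}
```

Because each argument already sits on its own line, the comments form a third column without pushing the signature past the page width.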

Function prototype signatures

void MethodName
	(	char*          txt,
		int            i,
		std::string    txt2,
		std::vector<std::string>            v,
		std::map<std::string, std::string>  m ) const;

If a separate prototype is required there are some differences to the formatting. This might seem a little odd but I'll provide the rationale in the following article.

First, the parentheses appear in-line with the arguments block, rather than on their own lines. Furthermore the whole block itself is indented with respect to the function name. Finally, any post-fix modifiers (which may include the pure virtual marker here) appear on the same line as the closing parenthesis.

Note that no line of comment characters precedes the signature. Ideally functions and methods would be fully documented at the implementation site and the documentation extracted from comments using a tool such as Doxygen. There are reasons to consider keeping the prototypes clear of too many comments, but obviously you can put them here if you are sure that is best for you.
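
To make the contrast between the two forms concrete, here is a minimal sketch with a pure virtual prototype and a matching definition. The `Greeter` and `PoliteGreeter` classes are invented for this example. The destructor is left on one line for brevity, which the style described here would arguably split as well.

```cpp
#include <string>

// Hypothetical interface, declared in the prototype style.
class Greeter
{
public:
    virtual ~Greeter() {}

    virtual std::string Greet
        (   const std::string&  name,
            int                 repeat ) const = 0;
};

// Hypothetical implementation class.
class PoliteGreeter : public Greeter
{
public:
    std::string Greet
        (   const std::string&  name,
            int                 repeat ) const;
};

////////////////////////////////////////////////////////////////////////////////
std::string PoliteGreeter::Greet
(
    const std::string&  name,
    int                 repeat
)
const
{
    std::string  result;

    for( int i = 0; i < repeat; ++i )
        result += "Hello, " + name + "! ";
    return result;
}
```

In the prototypes the pure virtual marker and `const` stay on the closing parenthesis line, while in the definition `const` gets a line of its own.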

Proposal 3: Function calls

If a function call fits within a normal page width then write it on one line. Long lines should be split across lines according to one of the following two examples:

LongMethodCall1( "some text",
                 aString,
                 anInt,
                 anotherArgument );

string returnVal = ReallyLongMethodNameCall
		( "some text",
		  aString,
		  anInt,
		  anotherArgument );

In both cases the arguments are aligned, one line each, with parentheses on the first and last lines. The first argument should share the line with the function or method name itself, unless that would push the argument list so far across that any of the arguments ends up beyond the page width.
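
The three cases (one line, split beside the name, and dropped to the next line) can be sketched together as follows. `BuildGreetingMessage` and `demonstrateCallStyles` are hypothetical names chosen to be long enough to force the splits at an 80 character page width.

```cpp
#include <string>

////////////////////////////////////////////////////////////////////////////////
std::string BuildGreetingMessage
(
    const std::string&  greeting,
    const std::string&  name,
    int                 exclamationMarks
)
{
    return greeting + ", " + name + std::string( exclamationMarks, '!' );
}

////////////////////////////////////////////////////////////////////////////////
std::string demonstrateCallStyles
(
    const std::string&  name
)
{
    // Fits within the page width: keep it on one line.
    std::string  shortCall = BuildGreetingMessage( "Hi", name, 1 );

    // Too long for one line, but the arguments still fit beside the name.
    std::string  splitCall = BuildGreetingMessage( "Good afternoon to you",
                                                   name,
                                                   2 );

    // The first argument would overflow the page width: drop the list down.
    std::string  droppedCall = BuildGreetingMessage
            ( "Good evening and welcome to the show",
              name,
              3 );

    return shortCall + "|" + splitCall + "|" + droppedCall;
}
```

The choice between the second and third form depends only on where the longest argument would end, so the same call may move from one form to the other as its arguments change.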

General Principles

The proposals above are deliberately narrow in focus, concentrating on those areas that are often left out of standards, or not sufficiently described, and where using an ad-hoc approach is often less than satisfactory. However there are some simple emergent themes that can be carried through to other areas of code:

  • Types and identifier names are separated into columns through alignment
  • Code is kept within a page width where possible. This is especially significant for code that you need to refer to at a glance, such as function prototypes - and is often where it is most overlooked!
  • For "structural" code, such as function signatures, consistency is especially important, even in places where it seems unnecessary (e.g. splitting short function signatures across multiple lines - even empty constructors). For implementation code the choice of when to split can be guided by the page width.
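
As an illustration of the last point, the sketch below splits even the empty parameter lists across lines. The proposals above do not spell out the exact layout for an empty list or for a constructor initialiser list, so the form shown here is only one possible reading of the style, and the `Counter` class is hypothetical.

```cpp
// Hypothetical class used only to illustrate the layout.
class Counter
{
public:
    Counter
        ( );

    void Add
        (   int  amount );

    int Total
        ( ) const;

private:
    int  total;
};

////////////////////////////////////////////////////////////////////////////////
Counter::Counter
(
)
:   total( 0 )
{
}

////////////////////////////////////////////////////////////////////////////////
void Counter::Add
(
    int  amount
)
{
    total += amount;
}

////////////////////////////////////////////////////////////////////////////////
int Counter::Total
(
)
const
{
    return total;
}
```

The empty signatures look verbose in isolation, but every definition in the file then starts with the same shape, which is the consistency the principle is after.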

I have made some recommendations in these proposals that are not specifically backed by the discussion in the previous articles. I will attempt to cover these in the next, and final, article in this series.

Monday
Oct 5, 2009

Code formatting in C++ Part Two

In the mid nineties I worked at Dr. Solomon's on their Anti-Virus toolkit. I spent some time in the virus labs working with live viruses (which I am told is the correct pluralisation). In those days viruses were mostly DOS based and attached themselves either to an exe image or a disk boot sector. Dr. Solomon's had their own scripting language for describing how a particular virus was identified, and then how it should be removed. This was significant as viruses already had plenty of sophistication with encryption and polymorphism (each infection looked different). So the guys in the lab would write code in this scripting language on a regular basis.

One thing I noticed as I did my stint in the labs was that the guys who worked there all the time didn't use any indentation. While the script was not really procedural, it did have sections and block scopes, and yet these were never being highlighted in the textual layout of the code. There was nothing in the language that prevented this, so when I wrote my own scripts I indented as I thought best and happily showed my code to the head of the lab. His appraisal?:

We don't use indentation here

I was shocked! Why would you deliberately hide the structure of the code when there was virtually no overhead in bringing it out?
Of course I knew that different people have different ideas about code formatting, but I hadn't come across such an extreme case before.

As my career progressed I learned more and more that the subject of code formatting was very delicate. Developers may grudgingly adopt a "house style" for the sake of consistency (or, increasingly commonly, just adopt the style of the source file they are editing at the time). But ask them to change what they think is best and you'll be lucky to walk away with all your teeth!

Despite this I did pay attention to my own formatting style. Rather than stick to what I'd always done, if I saw a new style I questioned myself on whether there was anything about it that gave it an advantage. If so I adopted the style. For example, when I started out I used the common style of placing spaces on the outside of parentheses, like so:

if (condition==expected)
{
    doSomething (argument);
}

Note the space before the opening (.

Then I saw someone who put the spaces on the inside:

if( condition==expected )
{
    doSomething( argument );
}

This looked really strange and I wondered why he was so keen to depart from the norm. But after a while I realised that, for me at least, the second version was easier to read. The difference was only slight, of course (or so I thought), but I found that if I was looking at a screenful of code and needed to home in on the interesting bits, having the spaces on the inside of the parentheses helped those parts of the code to come out of the screen at me. Logically, the parentheses belong to the function or keyword preceding them, whereas the arguments or expressions passed in were external and varied independently - so the use of whitespace captured that relationship.
At least that's how I see it.

After a few more years I began to think about whether any sort of objective metrics could be extracted on what aspects of code formatting style enhanced readability - independently of an individual's "preferred" (i.e., existing, often ground-in) style.

If I could find any such metrics or recommendations they would, I surmised, need to satisfy the following requirements:

  1. They would derive from objective sources that are ideally not connected with software development
  2. They wouldn't necessarily follow my own existing style (ok, this isn't a requirement - but if it diverges from my own style it's a good hint)
  3. Other people, picked at random and asked to give the style a try, would come to appreciate it - even if they objected at first

The first requirement is investigated in more detail in the first part of this series - the Speed Reading perspective.

In this article we'll look at the other two.

The best laid code of keyboards and men

Armed with the ideas I'd derived from Speed Reading I decided to tackle the issue of objectively good code formatting styles. This is not to say that the result is perfect or that it is truly objective in an absolute sense. But I do believe it has some value, not least for solving the problem of how to format function signatures consistently.

Before we look at the specifics, I'll address the second and third requirements from the previous section.

They wouldn't necessarily follow my own existing style

This is the case. Although I continue to prefer my spaces-inside-the-parentheses style, and this is compatible, and I'd already had a preference for alignment and columns, the realisation of my ideas took some of that further, as well as into unexpected directions that took some getting used to.

Other people, picked at random and asked to give the style a try, would come to appreciate it - even if they objected at first

As stated earlier, the numbers may not be statistically significant, but I have asked a number of developers with different backgrounds to give the style a fair try. An immediate problem here is that they may have given this style a fairer try than other contenders. A proper study would have introduced control styles too. Nonetheless I found the results illuminating.

Pretty much without exception (at time of writing) everyone who tried it followed the same pattern:

  1. Immediate reaction: "Ugh! That's horrible! Ok, I'll try it, but then I'm going straight back to my old style"
  2. Day 1: Much the same reaction, some regressions, but generally following the style fairly easily, despite personal feelings.
  3. Day 2: "Actually I'm starting to like it!"
  4. Day 2-3: "This is awesome, I'm going to use this style in all my code now"
  5. Day 3+: "I can't stop myself reformatting all my old code to this new style!"
  6. ...
  7. Year 3+: "meh"

We'll come back to the Year 3 effect later. Other than that the general progression is promising, to say the least. However it's by no means conclusive. In addition to the weaknesses already outlined it doesn't really tell us how effective it is (i.e. whether it has a net positive impact on productivity, beyond the initial "feel good" phase). For this I don't have any hard numbers. What I do have is my own feeling, and that of those that tried it, that code readability and navigability improved greatly.

By this point you're probably wondering if I'll ever get to describe the style itself at all. In that case you shall be pleased to know that the next thing I'll cover is just that.

In the next article. See you then!


Thursday
Sep 24, 2009

Code formatting in C++ Part One

Code formatting or layout is one of the most religious areas of software development. With so many bloody battles fought and lost, most developers learned long ago to avoid the matter altogether. They tend to do this by making consistency the only rule that matters. When in Rome, do as the Romans do. After all, everyone knows that it's all subjective and doesn't actually matter. Or does it? I'm going to present a couple of views that I hope will lead you to look at the matter once again. Some of them are a little cliched, some a little more novel. First, one of the cliches:
"code is written for humans to read - otherwise we'd all be writing assembler."
Of course high level languages are not only about human-readability. They are also about portability and economy of expression, among other things. Nonetheless human-readability is certainly a large part of it. So if we use high level languages in order to make our code more readable, should the layout of that code be irrelevant? Looked at from this perspective we may say, "it should be relevant, but code is also for other humans and it's the differences in styles that create the problems - it all balances out to zero". In the context of software development, is it worth fighting a religious war over a zero-sum game? What if there was a way to come to a consensus? Would there now be some advantage to looking at layout? If so, how much advantage? These are questions that I hope to answer shortly. In the meantime, here's another cliche:
"code formatting is just personal preference. It has no intrinsic value"
Again - is it worth fighting over something that has no value? But what if it did? Sometimes it can be useful to look at the extremes. Consider the following code:
int main(int argc,char*argv[]){printf("hello world\n")}
It's not too difficult to spot the familiar "first C++ program" example, even if it's in a less familiar layout. But are you sure it's correct? It shouldn't take too long to spot the bug (and even if you don't, the compiler will point it out to you), but now look at the same code expanded to a more canonical form:
int main(int argc, char* argv[])
{
	printf("hello world\n")
}
I would bet that, for the majority of C/C++ developers, spotting the missing ; in the second example was faster, and perhaps even more "automatic", than in the first. Even if you agree with that you may wonder, still, how much that matters. We're talking seconds, or perhaps milliseconds, difference in time. And that leads us to another cliche:
"The majority of development time is spent debugging code, rather than writing it"
This is often wheeled out when encouraging use of longer, more descriptive, identifier names, or the use of a more verbose, but explicit, way of doing something. Actually those areas are somewhat related to our topic, but here the point is: if actual coding time is a small part of overall development time, do a few milliseconds here, a second or two there, make any difference (and remember that was a fairly extreme case)? These are all good questions. I'm now going to explore some possible answers for them. What I'm going to present here is my own view, based on my own experience and research, as well as the experience of a number of others that have tried my techniques. However the "number of others" is not large enough to be statistically significant (nor were the conditions controlled), so I can't call it a study and this remains merely a theory. I encourage you to consider this material and let me know how you get on with it.

How the eyes and brain read

A few years ago I studied (or, more accurately, started studying) speed reading. A significant portion of what you learn is understanding how the eye moves across the page, takes information in, and works with the brain to turn this into the experience we know as reading. By understanding these principles we can adjust our reading style to take advantage of their strengths (and play down their weaknesses).

As I learnt more I wondered whether the same ideas could be applied to writing as well as reading. It seemed logical that if you write in a way that more closely matches how the eyes and brain read best then reading will be easier, and potentially faster. As I continued studying I found that this is the case. One commonplace example is newspaper and magazine text. This text is arranged in columns, as columnar material is easier for the eyes and brain to digest when reading. From here I began to wonder if this knowledge should affect the way we write code.

Before we look at my conclusions I'll summarise the key speed reading insights I thought would be relevant:

  • Perhaps most important is that the eyes don't move across the page in a smooth, flowing, manner. Instead they jump in discrete fixations. At each fixation the eyes transmit a chunk of information to the brain. The size and content of each chunk varies and is one of the areas that a speed reader will exercise, attempting to take in more information at each fixation to avoid wasting too much "seek time". An average, untrained (in the speed reading sense), reader will take in about 2-4 words per fixation. For longer words this will decrease, and if the words are unfamiliar it may drop below the one-word-per-fixation level.
  • Following from this is the insight that, as well as several words along the same line being taken in in a single fixation, multiple lines may be taken in.

But that's crazy talk! When you're reading you don't read the lines above and below, do you? (Unless you're regressing, which is a bad habit that speed readers try to overcome as soon as possible.) Well, speed readers will push the envelope here, but even the rest of us will still be taking in more information than we are consciously reading. The smooth, flowing, word-by-word reading experience that we perceive is an illusion. There is actually a disconnect between the information being captured by the eyes, transmitted to the brain, assembled and interpreted, and the perception of words flowing through our conscious minds.

We're in danger of getting too deep here. Let's stick with the knowledge that we can take in several words horizontally and vertically at each fixation. We can add to that another counter-intuitive nugget from the speed reading world. Good speed readers don't necessarily scan left-to-right then down the page (assuming a western reading context), but may scan in different directions according to different strategies - e.g., scanning down the page, then back to the top for the next column - even if they then have to mentally reassemble back into the original word order (I never got to this stage).

None of this addresses the question of whether this is even worth looking at. Are we trying to solve a problem that doesn't exist? We're constantly warned against premature optimisation but can that apply to our approach to code layout too? Actually we have touched on one relevant principle already. I'll highlight it again here:
The smooth, flowing, word-by-word reading experience that we perceive is an illusion
Why is this relevant? Well, for one it tells us that what we think is happening is not necessarily what is actually happening. A lot of processing is occurring before the material we are reading is even presented to us consciously. With practice we can get better at reading unoptimised material - so much so that we are unaware of it. That doesn't mean the processing isn't happening. Processing has a cost associated with it. It tires us - in particular it tires parts of the brain that we tend to use for other programming related activities too, such as solving certain types of problems - or extracting relevant details from a sea of information. It's almost like offloading some heavy number-crunching to a GPU then finding that your refresh rates are suffering.

Another factor is the concept of flow. When we are thinking about one problem and we are focused on it we are in a certain flow. The more we are then distracted by the mechanics of the task - consciously or subconsciously - the more it can knock us out of the flow. Again, we may be so used to this that we hardly notice. Pay attention next time you are stuck in a problem and you find yourself losing track the more you have to hunt around through the code.

In summary, we need to get out of the trap of just looking at the numbers (a second here, a few milliseconds there). They may bear little relation to the real factors at play. There's more we can learn from the world of speed reading and eye-brain coordination, but we now have some things to go on that can lead us to conclusions about code formatting.

Code fixation

First, the way eye fixations are able to take in multiple words both horizontally and vertically suggests that islands of related code should be readily assimilated in one or two fixations. Such islands can be created through logical grouping, and effective use of whitespace both to separate from other islands and for alignment purposes.

Alignment touches on another eye-brain insight that I've not mentioned yet. Briefly, reading speeds can be enhanced by the use of a guide. This may be a moving object (such as a finger or pencil), but just having a hard edge can be helpful too. However, too much of the same thing can make it harder to keep track of where we are, so if the hard edge is too long it loses some of its effectiveness. It follows, therefore, that small blocks with hard edges achieved through alignment should help the eye to more readily distinguish the associations between sections of code.

A lot of these are things we already do to some extent. Our use of code blocks and indentation helps visually organise code to make it easier to take in - but can we take it further? One good example is blocks of variable declarations. I'll be using C++ as an example here, as that has been my focus in this, but most of what I'm talking about applies to most, if not all, programming languages. I'd argue that you'll notice the difference in C++ more than most. So, here is a typical stretch of variable declarations:
char* txt="hello";
int i = 7;
std::string txt2 = "world";
std::vector<std::string> v;
Already this is organised into a little island. If the declarations were scattered around we would lose that aspect. Whether that makes sense for your application is another matter. I'm not suggesting you lose the benefit of declaring variables closer to where they are used. What I really want to illustrate is what happens if you add a bit of whitespace for alignment purposes:
char*                     txt  = "hello";
int                       i    = 7;
std::string               txt2 = "world";
std::vector<std::string>  v;
Hopefully the use of monospaced font here was enough to preserve the alignment for you. Do let me know if it doesn't and I'll try a different way.

Now most of us have probably seen code like this. Maybe we already prefer such a style. But quite a number of developers seem to be against it, either actively (they really dislike it), or at least have concerns over the extra overhead of writing and maintaining in this style. Well, if you're one of that number, please bear with me. There is more to get out of this yet. Also, as we'll see, I believe the big wins are actually in other areas that we're building up to.

So let's analyse the properties of this format for a moment. Firstly we have three columns here. The first column contains the types. The second the names and the third the initially assigned values, if any. One of the potential problems here is that, as each column is as wide as the longest field in that column, the more columns we have the more horizontal space we'll end up using. In C++, between templates and namespaces, this can get out of control quickly. As we'll come onto a bit more later, if you have to scroll to take in a line you'll undermine any efforts to make things more readable.

Another problem is that it amplifies the objection over the writing and maintenance overhead of such a style. In this simple example we have seven points at which whitespace must be maintained for alignment purposes! A compromise is to use just two columns:
char*                     txt = "hello";
int                       i = 7;
std::string               txt2 = "world";
std::vector<std::string>  v;
This still allows the identifier names to be easily scanned, but at some loss of clarity with respect to the initialised values. Arguably the identifier names are the most important element here (from the perspective of fast lookup), and are the most obfuscated in the original example, so this is still a big improvement. How big? We're getting to that. At this point I just wanted to present some options and examples, with a little rationale. We'll build on these shortly.
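
The two-column rule is mechanical enough to sketch in code, which may also take some of the sting out of the maintenance objection. The helper below is hypothetical (the names `Declaration` and `alignTwoColumns`, and the default gap of two spaces, are assumptions made for this example); it pads each type name to the width of the longest one and leaves the rest of the line alone.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// A declaration split into its type and the remainder of the line.
typedef std::pair<std::string, std::string>  Declaration;

////////////////////////////////////////////////////////////////////////////////
std::vector<std::string> alignTwoColumns
(
    const std::vector<Declaration>&  declarations,
    std::size_t                      gap = 2
)
{
    // The first column is as wide as the longest type name.
    std::size_t  widest = 0;
    for( std::size_t i = 0; i < declarations.size(); ++i )
        widest = std::max( widest, declarations[i].first.size() );

    std::vector<std::string>  lines;
    for( std::size_t i = 0; i < declarations.size(); ++i )
    {
        const std::string&  type = declarations[i].first;
        std::size_t         padding = widest - type.size() + gap;

        std::string  line = type;
        line.append( padding, ' ' );
        line += declarations[i].second;
        lines.push_back( line );
    }
    return lines;
}
```

Note that only one padding value per line is computed, compared with two per line for the three-column layout.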

For sake of arguments

Blocks of variable declarations are common, but spreading them out through a function is perhaps even more common - making the above examples less relevant. However there are a couple of places where we do still regularly see groups of related variable declarations in one place. One is in class definitions, where member variables are usually grouped in one place. Immediate readability of the names is probably even more important here as we tend to flip back to a class definition in a header file (in the case of C++) just for a moment to get the names. But the other place, and where I'd like to focus further, is function parameter lists.

Parameter lists are obviously important. They define the interface between the function (or method - I'll use function here to mean both) and its caller. When looking at a function signature (usually at the prototype in a header file) a caller can see what arguments need to be passed in. However, when looking at the function body the parameter list shows you what has been passed in. If these two statements sound obvious then why do we so often neglect how these things are presented - as if they are second class citizens in our code?

How often have you seen (or written) a function that takes some number of parameters, where the parameter list is all on one line and spans more than a typical screen-width? - sometimes several screen-widths! Often an attempt is made to rectify the situation by splitting the list over one or more lines, sometimes even one parameter per line (but by no means always). Even then the tendency is to place the first parameter on the same line as the function name, then try to line up the subsequent parameters with the first. Something like this:
void SomeNamespace::SomeDescriptiveClassName::LongishMethodName( int firstParameter,
                                                                 const std::string& secondParameter,
                                                                 const std::vector<std::string>& thirdParameter,
                                                                 Widget fourthParameter );
Does this look familiar? Although this might look like an extreme example, I don't know about you but I see this sort of code all the time. And remember - this is where someone has made an attempt to split across lines and use alignment! Just today I saw an example on my current project of a function signature that was 239 characters wide - and that was by no means the worst case in the project. And we haven't even added namespaces yet (except for std)! I think you'll agree that this is a big readability issue.

The issue is made worse by a lack of consistency which often accompanies such attempts. Sometimes multiple parameters are on one line, sometimes split across several. Sometimes aligned, sometimes not. Even if the developer is following other conventions consistently this one seems to slip through the cracks - probably because it is often not defined and it can be difficult to know exactly what to do in a consistent way.

I have a theory about this state of affairs. It's an important theory (even if it turns out not to hold in this case) because it touches on a principle of why people get so religious about code formatting in the first place. I'm going to expand on that a bit later, but for now the theory is this:
Developers can't find a consistent style because the most obvious consistent style seems somehow "wrong"
The underlying principle, which I'll come back to, is:
Regardless of what is objectively good, what is familiar always wins out
So what is the one true way when it comes to formatting function signatures? Well, apart from "one true way" being an overstatement, I'm going to leave my specific recommendation for a subsequent article (how else could I get you to come back?).

End of scope

Before I finish for now I want to come back to the issue of how important this all is in the first place. We touched on three areas that I believe are worthy of consideration:
  • The overhead of reading code "unoptimised" for reading may be more significant than we realise due to the subconscious processes that are at play - not just in time, but in energy and focus.
  • Even relatively small interruptions to our state of flow can have a noticeable impact on our productivity. Sometimes it can even become a serious bottleneck - think "butterfly effect".
  • Having a consistent style may actually speed up code writing, but optimising that style for the way our eye-brain connection works can lead to significant improvements in code reading and comprehension speed.
In the next articles I will present my recommendation for function signature formatting and return once again to the question of how much difference it makes, including some anecdotal evidence.