Reading Humongous Files in Go


A real-world performance comparison between Node.js and Go for file parsing unsurprisingly turns out in Go’s favor — but how much?

This blog post was first published on Medium, available via CloudBoost.

Part One

Being fairly new to Go aka. Golang, I was itching to try it out on a real-world problem after having done numerous tutorials and CodeWars katas (don’t know CodeWars? Try it out, it’s really cool). At my workplace, my boss just so happened to find himself in a situation where he needed to parse a humongous XML file — a perfect opportunity to show off my new Golang skills!

The file in question is a UTF-8 encoded XML file from DMR, the Danish Department of Motor Vehicles, containing a more or less complete list of all motor vehicles ever registered in Denmark. We would be downloading this 70 GB file once per week, parsing it and updating our internal vehicle list accordingly.

The first attempt at parsing this file was done in TypeScript/Node.js, just because it’s the language we were using in our daily web development. It was obvious to me that this simply would not do — parsing a file of this size with an interpreted language got to be way slower than an identical algorithm implemented in a compiled language, right? Not only was this a chance to build something in Go, it was also a test to see if Go would be faster than Node.js for this kind of task. So, let’s see how this turned out, shall we?

First, I’d like to emphasize that this is essentially a speed test. Various techniques could be implemented to speed up both the TypeScript and the Go code — such as dividing the work and processing the XML file using asynchronous functions (in Node.js) and channels (in Go). That could be an interesting exercise for another time. For now, we’ll stick to sequential processing.

Both implementations read the XML file line by line, in a loop, using a file reading stream (in Node.js) and a line scanner (in Go). They look for a certain vehicle type, and if found, extract the brand name and model name (eg. Peugeot 208), stores it in a map structure where each combination is kept unique, and finally output all the results to a CSV file in order of appearance, ie. without spending any extra time sorting the list.

Both tests were run from the command line, on the same MacBook Pro, sporting MacOS High Sierra, a 2.5 Ghz Intel Core i7 CPU and 16GB DDR3 RAM. Language versions used were: Node.js 8.9.4 and Go 1.9.2. For both implementations, only built-in packages were used, no third-party stuff.

The Results

As witnessed by the console output below, the Node.js implementation took more than 28 minutes to complete:

> time node lib/app.js
Processing file ended. Lines processed: 1210218597 
real 28m19.509s
user 26m27.223s
sys 1m15.207s

And here are the timings for the Go implementation:

> time dmr-parser — parser=string — infile=full.xml 
Done — CSV data written to out.csv
real 4m14.409s
user 3m4.919s
sys 0m55.094s

The Go implementation processed the same amount of data in little more than 4 minutes!

This is a more than 85% increase in performance over the Node.js implementation, a clear win for Go when it comes to pure file processing power.

Before finishing up this comparison, a few comments are in order. My first implementation in Go is an interesting aspect of the comparison because I initially took a different approach. Instead of doing the same string matching as in the Node.js implementation, I started out by using Go’s “encoding/xml” package to parse the XML into a data structure that I subsequently ran through the vehicle filtering. This was significantly slower — a factor of 2, in fact — than the Node.js implementation. But keep in mind that this would have been an unfair comparison due to the difference in parsing techniques. Such implementations using a proper XML package in both languages are likely to turn into a comparison of package implementations rather than a comparison of pure processing power and might be misleading. I feel that the raw string matching implementations provide the most accurate foundation for a direct comparison, although a comparison of XML parsers would also be an interesting endeavor.

Finally, this comparison highlights a point that I’ve heard many times before, but might be worth repeating: pick the right tool (or language) for the job, not just the one that happens to be your favorite at the time.

Part Two

As pointed out by several commenters on Reddit and Medium (thanks, guys!), the obvious next step would be to determine what could be gained by taking advantage of Go’s tour de force: go routines.

Let me first point out that this will not be a concurrency battle between Go and Node.js. Such a battle has to be orchestrated carefully in order to be a concurrency comparison rather than a comparison of XML libraries, which would have been the case here. And that was never the point of this write-up. If you, dear reader, should happen to discover (or perform) such a comparison, I’d love to hear about it.

My first attempt at a concurrent implementation didn’t pan out, due to my unmarshaling approach — each call to xml.Decoder.Token() returned a reference to a token that would only be valid until the next call, and I needed it to be valid while a worker routine was processing it. After all, what’s the point of delegating work to a pool of workers if they can only chew on the problem one at a time? I finally gave up on that approach when I realized that I could instead utilize the line scanner to extract the correct XML excerpt, and then pass it on to a worker that could then run xml.Unmarshal() on it. This change rewarded me with a greatly simplified implementation — added benefit! It also highlights a point about concurrency: making sure that your work can be divided up properly.

The implementation starts up x workers, where x is configurable. Each worker receives a XML excerpt representing exactly one vehicle stat, parses it and then returns a CSV-delimited string with the findings. The job of making sure that the vehicle list is unique has been redelegated to the main go routine, which simply collects results as they come in, and adds them to a map structure. Finally, the map is iterated and values are written to a CSV file. This overhead is the same for both parsers and therefore shouldn’t impact the performance measurements.

As my testing rig has four CPU cores, and one go routine is dedicated to scanning the original XML file, I’ve decided that the worker count should be a factor of three. This may or may not make sense for you — I’d love to hear some opinions on that. The basic idea is that three workers will provide an optimal environment with little to no task switching necessary on the cores themselves, and that higher factors would provide at least an even distribution of work across cores.

Results Are In!

Let’s get to it, shall we? To set the stakes, I know from my first string parser implementation that it would take four minutes to parse that 70 GB XML file line-by-line, and since we’re still doing that, no solution will beat that time. I ran the two parsers three times each, with three, six and twelve workers. Instead of boring you with CLI evidence from each run, here’s a table with the summary:

Off the bat, we notice that the string parser has actually slowed down somewhat. This was expected. After all, the non-concurrent implementation extracted the required vehicle data in-place, as each line was processed. The concurrent version unloads an excerpt to a worker that iterates over it again and extracts the data. But probably more significantly, the introduction of channels also introduces locking and waiting time where there previously was none. There’s simply not enough work in the string parser to warrant the extra complexity.

The results for the XML parser is a lot more interesting, though. 22–23 minutes to unmarshal that huge hunk of XML — Wow!

As I regrettably didn’t emphasize in the first part of this write-up, the first non-concurrent token-based XML parser that I did in Go roughly took an hour to complete — around 57 minutes, if I remember correctly. The concurrent implementation cuts this number in more than half, by a factor of 2.5. That’s a 60% performance increase over the non-concurrent implementation, and a 26% performance increase over the string-parsing Node.js implementation! That last part is interesting, because I am using a slower mechanism (XML unmarshaling) than the Node.js solution (string parsing), which allows me to take better advantage of concurrent execution, so it ends up being faster.

Note also that adding extra workers beyond the initial three didn’t help — on the contrary, they added up to 1m11s to the execution time.

To conclude, raw file parsing is significantly faster in Go, compared to Node.js. Also, concurrency has real benefit when there is non-trivial work to distribute. And finally, adding more workers to a problem does not reduce processing time equivalently, there’s a balance to be struck.

Code

The Code for Part One

Note that minor parts of the code has been omitted to keep it concise.

Go implementation:

The Code for Part Two

Concurrent main routine:

Concurrent string parser:

Concurrent XML parser: