Need for Speed
After writing Ox, a fast XML parser, a fast JSON parser seemed like the next project to tackle. The goal was to come up with a parser that was as fast as possible while still providing a consistent and understandable API. The resulting JSON parser, Oj, was so much faster than any of the other JSON parsers that it seemed worth describing how those results were achieved.
Iterative Design and Development
The JSON gem set the standard for Ruby JSON handling, but its API was not designed for high performance. The first implementation of the Oj gem was generally compatible with the existing JSON and Yajl gems. Some additional Object marshaling was added as well, which made the Oj gem more complete, but parsing was only about twice as fast as Yajl and JSON::Ext. Generating JSON was much faster.
To push the envelope further, API changes were considered. The bottleneck of the JSON API is that it forces more use of Ruby code instead of C extensions. Making calls to Ruby Objects to encode themselves is expensive. The automatic marshaling of Ruby Objects eliminates this problem, but the JSON API still relies heavily on Ruby Objects. Even staying with simple Array and Hash Objects incurs an overhead when parsing a JSON document.
Going back to basics, a JSON document can be viewed as a self-contained entity. There is no need to parse it into Arrays and Hashes. Instead it can remain a document with accessors that make getting at the data within it easy and fast. Moving JSON document navigation away from Ruby Arrays and Hashes eliminates the overhead of creating Objects that are only necessary for providing structure and organization for the leaf values in the document. That organization can be provided with C structures, and the overhead of Array and Hash creation and manipulation removed. Prototyping verified this hypothesis.
The next most significant overhead came from managing the parser object itself. Memory used by the C structures needs to be managed. The suggested approach is to use the Data_Wrap_Struct call and provide a memory cleanup routine, but registering the memory cleanup function is very expensive. Memory management had to be moved away from the normal Ruby GC. Treating the JSON document as a file that gets opened and closed took care of this. A better approach is to use callback procedures, which hide the memory management completely.
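The two styles described above, explicit open and close versus a callback that hides cleanup, can be sketched in plain Ruby. This is only an illustration under stated assumptions, not the Oj implementation: `SketchDoc` and its methods are hypothetical names, and a simple flag stands in for the C memory that a real extension would release.

```ruby
# Hypothetical stand-in for a parsed document that owns native memory.
class SketchDoc
  attr_reader :text

  def initialize(text)
    @text = text
    @open = true
  end

  def open?
    @open
  end

  # Explicit cleanup, like closing a file.
  def close
    @open = false
    @text = nil
  end

  # Block form: cleanup runs when the block returns or raises,
  # so the caller never has to remember to close the document.
  def self.open(text)
    doc = new(text)
    return doc unless block_given?

    begin
      yield doc
    ensure
      doc.close
    end
  end
end

leaked = nil
result = SketchDoc.open('[1,2,3]') do |doc|
  leaked = doc      # keep a reference only to show that it gets closed
  doc.text.length   # the block's value is returned to the caller
end
```

Because the document lives only for the duration of the block, its lifetime is known in advance, which is what makes it possible to take cleanup away from the garbage collector.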
By using callback procedures, another memory management option became available. Memory could be allocated on the stack instead of on the heap. This gives a further performance improvement.
From experience, many users only need a portion of the data in a JSON document. Deferred or lazy conversion of JSON elements to Ruby values minimizes the time needed to get the information a user wants from a document.
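Deferred conversion can be illustrated with a small pure-Ruby sketch. It is not how Oj does this internally (that work happens in C); `LazyLeaf` is a hypothetical class that keeps the raw JSON token and converts it to a Ruby value only when first asked, and it assumes strings contain no escape sequences.

```ruby
# Hypothetical leaf that stores the raw token and converts it on first use.
class LazyLeaf
  attr_reader :raw, :conversions

  def initialize(raw)
    @raw = raw
    @converted = false
    @conversions = 0   # counts how often real conversion work was done
  end

  def value
    return @value if @converted

    @conversions += 1
    @converted = true
    @value = convert(@raw)
  end

  private

  def convert(token)
    case token
    when 'null' then nil
    when 'true' then true
    when 'false' then false
    when /\A-?\d+\z/ then Integer(token, 10)
    when /\A-?\d+(\.\d+)?([eE][-+]?\d+)?\z/ then Float(token)
    else token[1..-2] # assumption: a quoted string without escapes
    end
  end
end

leaves = %w[1 2.5 true null].map { |t| LazyLeaf.new(t) } + [LazyLeaf.new('"hi"')]
wanted = leaves[1].value   # only this one leaf is converted
```

Leaves that are never read never pay the conversion cost, which is the saving the paragraph above describes.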
Model
With performance at 20 times faster than any other JSON parser, it was time to shift to providing a reasonable model that pulls it all together for a user of the gem. The model that fits best is that the JSON document is a structure that can be opened and navigated to get the information desired. A position marker keeps track of the current position in the structure. That marker can be moved up, down, and across the JSON structure. As with a file system or an XML document, a path describes where the marker is moved to. Every node has a type, and a leaf node also has a value. At any given node the type can be asked for without getting the value. If a value is asked for at a non-leaf node, an Array or Hash will be created. This is not the optimal way to get leaf values, though, which is why the type can be retrieved separately from the value.
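The distinction between asking for a type and asking for a value can be sketched as follows. This is a pure-Ruby illustration of the idea, not the gem's code: `Node`, `kind`, and `payload` are hypothetical names standing in for the C structures that hold the document.

```ruby
# Hypothetical node: kind is :array, :hash, or :leaf.
Node = Struct.new(:kind, :payload) do
  # Cheap: answered from the node's tag, no containers are built.
  def type
    case kind
    when :array then Array
    when :hash then Hash
    else payload.class
    end
  end

  # Costly for non-leaves: recursively allocates a Ruby Array or Hash.
  def value
    case kind
    when :array then payload.map(&:value)
    when :hash then payload.transform_values(&:value)
    else payload
    end
  end
end

doc = Node.new(:hash, {
  'name' => Node.new(:leaf, 'oj'),
  'tags' => Node.new(:array, [Node.new(:leaf, 1), Node.new(:leaf, 2)])
})

tags = doc.payload['tags']
tags_type = tags.type    # inspects the tag only
tags_value = tags.value  # builds a new Ruby Array
```

Checking `type` first lets a caller decide whether to descend further or fetch a leaf, without materializing intermediate Arrays and Hashes.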
The path that describes the location of a leaf can be either absolute from the top of the JSON document or relative to the current position in the JSON structure. An XPath-like syntax is used. JSONPath was considered, but it was discarded because it does not support relative paths up the tree and because, although it is also based on XPath, it redefines the syntax considerably. A subset of XPath is used instead.
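How absolute and relative paths combine with the current position can be shown with a short helper. This is a sketch of the general XPath-style rules (`/` for the root, `..` for the parent, `.` for the current node), not a statement of the exact syntax the gem accepts; `resolve` is a hypothetical function.

```ruby
# Resolve an XPath-like path against the current position.
# A leading '/' makes the path absolute; otherwise it is relative.
def resolve(current, path)
  parts = path.start_with?('/') ? [] : current.split('/').reject(&:empty?)

  path.split('/').each do |step|
    case step
    when '', '.' then next       # empty step or "stay here"
    when '..' then parts.pop     # move up one level (no-op at the root)
    else parts << step           # move down to a child
    end
  end

  '/' + parts.join('/')
end

sibling  = resolve('/one/2', '../1')   # up to /one, then down to its child 1
absolute = resolve('/one/2', '/two')   # current position is ignored
child    = resolve('/one', 'a/b')      # relative descent
```

The `..` step is the piece the paragraph above says JSONPath lacks: it lets a path move back up the tree from wherever the marker currently sits.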
Summary
The Oj gem using the Oj::Doc parser is 20 times faster than either Yajl or JSON::Ext. Even when accessing every leaf value in a JSON document it is still over 10 times faster. This demonstrates the need to consider all options, including the API, when aiming for high performance solutions.
The approach worked so well for the Oj gem that the Ox XML parser gem will have to go through a similar performance optimization in the near future.
Note: Tests were run on a 2.8 GHz iMac with a Core i7 CPU and Mac OS X 10.6.8 with Ruby 1.9.3-p125.