XML with Ruby

peter@ohler.com

Author: Peter Ohler Published: Sep 21, 2011

XML is a well-documented, widely used, and well-supported format for encoding data. It is heavily used on the web and with languages such as Java, and the format is stable. The downside is that it is rather verbose, but that verbosity is also what makes it readable by humans as well as machines. Support for XML in almost every language makes it extremely portable.

Until recently, using XML with Ruby was not a high-performance option. The two most common and highest-performing Ruby gems were Nokogiri and LibXML. Both provide parsing into a Ruby object representation of an XML document as well as a SAX-like stream parser. The Ox gem was created to address the need for a more optimized XML parser so that the advantages of XML could be made available in Ruby without suffering a performance impact.

Performance tests comparing Nokogiri, LibXML, and Ox were run under various conditions. The results, along with the differences in how the APIs are used, are as follows.

In Memory

It is often easiest to work with an XML document as a single in-memory object. One mode supported by all three Ruby XML gems is parsing either a String or a file into an in-memory representation of the parsed document. The first performance tests were run against these APIs for both parsing and dumping back to XML.

XML Document Parsing to an Object

doc = Nokogiri::XML::Document.parse(xml) # Nokogiri

doc = LibXML::XML::Document.string(xml)  # LibXML

doc = Ox.parse(xml)                      # Ox
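
The repeated-run timings reported here can be approximated with Ruby's standard Benchmark module. The following is a minimal sketch, not the perf_gen.rb script itself: the helper name time_iterations and the stand-in workload are hypothetical, and a real comparison would put Ox.parse(xml), or the Nokogiri or LibXML equivalent, inside the block.

```ruby
require 'benchmark'

# Hypothetical helper: run the block `iterations` times and return the
# total elapsed wall-clock time in seconds.
def time_iterations(iterations)
  Benchmark.realtime do
    iterations.times { yield }
  end
end

# Stand-in workload so the sketch runs without any XML gem installed.
# In a real comparison the block body would be, for example, Ox.parse(xml).
xml = '<top><child name="a"/></top>'
calls = 0
elapsed = time_iterations(1000) do
  calls += 1
  xml.scan(/<\w+/) # placeholder work, not a real XML parse
end

puts format('Stand-in workload 1000 times took %.6f seconds.', elapsed)
```

Timing the whole loop rather than each call keeps the measurement overhead small compared to the work being measured.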

Test results taken from perf_gen.rb in the Ox project on GitHub are:

Parsing 1000 times with Ox took 1.895335 seconds.

Nokogiri parse 1000 times took 3.555163 seconds.

LibXML parse 1000 times took 3.668447 seconds.


>>> Ox is 1.9 faster than Nokogiri parsing.

>>> Ox is 1.9 faster than LibXML parsing.

Parsing Results

XML Document to XML String

xml = doc.to_xml(:indent => 2)   # Nokogiri

xml = doc.to_s()                 # LibXML

xml = Ox.dump(doc, :indent => 2) # Ox

Test results taken from perf_gen.rb in the Ox project on GitHub are:

Ox dumping 1000 times with ox took 0.333532 seconds.

Nokogiri to_xml 1000 times took 7.036567 seconds.

LibXML to_s 1000 times took 0.668848 seconds.


>>> Ox is 21.1 faster than Nokogiri to_xml.

>>> Ox is 2.0 faster than LibXML to_xml.

To XML Results

Summary

Gem        Parse          Write to XML
Nokogiri   3.56 seconds   7.04 seconds
LibXML     3.67 seconds   0.67 seconds
Ox         1.90 seconds   0.33 seconds

Ox was clearly faster at both parsing and writing XML than Nokogiri and LibXML. Nokogiri was exceptionally slow at writing the XML document, taking about 21 times as long as Ox.

Nokogiri and LibXML provide more functionality beyond parsing and writing, such as XPath support and other ancillary features. For raw performance, however, Ox was far better.
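
The "times faster" figures in the test output are simply the other gem's elapsed time divided by Ox's elapsed time. As a small worked check using only the timings reported above (the variable and method names are illustrative, not taken from perf_gen.rb):

```ruby
# Elapsed seconds reported above for 1000 iterations of each operation.
parse_times = { 'Ox' => 1.895335, 'Nokogiri' => 3.555163, 'LibXML' => 3.668447 }
dump_times  = { 'Ox' => 0.333532, 'Nokogiri' => 7.036567, 'LibXML' => 0.668848 }

# Speedup of Ox over another gem: the other gem's time divided by Ox's time,
# rounded to one decimal place as in the reported output.
def speedup(times, gem_name)
  (times[gem_name] / times['Ox']).round(1)
end

%w[Nokogiri LibXML].each do |gem_name|
  puts "Ox is #{speedup(parse_times, gem_name)} times faster than #{gem_name} parsing."
  puts "Ox is #{speedup(dump_times, gem_name)} times faster than #{gem_name} writing."
end
```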

SAX Parsing

For large XML documents or for streaming IO, a SAX-like callback parser is more appropriate. Again, all three Ruby XML gems support SAX-like parsing. Two test modes were employed for SAX parsing. Since the whole idea behind a callback parser is to process only the parts of the document of interest to the application, a minimal validation parse was tested first, followed by a more comprehensive test with stubs for all callbacks.

Validation SAX Parsing

Nokogiri

class NoSax < Nokogiri::XML::SAX::Document

  def error(message); puts message; end

  def warning(message); puts message; end

end

handler = Nokogiri::XML::SAX::Parser.new(NoSax.new())

start = Time.now

$iter.times do

  input = StringIO.new($xml_str)

  handler.parse(input)

  input.close

end

$no_time = Time.now - start

LibXML

class LxSax

  include LibXML::XML::SaxParser::Callbacks

end

start = Time.now

$iter.times do

  input = StringIO.new($xml_str)

  parser = LibXML::XML::SaxParser.io(input)

  # LxAllSax (not shown here) is the handler used for the all-callbacks test

  parser.callbacks = $all_cbs ? LxAllSax.new() : LxSax.new()

  parser.parse

  input.close

end

$lx_time = Time.now - start

Ox

class OxSax < ::Ox::Sax

  def error(message, line, column); puts message; end

end

start = Time.now

handler = OxSax.new()

$iter.times do

  input = StringIO.new($xml_str)

  Ox.sax_parse(handler, input)

  input.close

end

$ox_time = Time.now - start

Test results taken from perf_sax.rb in the Ox project on GitHub are:

A 1000 KByte XML file was parsed 100 times for this test.

File IO SAX parsing 100 times with Ox took 0.369009 seconds.

File IO SAX parsing 100 times with Nokogiri took 14.637185 seconds.

File IO SAX parsing 100 times with LibXML took 4.913712 seconds.


>>> Ox is 39.7 faster than Nokogiri SAX parsing using file IO.

>>> Ox is 13.3 faster than LibXML SAX parsing using file IO.

SAX Validate Results

In the comprehensive callback tests, with all callback methods defined, the results are:

A 1000 KByte XML file was parsed 100 times for this test.

File IO SAX parsing 100 times with Ox took 1.456263 seconds.

File IO SAX parsing 100 times with Nokogiri took 14.670855 seconds.

File IO SAX parsing 100 times with LibXML took 5.272377 seconds.


>>> Ox is 10.1 faster than Nokogiri SAX parsing using file IO.

>>> Ox is 3.6 faster than LibXML SAX parsing using file IO.

SAX All CBs Results

Summary

Gem        Validate       All CBs
Nokogiri   14.6 seconds   14.7 seconds
LibXML     4.91 seconds   5.27 seconds
Ox         0.37 seconds   1.46 seconds

Ox was significantly faster than Nokogiri and LibXML. The tests highlight another factor that comes into play with Ox. Neither LibXML nor Nokogiri gains any performance advantage from ignoring uninteresting parts of an XML document. Comments, for example, are always processed. With Ox, only the parts of the XML document that are of interest have to be processed in Ruby. This made a big difference in performance between the validation SAX parsing and the comprehensive SAX parsing tests: Ox took 0.37 seconds with a minimal handler and 1.46 seconds with every callback defined. Even with a full set of callbacks, Ox is still many times faster than both Nokogiri and LibXML.
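
One way to picture why unused callbacks can cost nothing is a dispatcher that checks once which callbacks a handler defines and never calls into Ruby for the rest. The sketch below is a toy illustration of that idea in plain Ruby, not Ox's actual implementation; the class names and the pre-tokenized event list are hypothetical.

```ruby
# Toy event dispatcher: events are delivered only to callbacks the handler
# actually defines, so an empty handler costs no Ruby method calls.
class ToyDispatcher
  CALLBACKS = %i[start_element end_element text comment].freeze

  attr_reader :calls

  def initialize(handler)
    @handler = handler
    # Decide once, up front, which callbacks are of interest.
    @wanted = CALLBACKS.select { |name| handler.respond_to?(name) }
    @calls = 0
  end

  # events is an array of [callback_name, argument] pairs.
  def run(events)
    events.each do |name, arg|
      next unless @wanted.include?(name)

      @handler.public_send(name, arg)
      @calls += 1
    end
  end
end

# Handler that only cares about element starts.
class ElementCounter
  attr_reader :count

  def initialize
    @count = 0
  end

  def start_element(_name)
    @count += 1
  end
end

# Handler with no callbacks at all, similar in spirit to the validation test.
class Silent; end

# Hypothetical, pre-tokenized events for: <a><!-- note --><b>hi</b></a>
events = [
  [:start_element, :a], [:comment, ' note '], [:start_element, :b],
  [:text, 'hi'], [:end_element, :b], [:end_element, :a]
]

counter = ElementCounter.new
counting = ToyDispatcher.new(counter)
counting.run(events)

silent = ToyDispatcher.new(Silent.new)
silent.run(events)

puts "ElementCounter: #{counting.calls} Ruby calls, #{counter.count} elements"
puts "Silent: #{silent.calls} Ruby calls"
```

In this toy, the six events produce two Ruby calls for the element counter and none for the empty handler, which mirrors the gap between the validation and all-callbacks timings.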

Note: Tests were run on a 2.8 GHz iMac with a Core i7 CPU and Mac OS X 10.6.8.