Tuesday, October 09, 2012

Serializing data

Let's say you are developing Java code, and you have large objects - Maps and Lists - to read through in order to compute some result.
The problem that prompted this comparison was loading into memory all the data and coefficients needed for a tidal computation. Before doing any work, you have to read those data from somewhere and load them into working memory, so that they are available when you need them.
Several technologies can do this; the question is which one to pick. We will compare XML, Java Serialization, and JSON (JavaScript Object Notation). Notice that we are not talking about databases here, although that is certainly a possible approach. The rationale was to keep the solution as light and self-contained as possible, and the data are read-only (they are never modified). We felt that, in this case, a database was overkill.
XML itself can be approached in different ways: DOM, SAX, or StAX. As we do not need to modify any of the data we load, we will use SAX (Simple API for XML), which streams through the document and reports each element as an event instead of building a tree in memory.
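To make the event-driven idea concrete, here is a minimal sketch of a SAX handler, independent from the tide code. The class name SaxSketch, the method stationNames and the tiny XML document are made up for the illustration; only the station element and its name attribute echo the data discussed in this post.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of the tide project.
class SaxSketch
{
  // Collects the "name" attribute of every <station> element, in document order.
  static List<String> stationNames(String xml) throws Exception
  {
    final List<String> names = new ArrayList<>();
    DefaultHandler handler = new DefaultHandler()
    {
      @Override
      public void startElement(String uri, String localName, String qName, Attributes attributes)
      {
        // Called once per opening tag; nothing is kept in memory unless we store it ourselves.
        if ("station".equals(qName))
          names.add(attributes.getValue("name"));
      }
    };
    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    parser.parse(new InputSource(new StringReader(xml)), handler);
    return names;
  }

  public static void main(String[] args) throws Exception
  {
    // Made-up sample document, for illustration only.
    String xml = "<stations><station name=\"Brest\"/><station name=\"Ocean Beach\"/></stations>";
    System.out.println(stationNames(xml));
  }
}
```

The parser never hands you the whole document; it only calls back as it goes, which is why the handler has to know what it is looking for.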
All the required data are stored in files, which have to be loaded into objects at runtime.
So, to recapitulate before we get started, we will restrict our comparison to those three approaches:
  • XML, using a SAX parser to load the data
  • Java Serialization/Deserialization
  • JSON Serialization/Deserialization
In order to compare what is comparable (apples to apples), we stored the data files in an archive, a zip in this case.
Those data were then deserialized into the exact same objects.
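For readers who have not read data straight out of a zip before, here is one possible way to do it with java.util.zip. This is only a sketch: the class ZipEntrySketch and its helpers are hypothetical, and it is not necessarily how the tide code opens its own archives.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical example class, not part of the tide project.
class ZipEntrySketch
{
  // Returns a stream positioned on the named entry, or null if the archive has no such entry.
  static InputStream openEntry(InputStream archive, String entryName) throws IOException
  {
    ZipInputStream zip = new ZipInputStream(archive);
    ZipEntry entry;
    while ((entry = zip.getNextEntry()) != null)
    {
      if (entry.getName().equals(entryName))
        return zip; // read() reports end-of-stream at the end of this entry
    }
    zip.close();
    return null;
  }

  // Reads a stream to the end and decodes it as UTF-8 text.
  static String readAll(InputStream in) throws IOException
  {
    ByteArrayOutputStream content = new ByteArrayOutputStream();
    byte[] buffer = new byte[4096];
    int n;
    while ((n = in.read(buffer)) != -1)
      content.write(buffer, 0, n);
    return new String(content.toByteArray(), StandardCharsets.UTF_8);
  }

  // Builds a one-entry archive in memory, so the sketch runs without any file on disk.
  static byte[] zipOf(String entryName, String text) throws IOException
  {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ZipOutputStream out = new ZipOutputStream(bytes))
    {
      out.putNextEntry(new ZipEntry(entryName));
      out.write(text.getBytes(StandardCharsets.UTF_8));
      out.closeEntry();
    }
    return bytes.toByteArray();
  }

  public static void main(String[] args) throws IOException
  {
    byte[] archive = zipOf("stations.json", "{\"stations\":[]}");
    try (InputStream in = openEntry(new ByteArrayInputStream(archive), "stations.json"))
    {
      System.out.println(readAll(in));
    }
  }
}
```

The stream returned by openEntry can be handed to any of the three deserializers, which is what keeps the comparison fair: the unzip step is the same for all of them.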
The archive containing the XML data is 1,546,860 bytes.
The archive containing the Java Serialized (.ser) data is 1,772,159 bytes.
The archive containing the JSON data is 2,543,174 bytes.
We ran each deserialization 5 times and averaged the elapsed times.

Here is the code:
  public static void main(String[] args) throws Exception
  {
    long[] elapsed = { 0L, 0L, 0L };
    long before = 0L, after = 0L;
    BackEndTideComputer.setVerbose(false);
    for (int i=0; i<5; i++)
    {
      before = System.currentTimeMillis();
      BackEndTideComputer.connect(BackEndTideComputer.XML_OPTION);
      after = System.currentTimeMillis();
      elapsed[0] += (after - before);
      BackEndTideComputer.disconnect();
      
      before = System.currentTimeMillis();
      BackEndTideComputer.connect(BackEndTideComputer.JAVA_SERIALIZED_OPTION);
      after = System.currentTimeMillis();
      elapsed[1] += (after - before);
      BackEndTideComputer.disconnect();
  
      before = System.currentTimeMillis();
      BackEndTideComputer.connect(BackEndTideComputer.JSON_SERIALIZED_OPTION);
      after = System.currentTimeMillis();
      elapsed[2] += (after - before);
      BackEndTideComputer.disconnect();
    }
    System.out.println("XML:" + Long.toString(elapsed[0] / 5) + " ms" +
                     "  JavaSer:" + Long.toString(elapsed[1] / 5) + " ms" +
                     "  json:" + Long.toString(elapsed[2] / 5) + " ms");
  }
... and here are the results:
 XML:3540 ms  JavaSer:5455 ms  json:4225 ms
If we adjust those results to take the data file size into account, by expressing them in milliseconds per megabyte, we come up with the following table:

                         XML         ser         json
archive size in bytes    1,546,860   1,772,159   2,543,174
elapsed time in ms       3,540       5,455       4,225
average ms/MB            2,340       3,228       1,742
The first to cross the line is the SAX XML parser (I used the one from Oracle).
The best average (ms/MB) goes to JSON (I used Gson, the JSON parser from Google), but notice that the data we started from are the biggest.
Surprisingly, Java Serialization comes in last on both counts...
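The per-megabyte figures can be recomputed from the sizes and timings given above. The sketch below assumes that one megabyte means 1,048,576 bytes; the class and method names are made up for the example. Under that assumption it reproduces the ser and JSON values of the table, and gives about 2,400 for XML instead of the 2,340 shown, so that entry may come from a different rounding or a different run.

```java
// Hypothetical example class, not part of the tide project.
class ThroughputSketch
{
  // Assumption: 1 MB = 1024 * 1024 bytes.
  static final double BYTES_PER_MB = 1024.0 * 1024.0;

  // Elapsed milliseconds per megabyte of archive, rounded to the nearest integer.
  static long msPerMb(long elapsedMs, long archiveBytes)
  {
    return Math.round(elapsedMs / (archiveBytes / BYTES_PER_MB));
  }

  public static void main(String[] args)
  {
    // Timings and sizes are the ones reported in this post.
    System.out.println("XML:  " + msPerMb(3540, 1_546_860) + " ms/MB");
    System.out.println("ser:  " + msPerMb(5455, 1_772_159) + " ms/MB");
    System.out.println("json: " + msPerMb(4225, 2_543_174) + " ms/MB");
  }
}
```

Either way, the ranking does not change: JSON has the best rate, Java Serialization the worst.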

A note about JAXB
We have not talked about JAXB here, which stands for Java Architecture for XML Binding. It sounds like it could deserve its place in this article... but it actually does not. The JAXB process starts from an XML Schema, which we don't have in this case (OK, we could have built one, but we did not). This article talks about it.

So, we've been able to compare three approaches, and the "native" one seems to be the worst, which is a bit of a surprise.
This is probably enough to get a better idea of which technology to use for this kind of problem, but as a bonus, let us take a look at the amount of code required to put each of them to work.

Java Deserializer

  // Needs: java.io.InputStream, java.io.ObjectInputStream
  public static <T> T loadObject(InputStream resource, Class<T> cl) throws Exception
  {
    // try-with-resources closes the stream even if readObject() fails
    try (ObjectInputStream ois = new ObjectInputStream(resource))
    {
      // cl.cast() replaces the unchecked (T) cast and fails fast on a type mismatch
      return cl.cast(ois.readObject());
    }
  }
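The other half of the story, producing the .ser data in the first place, is just as short. Here is a sketch of a complete round trip done in memory. The class SerRoundTrip, its methods and the sample coefficient values are invented for the example; the real project may write its files differently.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical example class, not part of the tide project.
class SerRoundTrip
{
  // Serializes any Serializable object graph into a byte array.
  static byte[] toBytes(Serializable object) throws IOException
  {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream oos = new ObjectOutputStream(bytes))
    {
      oos.writeObject(object);
    }
    return bytes.toByteArray();
  }

  // Reads the object back and checks that it has the expected type.
  static <T> T fromBytes(byte[] data, Class<T> cl) throws IOException, ClassNotFoundException
  {
    try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(data)))
    {
      return cl.cast(ois.readObject());
    }
  }

  public static void main(String[] args) throws Exception
  {
    // Sample values only, not real tidal coefficients.
    HashMap<String, Double> coefficients = new HashMap<>();
    coefficients.put("M2", 1.25);
    coefficients.put("S2", 0.5);

    byte[] data = toBytes(coefficients);
    Map<?, ?> restored = fromBytes(data, HashMap.class);
    System.out.println(restored.equals(coefficients)); // prints true
  }
}
```

No knowledge of the structure of the data is needed on either side, as long as every class in the graph implements Serializable.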

JSON Deserializer

  // Needs: java.io.BufferedReader, java.io.InputStream, java.io.InputStreamReader,
  //        com.google.gson.Gson, com.google.gson.GsonBuilder
  public static <T> T loadObject(InputStream resource, Class<T> cl) throws Exception
  {
    StringBuilder sb = new StringBuilder();
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader br = new BufferedReader(new InputStreamReader(resource)))
    {
      String line;
      while ((line = br.readLine()) != null)
        sb.append(line);
    }
    // gson is a static field of the enclosing class, created on first use
    if (gson == null)
      gson = new GsonBuilder().setPrettyPrinting().create();
    return gson.fromJson(sb.toString(), cl);
  }

XML SAX Handler

Here is the code starting the SAX process
  public static Map getStationData() throws Exception
  {
    Map stationData = new HashMap();
    StationFinder sf = new StationFinder(stationData);
    try
    {
      SAXParserFactory factory = SAXParserFactory.newInstance();
      SAXParser saxParser = factory.newSAXParser();
      // Open the stations entry of the zip archive and stream it through the handler
      InputSource is = BackEndTideComputer.getZipInputSource(ARCHIVE_STREAM, STATIONS_ENTRY);
      saxParser.parse(is, sf);
    }
    catch (Exception ex)
    {
      // Errors are only printed; the map is returned in whatever state it is in
      ex.printStackTrace();
    }

    return stationData;
  }
Notice that this code invokes a custom handler, named StationFinder:
  public static class StationFinder extends DefaultHandler
  {
    private String stationName = "";
    private TideStation ts = null;
    private Map stationMap = null;
    
    public void setStationName(String sn)
    {
      this.stationName = sn;
    }
    
    public StationFinder()
    {
    }

    public StationFinder(Map map)
    {
      this.stationMap = map;
    }
    
    public TideStation getTideStation()
    {
      return ts;
    }

    private boolean foundStation        = false;
    private boolean foundNameCollection = false;
    private boolean foundStationData    = false;
    
    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes)
      throws SAXException
    {
//    super.startElement(uri, localName, qName, attributes);
      if (!foundStation && "station".equals(qName))
      {
        String name = attributes.getValue("name");
        // getValue() returns null when the attribute is missing
        if (name != null && name.contains(this.stationName))
        {
          foundStation = true;
          ts = new TideStation();
          ts.setFullName(name);
        }
      }
      else if (foundStation)
      {
        if ("name-collection".equals(qName))
        {
          foundNameCollection = true;
        }
        else if ("name-part".equals(qName) && foundNameCollection)
        {
          ts.getNameParts().add(attributes.getValue("name"));
        }
        else if ("position".equals(qName))
        {
          ts.setLatitude(Double.parseDouble(attributes.getValue("latitude")));
          ts.setLongitude(Double.parseDouble(attributes.getValue("longitude")));
        }
        else if ("time-zone".equals(qName))
        {
          ts.setTimeZone(attributes.getValue("name"));
          ts.setTimeOffset(attributes.getValue("offset"));
        }
        else if ("base-height".equals(qName))
        {
          ts.setBaseHeight(Double.parseDouble(attributes.getValue("value")));
          ts.setUnit(attributes.getValue("unit"));
        }
        else if ("station-data".equals(qName))
        {
          foundStationData = true;
        }
        else if (foundStationData && "harmonic-coeff".equals(qName))
        {
          String name = attributes.getValue("name");
          double amplitude = Double.parseDouble(attributes.getValue("amplitude"));
          double epoch     = Double.parseDouble(attributes.getValue("epoch")) * TideUtilities.COEFF_FOR_EPOCH;
          Harmonic h = new Harmonic(name, amplitude, epoch);
          ts.getHarmonics().add(h);
        }
      }
    }
The code snippets speak for themselves... SAX is the most demanding, by far. Also notice that it requires full knowledge of the structure of the data to parse; Java and JSON deserialization do not. As long as the data have been correctly serialized, they will be correctly deserialized (which brings up the discussion about the famous serialVersionUID in Java, but that is another story).
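For the curious, here is a very short sketch of what that serialVersionUID story is about. The two nested classes are invented for the example. One pins its version explicitly; the other lets the runtime compute an identifier from the structure of the class, so that identifier can change when the class changes, and previously serialized data may then be rejected at load time.

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

// Hypothetical example class, not part of the tide project.
class SerialVersionSketch
{
  // Version pinned by hand: stays the same until the author decides to change it.
  static class Pinned implements Serializable
  {
    private static final long serialVersionUID = 1L;
    String name;
    double amplitude;
  }

  // No explicit version: the runtime derives one from the class structure.
  static class Computed implements Serializable
  {
    String name;
    double amplitude;
  }

  // Returns the identifier the serialization runtime associates with a Serializable class.
  static long uidOf(Class<?> cl)
  {
    return ObjectStreamClass.lookup(cl).getSerialVersionUID();
  }

  public static void main(String[] args)
  {
    System.out.println("Pinned:   " + uidOf(Pinned.class));
    System.out.println("Computed: " + uidOf(Computed.class));
  }
}
```

Pinning the value is the usual way to keep old .ser files readable after a compatible change to the class.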

All the code, including the archives containing the data, is available from Google Code. The specific class displayed above is right here.
