XPath is usually an excellent fit for computing values from the content of an XML document: it is simple and elegant. For example, say that you want to count the number of elements in a document. The XPath 1.0 query for this is about as concise as it gets:
    import java.io.FileReader;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.xml.sax.InputSource;

    XPathFactory xPathFactory = XPathFactory.newInstance();
    XPath xPath = xPathFactory.newXPath();
    long startTime = System.currentTimeMillis();
    // count(//*) selects every element in the document and returns the total as an XPath number.
    Double elementCount = (Double) xPath.evaluate("count(//*)", new InputSource(new FileReader("standard.xml")), XPathConstants.NUMBER);
    long endTime = System.currentTimeMillis();
    System.out.println("Element count: " + elementCount.intValue());
    System.out.println("Execution time: " + (endTime - startTime) + " millis");

    // Output
    // =====
    // Element count: 1666315
    // Execution time: 36989 millis
Code Listing 1: XPath query to count the number of elements.
Choose a large XML document, however, and XPath performance can suffer. For example, try executing the above expression against the 100MB example XML document available for download here. If you do so using Sun JDK 1.6, you'll need to increase the maximum heap size of the JVM to at least 1GB (for example, with -Xmx1g). This is because the default XPath implementation included with the Sun JDK uses the Xerces DOM parser to load the entire XML document into memory as the first step of evaluating an XPath expression. Based on my own tests, the query takes more than 35 seconds to execute on the 100MB XML document, and almost the entire time is spent parsing.
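One way to check where the time goes is to run the DOM parse and the XPath evaluation as two separate, individually timed steps. The following is only a minimal sketch: the class name ParseVersusEvaluate and its helper methods are made up for illustration, and a tiny inline document stands in for standard.xml so the example runs without any files.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original listings.
public class ParseVersusEvaluate {

    /** Step 1: parse the XML text into an in-memory DOM tree. */
    public static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    /** Step 2: evaluate count(//*) against an already-parsed DOM tree. */
    public static int countElements(Document document) throws Exception {
        XPath xPath = XPathFactory.newInstance().newXPath();
        Double count = (Double) xPath.evaluate("count(//*)", document, XPathConstants.NUMBER);
        return count.intValue();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: a small inline sample replaces the 100MB standard.xml.
        String xml = "<catalog><item><name>a</name></item><item><name>b</name></item></catalog>";

        long parseStart = System.nanoTime();
        Document document = parse(xml);
        long parseEnd = System.nanoTime();
        int count = countElements(document);
        long evaluateEnd = System.nanoTime();

        System.out.println("Element count: " + count); // 5 elements in the sample
        System.out.println("Parse time: " + (parseEnd - parseStart) / 1000 + " micros");
        System.out.println("Evaluate time: " + (evaluateEnd - parseEnd) / 1000 + " micros");
    }
}
```

On a document this small the two timings say little; to reproduce the measurement described above, you would feed parse() the large file instead of the inline string.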
For a large document like this one, a much faster alternative to XPath is to implement a SAX ContentHandler, which processes the document as a stream of events instead of building a tree in memory.
    import java.io.FileReader;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.InputSource;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    /**
     * @author Chip Killmar
     */
    public class CountElementsSAXHandler extends DefaultHandler {
        private int elementCount;

        // Called once for every start tag the parser encounters.
        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
            ++elementCount;
        }

        public int getElementCount() {
            return elementCount;
        }

        public static void main(String[] args) throws Exception {
            SAXParserFactory saxParserFactory = SAXParserFactory.newInstance();
            SAXParser saxParser = saxParserFactory.newSAXParser();
            CountElementsSAXHandler handler = new CountElementsSAXHandler();
            long startTime = System.currentTimeMillis();
            saxParser.parse(new InputSource(new FileReader("standard.xml")), handler);
            long endTime = System.currentTimeMillis();
            System.out.println("Element count: " + handler.getElementCount());
            System.out.println("Execution time: " + (endTime - startTime) + " millis");

            // Output
            // =====
            // Element count: 1666315
            // Execution time: 7229 millis
        }
    }
Code Listing 2: SAX ContentHandler to count the number of elements.
As you can see, the SAX implementation requires more effort than evaluating an XPath expression, but not much. The element counter above runs against the 100MB XML document in just over 7 seconds for me, which is more than 5 times faster than XPath.
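The same handler pattern stretches to slightly more selective queries. As a rough sketch, the class below counts only elements with a given name, which for a document without namespaces should match what an XPath expression such as count(//item) returns. The class name CountNamedElementsSAXHandler, the count helper and the element name "item" are all made up for illustration, and an inline sample document is used so the example runs on its own.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical variation on the element counter; not part of the original listings.
public class CountNamedElementsSAXHandler extends DefaultHandler {
    private final String targetName;
    private int matchCount;

    public CountNamedElementsSAXHandler(String targetName) {
        this.targetName = targetName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        // The default SAXParserFactory is not namespace-aware, so compare against qName.
        if (targetName.equals(qName)) {
            ++matchCount;
        }
    }

    public int getMatchCount() {
        return matchCount;
    }

    /** Streams the XML text once and returns how many elements are named targetName. */
    public static int count(String xml, String targetName) throws Exception {
        SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
        CountNamedElementsSAXHandler handler = new CountNamedElementsSAXHandler(targetName);
        saxParser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.getMatchCount();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: a small inline sample stands in for a real document.
        String xml = "<catalog><item><name>a</name></item><item><name>b</name></item></catalog>";
        System.out.println("item count: " + count(xml, "item")); // 2 item elements in the sample
    }
}
```

Anything that depends on an element's position or ancestors would need extra state in the handler, such as a depth counter or a stack of open element names, which is where the additional effort compared with XPath starts to show.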