xml.sax._exceptions.SAXParseException: junk after document element

Problem

xml.sax._exceptions.SAXParseException: /Users/neil/Documents/workspace/helloworld/src/data/citeseerx/t2.txt:205:0: junk after document element
Traceback (most recent call last):
  File "/Users/neil/Documents/workspace/helloworld/src/data/citeseerx/CiteseerContentHandler.py", line 163, in <module>
    parser.parse(open('/Users/neil/Documents/workspace/helloworld/src/data/citeseerx/t2.txt'))     
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/sax/expatreader.py", line 107, in parse
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/sax/xmlreader.py", line 123, in parse
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/sax/expatreader.py", line 211, in feed
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/sax/handler.py", line 38, in fatalError

Reason

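In my case it was caused by several XML documents concatenated into one file. A well-formed XML document has exactly one root element, so the parser rejects whatever follows the first closing root tag. The 205:0 in the message is the line and column at which the parser gave up. The file looked like this: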

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2009-05-01T12:07:02+00:00</responseDate>
<request metadataPrefix="oai_dc" verb="ListRecords">http://citeseerx.ist.psu.edu/oai2</request>
<ListRecords>
...
...
...
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2009-05-01T12:07:02+00:00</responseDate>
<request metadataPrefix="oai_dc" verb="ListRecords">http://citeseerx.ist.psu.edu/oai2</request>
<ListRecords>
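The same error can be reproduced with a much smaller input. The sketch below is not taken from the original script: the two-element string and the try_parse helper are made up for illustration. It only shows that a second root element after the first one is enough to trigger the exception.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Two complete documents back to back, like the concatenated file above
# (hypothetical miniature input, not the real harvest data).
two_docs = b"<a>first</a>\n<a>second</a>\n"
one_doc = b"<a>first</a>\n"

def try_parse(data):
    """Return None if data parses cleanly, else the parser's error message."""
    try:
        xml.sax.parseString(data, ContentHandler())
    except xml.sax.SAXParseException as exc:
        return exc.getMessage()
    return None

error_message = try_parse(two_docs)
print(error_message)
print(try_parse(one_doc))
```

The single document parses without complaint, while the doubled one fails as soon as the parser meets the second root element.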

Solution

Define your own error handler

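import xml.sax.handler

class ErrorHandler(xml.sax.handler.ErrorHandler):
    def __init__(self, parser):
        self.parser = parser

    def fatalError(self, exception):
        # Called with the SAXParseException; the default handler raises it.
        print(exception)
        # add your handling here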

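Then pass it to the parser, e.g.:

    # parser is the xml.sax parser you already created
    errorHandler = ErrorHandler(parser)
    parser.setErrorHandler(errorHandler)
    parser.parse("t2.txt")

Note that this only keeps the exception from propagating. The underlying expat parser does not resume after a fatal error, so everything after the first document is still not delivered to the content handler, and fatalError may be called more than once.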
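If the records in the later documents are needed as well, one option is to cut the file into separate documents and parse them one at a time. The following is a minimal sketch under two assumptions: every document starts with an XML declaration, as in the sample above, and the string "<?xml " never occurs inside a document body. The names split_documents, parse_all and ElementCounter are hypothetical; ElementCounter merely stands in for the real content handler.

```python
import re
import xml.sax
from xml.sax.handler import ContentHandler

# Assumption: each document begins with an XML declaration. The trailing \s
# keeps processing instructions such as <?xml-stylesheet ...?> from matching.
DECLARATION = re.compile(br"<\?xml\s")

def split_documents(data):
    """Split concatenated XML (bytes) into one bytes chunk per document.

    Anything before the first declaration is ignored.
    """
    starts = [m.start() for m in DECLARATION.finditer(data)]
    if not starts:
        # No declaration at all: treat non-empty input as a single document.
        return [data] if data.strip() else []
    ends = starts[1:] + [len(data)]
    return [data[s:e] for s, e in zip(starts, ends)]

class ElementCounter(ContentHandler):
    """Hypothetical stand-in for the real content handler."""
    def __init__(self):
        ContentHandler.__init__(self)
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1

def parse_all(data, handler):
    """Parse every document with the same handler; return how many were parsed."""
    documents = split_documents(data)
    for document in documents:
        xml.sax.parseString(document, handler)
    return len(documents)

sample = (b'<?xml version="1.0" encoding="UTF-8"?>\n<a><b/></a>\n'
          b'<?xml version="1.0" encoding="UTF-8"?>\n<a><b/><b/></a>\n')
counter = ElementCounter()
parsed = parse_all(sample, counter)
print(parsed, counter.count)
```

For the real file the bytes would come from something like open("t2.txt", "rb").read(). That reads the whole file into memory, which is fine for small harvests but probably not for very large ones.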