Problem
xml.sax._exceptions.SAXParseException: /Users/neil/Documents/workspace/helloworld/src/data/citeseerx/t2.txt:205:0: junk after document element
Traceback (most recent call last):
  File "/Users/neil/Documents/workspace/helloworld/src/data/citeseerx/CiteseerContentHandler.py", line 163, in <module>
    parser.parse(open('/Users/neil/Documents/workspace/helloworld/src/data/citeseerx/t2.txt'))
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/sax/expatreader.py", line 107, in parse
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/sax/xmlreader.py", line 123, in parse
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/sax/expatreader.py", line 211, in feed
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/sax/handler.py", line 38, in fatalError
Reason
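In my case the error was caused by several XML documents concatenated in one file. A well-formed XML document has exactly one root element, so the parser reports whatever follows the first closing root tag as junk, e.g.: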
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2009-05-01T12:07:02+00:00</responseDate>
<request metadataPrefix="oai_dc" verb="ListRecords">http://citeseerx.ist.psu.edu/oai2</request>
<ListRecords>
... ... ...
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2009-05-01T12:07:02+00:00</responseDate>
<request metadataPrefix="oai_dc" verb="ListRecords">http://citeseerx.ist.psu.edu/oai2</request>
<ListRecords>
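The failure is easy to reproduce without the harvested file. The sketch below is written for a current Python rather than the 2.5 interpreter in the traceback; the two tiny documents and the helper name try_parse are made up for illustration. It feeds two complete documents, back to back, to xml.sax and captures the exception.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Two complete documents concatenated, mimicking the harvested file (sample data).
two_docs = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<a>first</a>\n'
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<a>second</a>\n'
)


def try_parse(data):
    """Return None if data parses cleanly, else the SAXParseException raised."""
    try:
        xml.sax.parseString(data, ContentHandler())
    except xml.sax.SAXParseException as exc:
        return exc
    return None


error = try_parse(two_docs)
print(error)  # the message points at the start of the second document
```

A single document passed to the same helper returns None, which confirms that the concatenation is what triggers the error.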
Solution
Define your own error handler:
    class ErrorHandler:
        def __init__(self, parser):
            self.parser = parser

        def fatalError(self, msg):
            # msg is the SAXParseException describing the problem
            print(msg)  # add your handling here
Then pass it to the parser before parsing, e.g.:
    errorHandler = ErrorHandler(parser)
    parser.setErrorHandler(errorHandler)
    parser.parse("t2.txt")
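Note that an error handler that swallows the fatal error only suppresses the exception; the parser cannot resume with the documents that follow the first one. If those are needed too, one possible alternative is to split the file at each XML declaration and parse the pieces one by one. The following is a minimal sketch under stated assumptions: every document in the file starts with its own `<?xml ...?>` declaration, and that byte sequence does not occur inside comments or CDATA. The names split_documents, parse_all and RecordCounter are hypothetical, and the sample data is invented.

```python
import re
import xml.sax
from xml.sax.handler import ContentHandler


def split_documents(data):
    """Split bytes into one chunk per XML declaration.

    Assumption: each document begins with '<?xml ' and anything before
    the first declaration can be ignored.
    """
    starts = [m.start() for m in re.finditer(br'<\?xml\s', data)]
    if not starts:
        return [data] if data.strip() else []
    bounds = starts + [len(data)]
    return [data[begin:end] for begin, end in zip(bounds, bounds[1:])]


class RecordCounter(ContentHandler):
    """Hypothetical handler that counts <record> elements."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.count = 0

    def startElement(self, name, attrs):
        if name == 'record':
            self.count += 1


def parse_all(data, handler):
    """Parse every document found in data with the same content handler."""
    for doc in split_documents(data):
        xml.sax.parseString(doc, handler)


sample = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<ListRecords><record/><record/></ListRecords>\n'
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<ListRecords><record/></ListRecords>\n'
)
counter = RecordCounter()
parse_all(sample, counter)
print(counter.count)
```

For a file on disk, read it in binary mode and pass the contents to parse_all. This loads the whole file into memory, which may not suit very large harvests.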