[4suite] Aborting SAX document parsing
Sidney
csidney79 at gmail.com
Thu Aug 30 11:42:47 MDT 2007
I believe this is a limitation of Expat or, more likely, of XML parsing in
general. This question has been raised before in other forums and lists
about XML parsing. As far as I know, there is no 'graceful' way of stopping
the Expat parser other than raising an exception.
The problem is XML's strict requirement for syntactic correctness. An XML
document must be well-formed to be called "XML". So, when an XML parser is
invoked, it "logically does not make sense" to stop the parsing in the
middle of a document without checking that the rest of the document is also
well-formed. If the rest of the document is not well-formed, then it really
isn't XML. For example, consider this document:
<?xml version='1.0'?>
<root>
<a>
<b>Hello, World</b>
</c><!-- invalid xml -->
</root>
If all you wanted was the content of node 'b', what is the correct thing to
do once your SAX content handler has obtained it? How do you 'gracefully'
exit parsing without checking that the rest of the document is well-formed
XML? Even if you know for sure that the document is correct XML, the parser
doesn't. It is also not good practice to assume your documents are correct
XML. So, if you really do want to stop, raising an exception is the
"correct way", because the parser has not guaranteed that the document is
really well-formed XML.
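To make this concrete, here is a minimal sketch. It uses Python's
standard-library xml.sax rather than 4Suite (my substitution, so that it
runs without Ft.Xml), and the names StopParsing, GrabB, text_of_b and
parses_completely are made up for illustration. It feeds the document above
to a handler that raises as soon as 'b' closes, and separately checks
whether the same document survives a full parse:

```python
import xml.sax

# The example document from above; the stray </c> makes it not well-formed.
DOC = b"""<?xml version='1.0'?>
<root>
<a>
<b>Hello, World</b>
</c><!-- invalid xml -->
</root>
"""


class StopParsing(Exception):
    """Raised by the handler to abort parsing early (hypothetical name)."""


class GrabB(xml.sax.ContentHandler):
    """Collect the text of the first <b> element, then abort."""

    def __init__(self):
        super().__init__()
        self.in_b = False
        self.chunks = []

    def startElement(self, name, attrs):
        if name == "b":
            self.in_b = True

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it.
        if self.in_b:
            self.chunks.append(content)

    def endElement(self, name):
        if name == "b":
            raise StopParsing


def text_of_b(doc):
    """Return the text of <b>, stopping the parser as soon as it closes."""
    handler = GrabB()
    try:
        xml.sax.parseString(doc, handler)
    except StopParsing:
        pass  # expected: we aborted on purpose
    return "".join(handler.chunks)


def parses_completely(doc):
    """Return True only if the whole document parses without error."""
    try:
        xml.sax.parseString(doc, xml.sax.ContentHandler())
    except xml.sax.SAXParseException:
        return False
    return True
```

With this sketch, text_of_b(DOC) hands back the content of 'b' even though
parses_completely(DOC) reports a failure: the early exit never reaches the
bad closing tag, so the exception is the only honest signal that the
document was not fully checked.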
Take a look at Uche's Amara XML toolkit. You will be interested in the
"pushbind" interface as described here:
http://uche.ogbuji.net/tech/4suite/etc/amara-manual.html#pushbind
It will simplify what you are trying to do. Behind the scenes, it uses SAX
(with the help of threads). But again, the parser doesn't stop at the
instant you've found what you want unless an exception is raised.
Another alternative, if you are working with very large and persistent XML
documents, is to use a 'true' XML database. A 'true' XML database is one
that does not store your document as a 'string' in a table, but actually
breaks your document apart and stores it in some object form that is
optimized for querying; that is, the database doesn't need to parse the
entire document every time it is queried. With a true XML database, you
will be able to query it for exactly the nodes you want.
Hopefully my suggestions are useful. Best of luck!
-----Original Message-----
From: Simone Leo [mailto:simleo at tiscali.it]
Sent: Thursday, August 30, 2007 7:54 AM
To: Sidney
Cc: 4suite at lists.fourthought.com
Subject: Re: [4suite] Aborting SAX document parsing
Sidney wrote:
> Just set the parser instance's content handler to 'None'.
>
Thanks for the suggestion, but this is not enough for what I'm trying to
achieve. I need the parser to immediately stop reading and return
control to the main script. Setting the content handler to None just
ensures that the (old) handler does not receive any more events, but the
parser keeps on reading.
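A rough way to see the difference, sketched with the standard-library
xml.sax instead of 4Suite (an assumption on my part, so the numbers only
illustrate the behaviour; CountingStream, Detach, Abort and bytes_consumed
are hypothetical names). Since the stdlib reader does not accept None as a
handler during a parse, a do-nothing ContentHandler stands in for it:

```python
import io
import xml.sax


class CountingStream(io.BytesIO):
    """In-memory stream that records how many bytes the parser pulled."""

    def __init__(self, data):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


class StopParsing(Exception):
    """Raised by the handler to abort parsing early."""


class Detach(xml.sax.ContentHandler):
    """Swap in a do-nothing handler once the wanted element has closed."""

    def __init__(self, parser):
        super().__init__()
        self.parser = parser

    def endElement(self, name):
        if name == "head":
            self.parser.setContentHandler(xml.sax.ContentHandler())


class Abort(xml.sax.ContentHandler):
    """Raise once the wanted element has closed."""

    def endElement(self, name):
        if name == "head":
            raise StopParsing


def bytes_consumed(doc, make_handler):
    """Parse doc and return how many bytes were read from the stream."""
    stream = CountingStream(doc)
    parser = xml.sax.make_parser()
    parser.setContentHandler(make_handler(parser))
    try:
        parser.parse(stream)
    except StopParsing:
        pass
    return stream.bytes_read


# The wanted data sits at the top, followed by a long tail we do not need.
DOC = b"<root><head>wanted</head>" + b"<item/>" * 50000 + b"</root>"

detached = bytes_consumed(DOC, Detach)
aborted = bytes_consumed(DOC, lambda parser: Abort())
```

Here detached ends up equal to len(DOC), while aborted covers only the
chunks read before the exception, which is the saving I am after when the
stream is a socket.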
Simone
> from Ft.Xml import InputSource, Sax
>
> class MyHandler(Sax.ContentHandler):
>
>     def __init__(self, parser):
>         self.parser = parser
>
>     def endElementNS(self, name, qname):
>         if qname == 'MyLastElement':
>             self.parser.setContentHandler(None)
>
>     [...]
>
> parser = Sax.CreateParser(False)
> handler = MyHandler(parser)
> parser.setContentHandler(handler)
> parser.parse(stream)
>
> -----Original Message-----
> From: 4suite-bounces at lists.fourthought.com
> [mailto:4suite-bounces at lists.fourthought.com] On Behalf Of Simone Leo
> Sent: Tuesday, August 28, 2007 5:40 AM
> To: 4suite at lists.fourthought.com
> Subject: [4suite] Aborting SAX document parsing
>
> I need to get data from large remote XML documents. Since the data I'm
> interested in resides in the topmost elements, I decided to use SAX,
> figuring I would be able to abort parsing as soon as it reached the
> closing tag of the last element I needed. In this way, if you read
> directly from the socket, you don't even have to download the rest of
> the document.
>
> Unfortunately I wasn't able to find any clean way to stop the parser
> from reading through the whole document other than the following hack:
>
> from Ft.Xml import InputSource, Sax
>
> class StopParsing(Exception):
>     pass
>
> class MyHandler(object):
>
>     [...]
>
>     def endElementNS(self, name, qname):
>         if qname == 'MyLastElement':
>             raise StopParsing
>
>     [...]
>
> parser = Sax.CreateParser(False) # No external DTD
> handler = MyHandler()
> parser.setContentHandler(handler)
> try:
>     parser.parse(stream)
> except StopParsing:
>     pass
>
> Is there a cleaner way to perform this trick?
>
> Thanks in advance,
>
> Simone Leo