From lbrtchx at gmail.com Tue Apr 15 03:11:13 2008
From: lbrtchx at gmail.com (Albretch Mueller)
Date: Tue, 15 Apr 2008 05:11:13 -0400
Subject: [Xpath-ng] Does XPath internally access documents via DOM or serially parses them once using SAX?
Message-ID: <9ef66fac0804150211g2d2eb82eg7f1f87c296fd4342@mail.gmail.com>

 I have heard about using XPath in screen scraping code. However, I
have always wondered how effective this could possibly be if accessing
XML docs via DOM is so taxing (assuming XPath internally uses DOM)
~
 It would be really easy to declare the parts you would like to scrape
from HTML pages out there using XPath, but it may be very taxing for a
screen scraper based on a multi-threaded crawler. I am thinking of a
crawler that returns typed arrays of objects based on an array of
XPath declarations
~
 I certainly don't know XPath's guts. Am I making sense? Am I missing
something here? Could you point me to some high-performance screen
scraper implementation (preferably using Java)?
~
 Thanks
 lbrtchx

From wayland at wayland.id.au Tue Apr 15 17:27:55 2008
From: wayland at wayland.id.au (Timothy S. Nelson)
Date: Wed, 16 Apr 2008 09:27:55 +1000 (EST)
Subject: [Xpath-ng] Interesting article
Message-ID:

	Hi all. I've just written a couple of articles which might be of
interest. The first article explains the idea of TreePath. The second
article is about TreePath and XPath.

http://computerstuff.jdarx.info/content/treepath-universal-tree-navigation-language
http://computerstuff.jdarx.info/content/treepath-and-xpath

	:)

---------------------------------------------------------------------
| Name: Tim Nelson                 | Because the Creator is,        |
| E-mail: wayland at wayland.id.au | I am                           |
---------------------------------------------------------------------

----BEGIN GEEK CODE BLOCK----
Version 3.12
GCS d+++ s+: a- C++$ U+++$ P+++$ L+++ E- W+ N+ w--- V-
PE(+) Y+>++ PGP->+++ R(+) !tv b++ DI++++ D G+ e++>++++ h! y-
-----END GEEK CODE BLOCK-----