[4suite] [amara] parsing html of various encodings
Uche Ogbuji
uche at ogbuji.net
Sat Oct 25 10:40:37 MDT 2008
Just revisiting this old thread to mention that Amara 2.x's HTML parser
handles this no problem.
>>> from amara.bindery import html
>>> doc = html.parse("http://www.hitimewine.net/istar.asp?a=6&id=161153!1247")
>>> f = doc.xml_select(u'//form[@name="notesinfo"]')[0]
>>> f.xml_children[1]
<input at 0x118e780: name u'input', 0 namespaces, 3 attributes, 0 children>
>>>
node.xml_select(expr) is how you invoke XPath in Amara 2.x
http://wiki.xml3k.org/Amara2
http://wiki.xml3k.org/Amara2/Seven_days/2
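
For anyone who wants to try the same idea (parse sloppy HTML tolerantly, then query it with a path expression) without Amara installed, here is a rough sketch using only the Python 3 standard library. It is not how Amara's parser works and it is far less forgiving than tagsoup or Amara. It also relies on ElementTree's limited XPath subset rather than full XPath. The SoupToTree class, the parse_html helper, the "document" wrapper element and the SAMPLE markup are all made up for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take a closing tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"}


class SoupToTree(HTMLParser):
    """Tolerant HTML -> ElementTree builder (illustrative only, not Amara)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")  # hypothetical wrapper element
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. <input disabled>) arrive as None.
        clean = {k: (v if v is not None else "") for k, v in attrs}
        node = ET.SubElement(self.stack[-1], tag, clean)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            # Text following a child element belongs in that child's tail.
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def parse_html(text):
    parser = SoupToTree()
    parser.feed(text)
    parser.close()
    return parser.root


# Made-up markup with unquoted attributes and unclosed <p>, <br>, <input>.
SAMPLE = """<html><body>
<form name=notesinfo action="/save">
<p>Notes<br>
<input type=hidden name=id value=161153>
<input type=text name=notes>
</form>
<p>unclosed paragraph
</body>"""

doc = parse_html(SAMPLE)
form = doc.findall(".//form[@name='notesinfo']")[0]
inputs = form.findall(".//input")
print([i.get("name") for i in inputs])
```

The end-tag handler is the part that copes with tag soup: a closing tag closes everything opened since the matching start tag, so the unclosed <p> inside the form does not swallow the rest of the page.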
--Uche
John Kleven wrote:
> Thanks Dave - that is exactly what I'm looking for.
>
> Except I could use some Python bindings. Looks like Uche and others
> thought about this here:
> http://www.stylusstudio.com/xmldev/200512/post70140.html
>
> I'm going to use BeautifulSoup in the meantime and forget about
> XPath. I'm not a huge fan of using commands.getstatusoutput (to
> launch a Java process for tagsoup) unless I really have to.
>
> Again, really appreciate the responses. Only prob is, how is XPath in
> Python really gonna take off without a way to take real-world nasty
> HTML and still successfully execute XPath requests?
>
> J
>
> On Jan 23, 2008 12:06 AM, Dave Pawson <dave.pawson at gmail.com> wrote:
>
> On 23/01/2008, John Kleven <johnkleven at gmail.com> wrote:
> > Thank you kindly for the response.
> >
> > In regards to (1 - nasty html), I'm now running my html through
> > mxTidy, and even after converting it to xml or xhtml, nasty pages
> > won't parse. And this is only the 3rd page I've tried.
>
> John Cowan has a page on tagsoup - same idea as Tidy, a bit more
> aggressive though?
> Google on tagsoup.
>
>
> HTH
>
> --
> Dave Pawson
> XSLT XSL-FO FAQ.
> http://www.dpawson.co.uk
--
Uche Ogbuji http://uche.ogbuji.net
Founding Partner, Zepheira http://zepheira.com
Linked-in profile: http://www.linkedin.com/in/ucheogbuji
Articles: http://uche.ogbuji.net/tech/publications/