[4suite] [amara] parsing html of various encodings

Uche Ogbuji uche at ogbuji.net
Sat Oct 25 10:40:37 MDT 2008


Just revisiting this old thread to mention that Amara 2.x's HTML parser
handles this no problem.

>>> from amara.bindery import html
>>> doc = html.parse("http://www.hitimewine.net/istar.asp?a=6&id=161153!1247")
>>> f = doc.xml_select(u'//form[@name="notesinfo"]')[0]
>>> f.xml_children[1]
<input at 0x118e780: name u'input', 0 namespaces, 3 attributes, 0 children>
>>>

node.xml_select(expr) is how you invoke XPath in Amara 2.x.
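
For anyone without Amara 2.x installed, here is a rough stand-in for the
same idea (parse sloppy HTML, then query it with a path expression) using
only the Python 3 standard library. It is not Amara's parser or API:
SoupTreeBuilder and parse_soup are hypothetical names, the markup is made
up for illustration, and ElementTree supports only a small subset of XPath.

```python
# Stand-in sketch (NOT Amara): build an ElementTree from sloppy HTML with the
# standard library, then query it with ElementTree's limited XPath subset.
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SoupTreeBuilder(HTMLParser):
    """Hypothetical helper: tolerate unclosed and stray tags while building a tree."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")  # synthetic root, always present
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. <input checked>) arrive as None; store "".
        attrib = {name: (value or "") for name, value in attrs}
        elem = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID_TAGS:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Attach text to the current element, or as the tail of its last child.
        current = self.stack[-1]
        if len(current):
            last = current[-1]
            last.tail = (last.tail or "") + data
        else:
            current.text = (current.text or "") + data


def parse_soup(markup):
    """Return the synthetic root element for a string of messy HTML."""
    builder = SoupTreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


# Made-up markup: unquoted attributes, an unclosed <p>, no end tags on <input>.
messy = """
<html><body>
<form name="notesinfo" action="/save">
  <p>Notes:
  <input type="text" name="notes" value="dry, citrus">
  <input type=hidden name=id value=161153>
</form>
<p>unclosed paragraph
</body></html>
"""

doc = parse_soup(messy)
form = doc.findall(".//form[@name='notesinfo']")[0]
inputs = form.findall(".//input")
print([i.get("name") for i in inputs])  # prints ['notes', 'id']
```

The builder closes the nearest matching open element on each end tag, so
the unclosed <p> inside the form does not stop the form from being found.
Real-world pages will need more care than this (encodings, misnested
tables, scripts), which is the part a dedicated tag-soup parser handles.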

http://wiki.xml3k.org/Amara2
http://wiki.xml3k.org/Amara2/Seven_days/2

--Uche

John Kleven wrote:
> Thanks Dave - that is exactly what i'm looking for.
>
> Except I could use some python bindings.  Looks like Uche and others
> thought about this here:
> http://www.stylusstudio.com/xmldev/200512/post70140.html
>
> I'm going to use BeautifulSoup in the meantime and forget about
> Xpath.  I'm not a huge fan of using commands.getstatusoutput (to
> launch a java process for tagsoup) unless I really have to.
>
> Again, really appreciate the responses.  Only prob is, how is XPath in
> python really gonna take off without a way to take real world nasty
> html and still successfully execute XPath requests?
>
> J
>
> On Jan 23, 2008 12:06 AM, Dave Pawson <dave.pawson at gmail.com> wrote:
>
>     On 23/01/2008, John Kleven <johnkleven at gmail.com> wrote:
>     > Thank you kindly for the response.
>     >
>     > In regards to (1 - nasty html), i'm now running my html through
>     > mxTidy, and even after converting it to xml or xhtml, nasty pages
>     > won't parse.  And this is only the 3rd page i've tried.
>
>     John Cowan has a page on tagsoup - same idea as Tidy, a bit more
>     aggressive though?
>     Google on tagsoup.
>
>
>     HTH
>
>
>
>
>     --
>     Dave Pawson
>     XSLT XSL-FO FAQ.
>     http://www.dpawson.co.uk
>
>
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> 4suite mailing list
> 4suite at lists.fourthought.com
> http://lists.fourthought.com/mailman/listinfo/4suite
>   


-- 
Uche Ogbuji                       http://uche.ogbuji.net
Founding Partner, Zepheira        http://zepheira.com
Linked-in profile: http://www.linkedin.com/in/ucheogbuji
Articles: http://uche.ogbuji.net/tech/publications/


