[4suite] [Amara] performance of pushbind on large files?

Uche Ogbuji uche at ogbuji.net
Fri Dec 22 10:01:34 MST 2006


Robert Casties wrote:
> I have a little program that reads FileMaker FMPXMLRESULT files (one of
> the worst XML formats I've seen) and writes the data into a database.
> 
> Because the files are rather big (340MB) I wrote the first version of my
> program using Python pulldom. The result was not bad but it still takes
> 80 minutes (on a 1.6GHz Mac G5) to churn through 190,000 ROW elements
> with 86 COL elements each.
> 
> So I thought maybe amara.pushbind (or pushdom) would do better and wrote
> a version of the program using pushbind. The resulting program runs
> nicely on small files (50 ROWs) but it takes forever to run on big files
> (190k ROWs). The program essentially stops at the first call of
> 
> for f in amara.pushbind(filename, u'fm:METADATA/fm:FIELD', prefixes=fm_ns):
>     fn = f.NAME
> 
> for 30 minutes, using almost full CPU (even though the METADATA tag is
> at the beginning of the file). After the stall it seems to run OK,
> though I haven't timed it.
> 
> Is this a known problem and is pushbind the wrong solution or am I doing
> something wrong?

Actually, I used to run pushbind on large files (around 100MB, though
not 340MB) months ago with no problems.  A few weeks ago I needed to
process an 80MB document for a client, and pushbind did just what you
describe.  I've been too busy recently to go back and investigate, but
I suspect I introduced a bug into pushbind at some point, and I need to
fix it.

Thanks for this reminder.  I'll have a look today or this weekend, and
I'll include your example in my testing and report back.

FWIW, pushbind and all other bindery operations should become a lot
faster with the architectural changes planned for Amara 2.0.  Even now,
though, pushbind should handle these use cases at least as well as
pulldom; if it doesn't, I consider that a bug.
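
For comparison, here is a minimal sketch of the pulldom-style baseline
mentioned above, written against the standard library's xml.dom.pulldom
in current Python.  It is not Robert's program and not Amara code.  The
namespace URI, the tiny SAMPLE document, and the helper names iter_rows
and col_text are assumptions made for illustration only.  The idea is
to read the FIELD names from METADATA as they stream past, then expand
one ROW at a time so that only a single row's DOM is built per step.

```python
import io
from xml.dom import pulldom

# Assumed namespace URI and a tiny stand-in document (illustration only);
# a real FMPXMLRESULT export carries more elements and attributes.
FM_NS = "http://www.filemaker.com/fmpxmlresult"
SAMPLE = (
    '<FMPXMLRESULT xmlns="%s">'
    '<METADATA><FIELD NAME="id"/><FIELD NAME="title"/></METADATA>'
    '<RESULTSET>'
    '<ROW><COL><DATA>1</DATA></COL><COL><DATA>first</DATA></COL></ROW>'
    '<ROW><COL><DATA>2</DATA></COL><COL><DATA>second</DATA></COL></ROW>'
    '</RESULTSET>'
    '</FMPXMLRESULT>' % FM_NS
)


def col_text(col):
    """Concatenate the text of every DATA element inside one COL."""
    parts = []
    for data in col.getElementsByTagNameNS(FM_NS, "DATA"):
        parts.extend(n.data for n in data.childNodes
                     if n.nodeType == n.TEXT_NODE)
    return "".join(parts)


def iter_rows(stream):
    """Yield one dict per ROW, keyed by the FIELD names seen in METADATA."""
    names = []
    events = pulldom.parse(stream)
    for event, node in events:
        if event != pulldom.START_ELEMENT or node.namespaceURI != FM_NS:
            continue
        if node.localName == "FIELD":
            # Attributes are already available on the start event.
            names.append(node.getAttribute("NAME"))
        elif node.localName == "ROW":
            # Build a DOM subtree for this ROW only, then hand it off.
            events.expandNode(node)
            cols = node.getElementsByTagNameNS(FM_NS, "COL")
            yield dict(zip(names, (col_text(c) for c in cols)))


if __name__ == "__main__":
    for row in iter_rows(io.BytesIO(SAMPLE.encode("utf-8"))):
        print(row)
```

For a real export, an open binary file object would take the place of
the BytesIO stream, and the database write would go where the print
call is.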


-- 
Uche Ogbuji                               Work: The Kadomo Group, Inc.
http://uche.ogbuji.net                    http://kadomo.com
http://copia.ogbuji.net                   Lead dev at http://4Suite.org
Articles: http://uche.ogbuji.net/tech/publications/

