[4suite] [Amara] performance of pushbind on large files?

Robert Casties casties at mpiwg-berlin.mpg.de
Fri Dec 22 08:53:32 MST 2006


Hi all,

I have a little program that reads FileMaker FMPXMLRESULT files (one of
the worst XML formats I've seen) and writes the data into a database.

Because the files are rather big (340MB), I wrote the first version of my
program using Python pulldom. The result was not bad, but it still takes
80 minutes (on a 1.6GHz Mac G5) to churn through 190,000 ROW elements
with 86 COL elements each.
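
For reference, a pulldom loop of this kind might look roughly like the
sketch below. It is not the actual program: the names FM_NS, iter_rows
and _text are made up for illustration, it uses only the standard
library's xml.dom.pulldom, and it assumes each COL carries its text in
DATA children as in the sample further down.

```python
import io
from xml.dom import pulldom

# Namespace taken from the FMPXMLRESULT root element in the sample.
FM_NS = "http://www.filemaker.com/fmpxmlresult"


def _text(node):
    """Concatenate the direct text children of a DOM node."""
    return "".join(c.data for c in node.childNodes
                   if c.nodeType == c.TEXT_NODE)


def iter_rows(source):
    """Yield one list of column strings per ROW element.

    `source` is a file name or a binary file object. Only the ROW
    currently being processed is expanded into a DOM subtree.
    """
    events = pulldom.parse(source)
    for event, node in events:
        if (event == pulldom.START_ELEMENT
                and node.namespaceURI == FM_NS
                and node.localName == "ROW"):
            # Pull in the children of this ROW only.
            events.expandNode(node)
            yield [
                "".join(_text(d)
                        for d in col.getElementsByTagNameNS(FM_NS, "DATA"))
                for col in node.getElementsByTagNameNS(FM_NS, "COL")
            ]


if __name__ == "__main__":
    # Tiny hypothetical document with the same shape as the real export.
    sample = (b'<FMPXMLRESULT xmlns="http://www.filemaker.com/fmpxmlresult">'
              b'<RESULTSET FOUND="1"><ROW MODID="1" RECORDID="1">'
              b'<COL><DATA>a</DATA></COL><COL><DATA>b</DATA></COL>'
              b'</ROW></RESULTSET></FMPXMLRESULT>')
    for row in iter_rows(io.BytesIO(sample)):
        print(row)
```

In a sketch like this, each ROW becomes a small DOM subtree that can be
discarded once its values have been written to the database.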

So I thought maybe amara.pushbind (or pushdom) would do better and wrote
a version of the program using pushbind. The resulting program runs
nicely on small files (50 ROWs) but it takes forever to run on big files
(190k ROWs). The program essentially stops at the first call of

for f in amara.pushbind(filename, u'fm:METADATA/fm:FIELD', prefixes=fm_ns):
    fn = f.NAME

for 30 minutes, using almost full CPU (even though the METADATA element
is at the very beginning of the file). After the stall it seems to run
OK, though I haven't timed it.
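
For comparison, the same field names can be collected with the standard
library's pulldom by stopping as soon as the METADATA element closes,
so only the start of the file is read. This is a minimal sketch of a
different technique, not of pushbind itself; read_field_names and FM_NS
are hypothetical names.

```python
import io
from xml.dom import pulldom

# Namespace taken from the FMPXMLRESULT root element in the sample.
FM_NS = "http://www.filemaker.com/fmpxmlresult"


def read_field_names(source):
    """Return the NAME attribute of every METADATA/FIELD element.

    `source` is a file name or a binary file object. Parsing stops at
    the closing METADATA tag, so the RESULTSET is never processed.
    """
    names = []
    for event, node in pulldom.parse(source):
        if event == pulldom.START_ELEMENT:
            if node.namespaceURI == FM_NS and node.localName == "FIELD":
                names.append(node.getAttribute("NAME"))
        elif event == pulldom.END_ELEMENT:
            if node.namespaceURI == FM_NS and node.localName == "METADATA":
                break  # everything after this point is row data
    return names


if __name__ == "__main__":
    # Tiny hypothetical document with the same shape as the real export.
    sample = (b'<FMPXMLRESULT xmlns="http://www.filemaker.com/fmpxmlresult">'
              b'<METADATA><FIELD NAME="primary_publication" TYPE="TEXT"/>'
              b'</METADATA><RESULTSET FOUND="0"/></FMPXMLRESULT>')
    print(read_field_names(io.BytesIO(sample)))
```

Timing such a loop against the pushbind call on the same file would
show whether the stall comes from the parser or from the pattern
matching.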

Is this a known problem? Is pushbind the wrong solution, or am I doing
something wrong?

The XML structure is like this:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE FMPXMLRESULT PUBLIC "-//FMI//DTD FMPXMLRESULT//EN"
"/fmi/xml/FMPXMLRESULT.dtd">
<FMPXMLRESULT xmlns="http://www.filemaker.com/fmpxmlresult">
  <ERRORCODE>0</ERRORCODE>
  <PRODUCT BUILD="11/14/2005" NAME="FileMaker Web Publishing Engine"
VERSION="8.0.2.65"/>
  <DATABASE DATEFORMAT="MM/dd/yyyy" LAYOUT="WWW2" NAME="cdli_cat.fp7"
RECORDS="194154" TIMEFORMAT="HH:mm:ss"/>
  <METADATA>
    <FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="primary_publication"
TYPE="TEXT"/>
[... more FIELD tags (about 80 more)]
  </METADATA>
  <RESULTSET FOUND="194154">
    <ROW MODID="118" RECORDID="1">
      <COL>
        <DATA>ATU 3, pl. 011, W 6435,a</DATA>
      </COL>
      <COL>
        <DATA>Englund, Robert K. and Nissen, Hans J.</DATA>
      </COL>
[... more COL tags (about 80 more)]
    </ROW>
[... more ROW tags like above (about 190k more)]
  </RESULTSET>
</FMPXMLRESULT>

(I told you it's ugly ;-)

Thanks for any enlightenment,

	Robert

-- 
Dr. Robert Casties -- Information Technology Group
Max Planck Institute for the History of Science
Boltzmannstr. 22, D-14195 Berlin
Tel: +49/30/22667-342 Fax: -340

