2010/10/13 Paul Houle <p...@ontology2.com>

>
>     Don't be intimidated by working with the data dumps.  If you've got
> an XML API that does streaming processing (I used .NET's XmlReader) and
> use the old unix trick of piping the output of bunzip2 into your
> program,  it's really pretty easy.
>
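
The piping approach quoted above can be sketched in Python as well. This is only a minimal illustration, assuming the usual MediaWiki export layout (<page> elements containing a <title> and a <text>); the names iter_pages, local and main are made up for this example, and the standard library's xml.etree.ElementTree.iterparse stands in for .NET's XmlReader.

```python
import sys
import xml.etree.ElementTree as ET


def local(tag):
    # Drop the "{namespace}" prefix that ElementTree puts in front of tag names.
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for each <page> read from a binary stream."""
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            # Free the finished page so memory use stays low on big dumps.
            elem.clear()


def main():
    # Intended use (the file name is hypothetical):
    #   bunzip2 -c dump.xml.bz2 | python read_dump.py
    for title, text in iter_pages(sys.stdin.buffer):
        print(title, len(text or ""))
```

To use it as a pipe target, call main() at the bottom of the script. The parser never sees the whole file at once, so the uncompressed dump does not need to fit on disk or in memory.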

When I worked on it.source (a small dump! something like 300 MB unzipped),
I used a simple do-it-yourself Python string search routine and found it
much faster than the Python XML routines. I presume that my scripts are
too rough to deserve sharing, but I encourage programmers to write a "simple
dump reader" that exploits the speed of string search. My personal trick was
to build an "index", i.e. a list of article names with pointers to where each
article starts in the XML file, so that it was simple and fast to recover
their content. I used it mainly because I didn't understand the API at all. ;-)
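
A minimal sketch of such an index, written for illustration and not taken from the original scripts: it assumes the uncompressed dump puts each <page>, <title> and </page> tag on its own line, as MediaWiki exports normally do, and the names build_index and read_page are hypothetical.

```python
def build_index(path):
    """Map each article title to the byte offset of its <page> line."""
    index = {}
    offset = 0          # byte position of the current line in the file
    page_start = None   # offset of the most recent <page> line, if any
    with open(path, "rb") as f:
        for line in f:
            stripped = line.strip()
            if stripped == b"<page>":
                page_start = offset
            elif page_start is not None and stripped.startswith(b"<title>"):
                end = stripped.find(b"</title>")
                if end != -1:
                    # Titles are kept raw: XML entities such as &amp; are not decoded.
                    title = stripped[len(b"<title>"):end].decode("utf-8")
                    index[title] = page_start
                page_start = None
            offset += len(line)
    return index


def read_page(path, offset):
    """Return the raw XML of one page, from its offset up to </page>."""
    chunks = []
    with open(path, "rb") as f:
        f.seek(offset)
        for line in f:
            chunks.append(line)
            if line.strip() == b"</page>":
                break
    return b"".join(chunks).decode("utf-8")
```

Building the index takes one pass over the file with plain byte comparisons; after that, read_page(path, index[title]) seeks straight to an article without parsing anything before it.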

Alex