Re: [Wikitech-l] API vs data dumps

Alex Brollo Wed, 13 Oct 2010 15:38:15 -0700

2010/10/13 Paul Houle <[email protected]>

>
>     Don't be intimidated by working with the data dumps.  If you've got
> an XML API that does streaming processing (I used .NET's XmlReader) and
> use the old unix trick of piping the output of bunzip2 into your
> program,  it's really pretty easy.
>
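
For readers who work in Python rather than .NET, the streaming approach
quoted above could be sketched roughly like this. It swaps XmlReader for
the standard library's xml.etree.ElementTree.iterparse; the function name
iter_pages is hypothetical, and the sketch assumes the usual MediaWiki
export layout of <page> elements containing a <title> and a <text>.

```python
import xml.etree.ElementTree as ET


def iter_pages(stream):
    """Yield (title, text) for each <page> read from a binary XML stream."""
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        # Export files declare an XML namespace; keep only the local tag name.
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # drop the finished page's children to save memory
```

To use it with the piping trick, a script would call
iter_pages(sys.stdin.buffer) and be run as the right-hand side of a
bunzip2 -c pipe, so the dump is never fully decompressed to disk.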


When I worked on it.source (a small dump! something like 300 MB uncompressed),
I used a simple do-it-yourself Python string-search routine, and I found it
much faster than the Python XML routines. I presume that my scripts are
too rough to deserve sharing, but I encourage programmers to write a "simple
dump reader" that exploits the speed of string search. My personal trick was
to build an "index", i.e. a list of article names and pointers to where each
article starts in the XML file, so that it was simple and fast to recover
their content. I used it mainly because I didn't understand the API at all. ;-)
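
The index idea could be sketched roughly as follows. This is an
illustrative reconstruction, not the original scripts: the function names
build_index and read_page are hypothetical, and it assumes an
uncompressed (seekable) dump in which <page> and <title> sit on separate
lines and each title fits on one line.

```python
def build_index(dump):
    """Map each page title to the byte offset of its <page> tag.

    `dump` is a binary file object. Plain substring search only, no XML
    parsing, so entities such as &amp; in titles are left undecoded.
    """
    index = {}
    page_start = None
    offset = dump.tell()
    for line in iter(dump.readline, b""):
        if b"<page>" in line:
            page_start = offset + line.index(b"<page>")
        elif page_start is not None and b"<title>" in line:
            start = line.index(b"<title>") + len(b"<title>")
            end = line.index(b"</title>", start)
            index[line[start:end].decode("utf-8")] = page_start
            page_start = None
        offset += len(line)
    return index


def read_page(dump, offset):
    """Return the raw XML of the page starting at `offset`."""
    dump.seek(offset)
    chunks = []
    for line in iter(dump.readline, b""):
        chunks.append(line)
        if b"</page>" in line:
            break
    return b"".join(chunks).decode("utf-8")
```

With the dump opened in binary mode, build_index is run once and
read_page(dump, index[title]) then jumps straight to any article without
rescanning the file.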

Alex
_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
