2010/10/13 Paul Houle <p...@ontology2.com>
>
> Don't be intimidated by working with the data dumps. If you've got
> an XML API that does streaming processing (I used .NET's XmlReader) and
> use the old unix trick of piping the output of bunzip2 into your
> program, it's really pretty easy.
>
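For anyone who wants to try the piping approach quoted above in Python rather than .NET, a minimal sketch might look like the following. It is not the code either poster used. It assumes the dump contains `<page>` elements with `<title>` and `<text>` children, as in the MediaWiki XML export format, and the names `local_name`, `iter_pages` and `main` are made up for illustration.

```python
import sys
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for each <page> element, one page at a time."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                # If a page holds several revisions, the last one wins.
                text = child.text or ""
        yield title, text
        elem.clear()  # release the finished page's contents


def main():
    # Read the decompressed XML from standard input as bytes.
    for title, text in iter_pages(sys.stdin.buffer):
        print(title, len(text))
```

Saved as a hypothetical `read_dump.py` with `main()` called at the bottom, this would be run as `bunzip2 -c dump.xml.bz2 | python read_dump.py`, so the uncompressed dump never has to be written to disk.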
When I worked on it.source (a small dump, something like 300 MB uncompressed), I used a simple do-it-yourself Python string-search routine and found it much faster than Python's XML routines.

I presume that my scripts are too rough to deserve sharing, but I encourage programmers to write a "simple dump reader" that exploits the speed of string search. My personal trick was to build an "index", i.e. a list of article names with pointers to where each article starts in the XML file, so that recovering their content was simple and fast.

I used it mainly because I didn't understand the API at all. ;-)

Alex
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
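The "index" trick described in the message above could be sketched roughly as follows. This is not Alex's script, which was not shared; it is one possible reading of the idea. It assumes an uncompressed dump in which each `<page>` and `<title>` tag sits on a single line, and the names `build_index` and `read_page` are hypothetical.

```python
def build_index(path):
    """Map each article title to the byte offset of its <page> tag."""
    index = {}
    page_start = None
    offset = 0  # byte offset of the current line within the file
    with open(path, "rb") as dump:
        for line in dump:
            if b"<page>" in line:
                page_start = offset + line.find(b"<page>")
            if page_start is not None and b"<title>" in line:
                start = line.find(b"<title>") + len(b"<title>")
                end = line.find(b"</title>", start)
                # Titles are stored as they appear in the file;
                # XML entities such as &amp; are left unescaped.
                index[line[start:end].decode("utf-8")] = page_start
                page_start = None
            offset += len(line)
    return index


def read_page(path, index, title):
    """Seek to a page's offset and return its raw XML as a string."""
    chunks = []
    with open(path, "rb") as dump:
        dump.seek(index[title])
        for line in dump:
            chunks.append(line)
            if b"</page>" in line:
                break
    return b"".join(chunks).decode("utf-8")
```

Building the index means one pass over the whole file with plain substring searches. After that, each lookup is a single seek plus a short read, which is where the speed would come from.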