Re: [Wikitech-l] API vs data dumps
On 14 October 2010 09:37, Alex Brollo <alex.bro...@gmail.com> wrote:
> 2010/10/13 Paul Houle <p...@ontology2.com>
>> Don't be intimidated by working with the data dumps. If you've got an XML API that does streaming processing (I used .NET's XmlReader) and use the old Unix trick of piping the output of bunzip2 into your program, it's really pretty easy.
>
> When I worked on it.source (a small dump! something like 300 MB unzipped), I used a simple do-it-yourself Python string-search routine and found it much faster than Python's XML routines. I presume my scripts are too rough to deserve sharing, but I encourage programmers to write a simple dump reader that exploits the speed of string search. My personal trick was to build an index, i.e. a list of pointers to articles and article names within the XML file, so that recovering their content was simple and fast. I used it mainly because I didn't understand the API at all. ;-)
>
> Alex

Hi Alex. I have been doing something similar in Perl for a few years for the English Wiktionary. I've never been sure of the best way to store all the index files I create, especially in code to share with other people, as I would like to happen. If you'd like to collaborate (or anyone else, for that matter), that would be pretty cool. You'll find my stuff on the Toolserver: https://fisheye.toolserver.org/browse/enwikt

Andrew Dunbar (hippietrail)
--
http://wiktionarydev.leuksman.com
http://linguaphile.sf.net

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
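[Editor's note: one possible answer to the question above about how to store and share index files is a single SQLite file, which ships easily and can be queried from almost any language. This is only an illustrative sketch, not Andrew's actual tool; the table and column names are invented.]

```python
# Store a {title: byte-offset} index in one portable SQLite file.
# Table/column names are illustrative, not from any existing tool.
import sqlite3

def save_index(index, db_path):
    """Persist a {title: offset} dict into a SQLite file."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS pages (title TEXT PRIMARY KEY, offset INTEGER)"
    )
    con.executemany("INSERT OR REPLACE INTO pages VALUES (?, ?)", index.items())
    con.commit()
    con.close()

def lookup(db_path, title):
    """Return the stored byte offset for a title, or None if absent."""
    con = sqlite3.connect(db_path)
    row = con.execute(
        "SELECT offset FROM pages WHERE title = ?", (title,)
    ).fetchone()
    con.close()
    return row[0] if row else None
```

A collaborator who receives only the `.db` file can then resolve any title to its byte offset in the dump with a single indexed query, without needing the index-building code at all.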
Re: [Wikitech-l] API vs data dumps
2010/11/7 Andrew Dunbar <hippytr...@gmail.com>
> On 14 October 2010 09:37, Alex Brollo <alex.bro...@gmail.com> wrote:
>
> Hi Alex. I have been doing something similar in Perl for a few years for the English Wiktionary. I've never been sure of the best way to store all the index files I create, especially in code to share with other people, as I would like to happen. If you'd like to collaborate (or anyone else, for that matter), that would be pretty cool. You'll find my stuff on the Toolserver: https://fisheye.toolserver.org/browse/enwikt

Thanks Andrew. I just got a Toolserver account, but don't search for any contribution by me yet... I'm rather daunted by the whole thing and the skills it requires. :-(

Alex
[Wikitech-l] API vs data dumps
I know there's some discussion about what's appropriate for the Wikipedia API, and I'd just like to share my recent experience.

I was trying to download the Wikipedia entries for people, of which I found about 800,000. I had a scanner already written that could do the download, so I got started. After running for about a day, I estimated that it would take about 20 days to bring all of the pages down through the API (running single-threaded). At that point I gave up, downloaded the data dump (3 hours) and wrote a script to extract the pages. It then took about an hour to do the extraction, gzip-compressing the text and inserting it into a MySQL database.

Don't be intimidated by working with the data dumps. If you've got an XML API that does streaming processing (I used .NET's XmlReader) and use the old Unix trick of piping the output of bunzip2 into your program, it's really pretty easy.
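[Editor's note: the streaming approach described above can be sketched in Python with the standard library's incremental XML parser. This is an illustrative equivalent of the .NET `XmlReader` idea, not the poster's actual code; the export-namespace version varies between dump generations, so check the `<mediawiki>` root element of your dump.]

```python
# Stream a MediaWiki XML dump page by page without loading it into memory.
import xml.etree.ElementTree as ET

# Namespace of the MediaWiki export format; the version number here is an
# assumption and differs between dump generations.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def stream_pages(source):
    """Yield (title, wikitext) pairs from a dump file object."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext(NS + "revision/" + NS + "text") or ""
            yield title, text
            elem.clear()  # discard the processed subtree to bound memory
```

Fed from a pipe in the Unix style described above, e.g. `bunzip2 -c pages-articles.xml.bz2 | python extract.py`, the function would simply read `sys.stdin.buffer`.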
Re: [Wikitech-l] API vs data dumps
2010/10/13 Paul Houle <p...@ontology2.com>
> Don't be intimidated by working with the data dumps. If you've got an XML API that does streaming processing (I used .NET's XmlReader) and use the old Unix trick of piping the output of bunzip2 into your program, it's really pretty easy.

When I worked on it.source (a small dump! something like 300 MB unzipped), I used a simple do-it-yourself Python string-search routine and found it much faster than Python's XML routines. I presume my scripts are too rough to deserve sharing, but I encourage programmers to write a simple dump reader that exploits the speed of string search. My personal trick was to build an index, i.e. a list of pointers to articles and article names within the XML file, so that recovering their content was simple and fast. I used it mainly because I didn't understand the API at all. ;-)

Alex
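[Editor's note: the offset-index trick described above might look like the following. This is a minimal sketch under the assumption that the dump is uncompressed and that `<page>` and `<title>` tags appear literally, as they do in the MediaWiki export format (tags inside article text are entity-escaped); it is not Alex's original script.]

```python
# Build a {title: byte-offset} index over a dump with plain string search,
# then fetch any page with a single seek instead of re-parsing the file.

def build_index(path):
    """Return {title: byte offset of its <page> tag} for an uncompressed dump."""
    index = {}
    offset = 0
    pending = None  # offset of the most recently seen <page> start
    with open(path, "rb") as f:
        for line in f:
            if b"<page>" in line:
                pending = offset + line.index(b"<page>")
            elif b"<title>" in line and pending is not None:
                start = line.index(b"<title>") + len(b"<title>")
                end = line.index(b"</title>")
                index[line[start:end].decode("utf-8")] = pending
                pending = None
            offset += len(line)
    return index

def read_page(path, index, title):
    """Seek straight to a page and return its raw XML fragment."""
    with open(path, "rb") as f:
        f.seek(index[title])
        chunk = b""
        for line in f:
            chunk += line
            if b"</page>" in line:
                return chunk.decode("utf-8")
```

Because the hot loop is byte-level substring search rather than XML parsing, a one-pass indexing run like this is typically much faster than a full parse, which matches the speed difference reported above.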