Re: [Wikitech-l] API vs data dumps

2010-11-07 Thread Andrew Dunbar
On 14 October 2010 09:37, Alex Brollo alex.bro...@gmail.com wrote:
 2010/10/13 Paul Houle p...@ontology2.com


     Don't be intimidated by working with the data dumps.  If you've got
 an XML API that does streaming processing (I used .NET's XmlReader) and
 use the old unix trick of piping the output of bunzip2 into your
 program,  it's really pretty easy.


 When I worked on it.source (a small dump, something like 300 MB unzipped),
 I used a simple do-it-yourself Python string-search routine, and I found it
 much faster than Python's XML routines. I presume that my scripts are really
 too rough to deserve sharing, but I encourage programmers to write a simple
 dump reader that exploits the speed of string search. My personal trick was
 to build an index, i.e. a list of pointers to articles and article names
 within the XML file, so that it was simple and fast to recover their content.
 I used it mainly because I didn't understand the API at all. ;-)

 Alex


Hi Alex. I have been doing something similar in Perl for a few years for the
English Wiktionary. I've never been sure of the best way to store all the
index files I create, especially in code I'd like to share with other people.
If you, or anyone else for that matter, would like to collaborate, that would
be pretty cool.

You'll find my stuff on the Toolserver:
https://fisheye.toolserver.org/browse/enwikt

Andrew Dunbar (hippietrail)


-- 
http://wiktionarydev.leuksman.com http://linguaphile.sf.net



Re: [Wikitech-l] API vs data dumps

2010-11-07 Thread Alex Brollo
2010/11/7 Andrew Dunbar hippytr...@gmail.com

 On 14 October 2010 09:37, Alex Brollo alex.bro...@gmail.com wrote:

 Hi Alex. I have been doing something similar in Perl for a few years for the
 English Wiktionary. I've never been sure of the best way to store all the
 index files I create, especially in code I'd like to share with other people.
 If you, or anyone else for that matter, would like to collaborate, that would
 be pretty cool.

 You'll find my stuff on the Toolserver:
 https://fisheye.toolserver.org/browse/enwikt

Thanks Andrew. I just got a Toolserver account, but don't search for any
contributions by me yet... I'm rather daunted by the whole thing and the
skills it requires. :-(

Alex


[Wikitech-l] API vs data dumps

2010-10-13 Thread Paul Houle
  I know there's some discussion about what's appropriate for the 
Wikipedia API,  and I'd just like to share my recent experience.

 I was trying to download the Wikipedia entries for people,  of 
which I found about 800,000.   I had a scanner already written that 
could do the download,  so I got started.

     After running for about a day, I estimated that it would take about
20 days to bring all of the pages down through the API (running
single-threaded). At that point I gave up, downloaded the data dump (3
hours), and wrote a script to extract the pages -- it then took about an
hour to do the extraction, gzip-compressing the text and inserting it into
a MySQL database.

 Don't be intimidated by working with the data dumps.  If you've got 
an XML API that does streaming processing (I used .NET's XmlReader) and 
use the old unix trick of piping the output of bunzip2 into your 
program,  it's really pretty easy.
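
     A minimal Python sketch of the same streaming approach (using
xml.etree.ElementTree.iterparse in place of .NET's XmlReader; the namespace
URI and dump file name are assumptions and vary by dump version) might look
like this:

    # Run as:  bunzip2 -c enwiki-pages-articles.xml.bz2 | python stream_pages.py
    import sys
    import xml.etree.ElementTree as ET

    NS = "{http://www.mediawiki.org/xml/export-0.4/}"  # check the xmlns of your dump

    def stream_pages(source):
        """Yield (title, wikitext) pairs without loading the whole dump into memory."""
        for event, elem in ET.iterparse(source, events=("end",)):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                text = elem.findtext(NS + "revision/" + NS + "text") or ""
                yield title, text
                elem.clear()  # free memory used by pages already handled

    if __name__ == "__main__":
        for title, text in stream_pages(sys.stdin.buffer):
            print(title)  # replace with real processing

The elem.clear() call is what keeps memory flat; without it, iterparse keeps
every parsed element in the tree as it goes.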



Re: [Wikitech-l] API vs data dumps

2010-10-13 Thread Alex Brollo
2010/10/13 Paul Houle p...@ontology2.com


 Don't be intimidated by working with the data dumps.  If you've got
 an XML API that does streaming processing (I used .NET's XmlReader) and
 use the old unix trick of piping the output of bunzip2 into your
 program,  it's really pretty easy.


When I worked on it.source (a small dump, something like 300 MB unzipped),
I used a simple do-it-yourself Python string-search routine, and I found it
much faster than Python's XML routines. I presume that my scripts are really
too rough to deserve sharing, but I encourage programmers to write a simple
dump reader that exploits the speed of string search. My personal trick was
to build an index, i.e. a list of pointers to articles and article names
within the XML file, so that it was simple and fast to recover their content.
I used it mainly because I didn't understand the API at all. ;-)

Alex
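
Alex's scripts are not attached, but the index idea -- scan an uncompressed
dump once, remember the byte offset of every <page> element keyed by title,
then seek straight to a page instead of re-parsing the XML -- could be
sketched in Python along these lines (tag names and helpers are illustrative,
not his actual code):

    import re

    TITLE_RE = re.compile(rb"<title>(.*?)</title>")

    def build_index(dump_path):
        """Map page title -> byte offset of its <page> element in the dump file."""
        index, offset, page_offset = {}, 0, 0
        with open(dump_path, "rb") as f:
            for line in f:
                if b"<page>" in line:
                    page_offset = offset
                m = TITLE_RE.search(line)
                if m:
                    index[m.group(1).decode("utf-8")] = page_offset
                offset += len(line)
        return index

    def read_page(dump_path, index, title):
        """Seek to the stored offset and read until the closing </page> tag."""
        chunks = []
        with open(dump_path, "rb") as f:
            f.seek(index[title])
            for line in f:
                chunks.append(line)
                if b"</page>" in line:
                    break
        return b"".join(chunks).decode("utf-8")

The index itself can be pickled to disk, so the single full scan only has to
happen once per dump.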