* Khalida BEN SIDI AHMED wrote:
>JWPL first needs to create a database whose size is 158 GB, and at
>least 2 GB of RAM are necessary. I have neither a big hard disk nor
>that much RAM. Besides, creating such a big database just to extract
>the first sentence of each article does not seem like the appropriate
>solution to me.

The dumps on http://dumps.wikimedia.org/backup-index.html have "page
abstracts" which typically contain the first sentence of each article.

I've found that http://inamidst.com/phenny/modules/wikipedia.py (part
of an IRC bot) works quite well, at least on the English version. I'd
probably use my http://cutycapt.sf.net/ utility like so:

  % CutyCapt --url=http://en.wikipedia.org/wiki/Empire \
             --user-style-string="
               .mw-content-ltr > * { display: none }
               .mw-content-ltr > p:first-of-type,
               .mw-content-ltr > p:first-of-type * { display: inline }
             " \
             --out=output.txt

The resulting output.txt would then contain something like

  Please read:
  A personal appeal from 
  Wikipedia founder Jimmy Wales
  Read now 
  Empire
  From Wikipedia, the free encyclopedia
  The term empire derives from the Latin imperium (power, authority)...

You would then just have to strip the leading gibberish (as sketched
below) and possibly fiddle with the user style sheet, for instance to
remove references.
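
For the stripping step, a minimal sketch based on the sample output
above; the "From Wikipedia, the free encyclopedia" marker is an
assumption taken from that sample and may vary with skin and language:

  with open("output.txt", encoding="utf-8") as f:
      lines = [line.strip() for line in f]

  marker = "From Wikipedia, the free encyclopedia"
  if marker in lines:
      rest = lines[lines.index(marker) + 1:]
      # the first non-empty line after the marker is the first sentence
      print(next((line for line in rest if line), ""))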

You could also just use a proper HTML parser and simply pick the
`.mw-content-ltr > p:first-of-type` paragraph, though for just a few
articles that would involve some setup cost.
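
As a sketch of that parser route, assuming the requests and
beautifulsoup4 packages (a reasonably recent beautifulsoup4 is needed
for :first-of-type support), and assuming Wikipedia still marks
footnotes with <sup class="reference">:

  import requests
  from bs4 import BeautifulSoup

  html = requests.get("http://en.wikipedia.org/wiki/Empire").text
  soup = BeautifulSoup(html, "html.parser")
  p = soup.select_one(".mw-content-ltr > p:first-of-type")
  if p is not None:
      # drop footnote markers such as [1]; the sup.reference class
      # is an assumption about Wikipedia's current markup
      for sup in p.select("sup.reference"):
          sup.decompose()
      print(p.get_text())
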
-- 
Björn Höhrmann · mailto:[email protected] · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 
