Hello,

I'm a computer science researcher at the University of Avignon, France. I
recently developed a piece of software that automatically and quickly
extracts from a UTF-8 text all the (longest) terms that belong to a large
set of terms.
The term extractor runs as a server, and I tested it successfully with a
thesaurus made of the page titles of fr.wikipedia.org, en.wikipedia.org
and es.wikipedia.org, i.e. 9,387,079 distinct terms composed of 4,496,195
distinct words.
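For readers curious about the general idea, here is a minimal sketch of
longest-match term extraction using a word-level trie. This is only an
illustration of the technique, not the actual FELTS implementation; the
function names, the whitespace tokenization, and the toy term list are all
my own assumptions.

```python
# Hypothetical sketch of longest-match term extraction, NOT the FELTS code.

def build_trie(terms):
    """Build a word-level trie: each node maps a word to a child dict;
    the sentinel key None marks the end of a complete term."""
    root = {}
    for term in terms:
        node = root
        for word in term.split():
            node = node.setdefault(word, {})
        node[None] = True  # end-of-term marker
    return root

def extract_longest_terms(text, trie):
    """Scan the tokenized text left to right; at each position emit the
    longest term starting there, then continue past the match."""
    words = text.split()  # naive whitespace tokenizer (assumption)
    matches = []
    i = 0
    while i < len(words):
        node, j, last = trie, i, None
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
            if None in node:
                last = j  # remember the longest match so far
        if last is not None:
            matches.append(" ".join(words[i:last]))
            i = last
        else:
            i += 1
    return matches

terms = ["natural language", "natural language processing", "trie"]
trie = build_trie(terms)
print(extract_longest_terms("a trie for natural language processing tasks", trie))
# → ['trie', 'natural language processing']
```

With a set the size of the Wikipedia-titles thesaurus mentioned above, a
production system would of course need a much more compact dictionary
structure than plain Python dicts.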
You are invited to test my demonstration at:
http://dev.termwatch.es/~jourlin/demo.php
The source code is available on GitHub (use, redistribution and
modification are permitted under the terms of the GNU General Public
License v3):
https://github.com/jourlin/FELTS

My rough guess is that it could be of some interest for the development of
MediaWiki, but I would very much appreciate any feedback before I look
further into that question.

Best regards,

Pierre Jourlin.


_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l