Hi Good People.

I'd like to thank everyone for helping me with labs, code reviews and other 
difficulties.


Recent Search-Related Activity:

1. Branched the project to 
svn:https://svn.wikimedia.org/svnroot/mediawiki/trunk/lucene-search-3
2. Upgraded the code from Lucene 2.4.0 to 2.9.1 last December, and I've been 
reviewing and committing to SVN.
3. I've migrated the project from Ant to Maven.
4. The Maven-based code is under continuous integration on Jenkins, with 
JUnit, PMD, and coverage reports in place.
5. With the help of some excellent volunteers, I've set up a lab to test the 
build using the Simple English Wikipedia.
6. One major setback is that there is no proper way to test or deploy 
updates. For this reason I've not closed any of the bugs I've worked on. 
(Access to the production machines is considered too sensitive now that there 
are labs, yet so far setting up a lab which replicates production has been 
unsuccessful: the labs environment is a far cry from production in terms of 
both content and updating. Once the scripts are sanitized and production 
search is put into Puppet, this may become possible.)
7. I've done some rough analysis and design for the next version of search, 
which will feature computational-linguistics support for the many languages 
used across the Wikipedias, search analytics (optimizing ranking), and 
innovative content analytics for ranking, including objective metrics on 
neutral point of view (via sentiment analysis), notability (via semantic 
algorithms), and checking of external links (anti-link spam).
8. We are trying to relicense the search code so that the Lucene community in 
the Apache projects will become more involved. It may also be necessary to 
relicense MWdumper, since the two projects are related.

In the pipeline:

1. Testing and integrating into Lucene an ANTLR grammar which parses 
wiki-syntax tables. Once successful it will also be integrated into SOLR and 
become the prototype for more difficult wiki-syntax analysis tasks.
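As a rough illustration of what that grammar targets (this is not the ANTLR grammar itself, just a minimal Python sketch of the core {| ... |} table syntax; the function name and the simplifications are mine, and real wiki tables also allow attributes, captions, and nesting):

```python
# Sketch only: parse the core of MediaWiki table syntax into a list of rows.
# Ignores cell attributes, captions (|+), and nested tables.

def parse_wiki_table(text):
    rows, current = [], []
    for line in text.strip().splitlines():
        line = line.strip()
        if line.startswith("{|") or line.startswith("|}"):
            continue                      # table open/close markers
        elif line.startswith("|-"):       # row separator: flush current row
            if current:
                rows.append(current)
            current = []
        elif line.startswith("!"):        # header cells, split on !!
            current.extend(c.strip() for c in line[1:].split("!!"))
        elif line.startswith("|"):        # data cells, split on ||
            current.extend(c.strip() for c in line[1:].split("||"))
    if current:
        rows.append(current)
    return rows
```

For example, a table like `{|` / `! Name !! Count` / `|-` / `| foo || 1` / `|}` comes back as `[['Name', 'Count'], ['foo', '1']]`; the ANTLR version will produce a proper parse tree instead of nested lists.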
2. I've started on some of the NLP tasks:
 a. A transducer of Devanagari scripts to IPA (in HFST).
 b. A transducer of English to IPA, with the common goal of indexing named 
entities based on their sound in a language-agnostic fashion (also in HFST).
 c. Extraction of phonetics data from the English Wiktionary.
 d. Conversion of the CMU pronouncing dictionary to IPA.
 e. Extraction of bilingual lexicons from Wiktionary and conversion to 
Apertium formats.
 f. Unsupervised learning of morphologies using minimum description length.
 g. Sentence boundary detection (SVM and MaxEnt models).
 h. A topological text-alignment algorithm.
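To give a taste of task (d): the CMU dictionary stores pronunciations in ARPAbet (e.g. "HELLO  HH AH0 L OW1"), so the conversion is essentially a symbol-by-symbol mapping plus stress handling. A hedged sketch (the function name is mine, the mapping table is the standard ARPAbet set, and stress markers are simply stripped here rather than converted to IPA stress marks):

```python
# Sketch of ARPAbet -> IPA conversion for CMU-dictionary entries.
# Stress digits (0/1/2) are stripped; a full converter would place
# IPA stress marks at syllable boundaries instead.

ARPABET_TO_IPA = {
    "AA": "ɑ", "AE": "æ", "AH": "ʌ", "AO": "ɔ", "AW": "aʊ", "AY": "aɪ",
    "B": "b", "CH": "tʃ", "D": "d", "DH": "ð", "EH": "ɛ", "ER": "ɝ",
    "EY": "eɪ", "F": "f", "G": "ɡ", "HH": "h", "IH": "ɪ", "IY": "i",
    "JH": "dʒ", "K": "k", "L": "l", "M": "m", "N": "n", "NG": "ŋ",
    "OW": "oʊ", "OY": "ɔɪ", "P": "p", "R": "ɹ", "S": "s", "SH": "ʃ",
    "T": "t", "TH": "θ", "UH": "ʊ", "UW": "u", "V": "v", "W": "w",
    "Y": "j", "Z": "z", "ZH": "ʒ",
}

def arpabet_to_ipa(phones):
    """Convert a list of ARPAbet phones (with stress digits) to IPA."""
    out = []
    for phone in phones:
        base = phone.rstrip("012")   # drop the stress marker
        out.append(ARPABET_TO_IPA[base])
    return "".join(out)
```

So "HH AH0 L OW1" maps to "hʌloʊ". The HFST transducers in tasks (a) and (b) play the analogous role for scripts where the mapping is context-sensitive rather than a flat lookup.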
3. A Maven-based POM for building and packaging SOLR plus our extensions for 
distributed use.
4. A repository for NLP artifacts built from wiki content.


Oren Bochman

MediaWiki Search Lead.




_______________________________________________
Wikitext-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitext-l