[ANN] Luke + Hadoop, alpha version

Andrzej Bialecki Fri, 10 Jul 2009 03:08:57 -0700

Hi all,

I prepared a special edition of Luke, the Lucene Index Toolbox, that
works with Lucene indexes located on any filesystem supported by Hadoop
0.19.1.


At the moment I'm looking for feedback on how best to integrate this
functionality with the various bits and pieces of Luke. You can download
the jar file from this direct link:

        http://www.getopt.org/luke/lukeall-0.9.3.jar

This JAR contains all dependencies needed to connect to HDFS, KFS or
S3/S3n filesystems, although I tested it only with HDFS so far.

Note: this version of Luke still uses Lucene 2.4.1; I haven't started
integrating 2.9-dev yet.

Quick info for the impatient: yes, you can browse the content, view
terms and documents, perform searching, explaining, etc. See below for
more details.

The initial Open dialog is not yet integrated with this functionality.
After you start Luke, dismiss this dialog, go to Plugins / Hadoop
Plugin, enter the full URI of the index in the text field, and press the
Open button. There is no filesystem browsing for now - you need to know
the full URI in advance.
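
As a rough illustration of what a "full URI" looks like for the
filesystems mentioned above, here is a small Java sketch that checks
whether a string is an absolute URI with one of those schemes. This is
not code from Luke: the class name IndexUriCheck, the method name, and
the host, port and path in the example are all made up, and the set of
schemes a particular Hadoop installation accepts may differ.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Luke or Hadoop.
final class IndexUriCheck {
    // Schemes named in the announcement; treated here as an assumption.
    private static final Set<String> SCHEMES =
            new HashSet<String>(Arrays.asList("hdfs", "kfs", "s3", "s3n"));

    /**
     * Returns true if the text parses as an absolute URI that uses one of
     * the schemes above and has a path longer than just "/".
     */
    public static boolean looksLikeFullIndexUri(String text) {
        try {
            URI uri = new URI(text.trim());
            String scheme = uri.getScheme();
            String path = uri.getPath(); // null for opaque URIs such as "hdfs:foo"
            return scheme != null
                    && SCHEMES.contains(scheme.toLowerCase(Locale.ROOT))
                    && path != null
                    && path.length() > 1;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical host, port and path, for illustration only.
        System.out.println(looksLikeFullIndexUri(
                "hdfs://namenode.example.com:9000/user/me/index"));
        // A bare path has no scheme, so it is not a full URI.
        System.out.println(looksLikeFullIndexUri("/user/me/index"));
    }
}
```

A check like this only validates the shape of the string; it does not
contact the filesystem or verify that an index exists at that location.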

Current functionality is as follows:

- you can open a single index or partial (sharded) indexes located in
part-NNNNN/ subdirectories (this is a typical layout resulting from
using common map-reduce output formats). In the latter case you will get
a single view of partial indexes, thanks to MultiReader.

- access is read-only - most FileSystem implementations don't support
file updates, so it was easiest to disable write access altogether for
now.

- most of Luke's functionality works properly, thanks to the excellent
design of the IndexReader API. Some operations are disabled because
access is read-only, and some information (like top terms) is not
populated by default due to the high IO cost, but can be requested
explicitly.

- the plugin keeps track of the amount of IO reads - I found this very
comforting when opening large indexes over a slow VPN line ... There is
a "Clear" button on the plugin's tab that resets the counters - this is
useful to see how much IO is needed to complete a specific operation.

- a lot of code has been reworked to avoid UI stalls when doing slow IO.
This means that you can watch the amount of IO being done while an
operation runs, but the UI is blocked by a modal dialog in the meantime.
It's a bit unwieldy, but other solutions would require too much
refactoring.
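
To illustrate the IO counters and the "Clear" button described in the
list above, here is a minimal Java sketch of one way such counters could
work: a wrapper around an InputStream that records every read, plus a
method that resets the totals. This is not the plugin's actual code. The
class name IoCounters and its methods are made up for this example, and
whether the plugin counts bytes, read calls, or both is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch, not taken from Luke.
final class IoCounters {
    private final AtomicLong bytesRead = new AtomicLong();
    private final AtomicLong readCalls = new AtomicLong();

    // Records one read call and the number of bytes it returned (if any).
    private void record(int bytes) {
        readCalls.incrementAndGet();
        if (bytes > 0) {
            bytesRead.addAndGet(bytes);
        }
    }

    public long bytesRead() { return bytesRead.get(); }

    public long readCalls() { return readCalls.get(); }

    // What a "Clear" button could call before starting a new measurement.
    public void clear() {
        bytesRead.set(0);
        readCalls.set(0);
    }

    // Wraps a stream so that its reads are counted. skip() is not counted.
    public InputStream wrap(final InputStream source) {
        return new FilterInputStream(source) {
            @Override
            public int read() throws IOException {
                int b = in.read();
                record(b >= 0 ? 1 : 0);
                return b;
            }

            @Override
            public int read(byte[] buf, int off, int len) throws IOException {
                int n = in.read(buf, off, len);
                record(n); // n is -1 at end of stream: counts the call, no bytes
                return n;
            }
        };
    }

    public static void main(String[] args) throws IOException {
        IoCounters counters = new IoCounters();
        // An in-memory stream stands in for a slow remote file.
        InputStream in = counters.wrap(new ByteArrayInputStream(new byte[1000]));
        byte[] buf = new byte[256];
        while (in.read(buf) != -1) {
            // drain the stream
        }
        System.out.println("bytes=" + counters.bytesRead()
                + " calls=" + counters.readCalls());
        counters.clear();
        System.out.println("after clear: bytes=" + counters.bytesRead());
    }
}
```

Clearing the counters right before an operation and reading them right
after it gives the IO cost of that operation alone, which is the use
case described for the "Clear" button.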

Any feedback is welcome - please keep in mind that this is an early
preview. Also, various UI glitches are probably related to the Thinlet
toolkit - again, one day I may re-write Luke using something else, but
for now I don't have the strength to do it.  :)




--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
