Hi all;
This weekend I worked on full-text search of data libraries. Kimberly had 
mentioned it earlier on the list and issue 418 is tracking the enhancement 
suggestion:

https://bitbucket.org/galaxy/galaxy-central/issue/418/extend-search-capabilities-vs-libraries

I decided to tackle the hardest issue first -- full text searching of
data library items -- with the idea of putting a framework in place
that could then be extended and finalized. The approach used
separates indexing and search functionality from Galaxy itself; two
configurable URLs are called:

fulltext_index_url = http://localhost:8090/index
fulltext_find_url = http://localhost:8090/find

The first gets passed a CSV file of identifiers and files to be
indexed, and the second retrieves the IDs based on a search term.
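
As a rough illustration of what a client of these two URLs has to
prepare (a sketch only, not the code in the patches): the CSV column
order (identifier, then file path) and the "term" query parameter are
assumptions made for the example, and the real server may expect
something different.

```python
import csv
import io
from urllib.parse import urlencode

# Default from the configuration shown above.
FIND_URL = "http://localhost:8090/find"


def build_index_csv(items):
    """Serialize (identifier, file path) pairs as CSV text.

    The two-column layout is an assumption for illustration.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for item_id, path in items:
        writer.writerow([item_id, path])
    return out.getvalue()


def build_find_url(term, base_url=FIND_URL):
    """Build a lookup URL for a search term.

    The "term" parameter name is hypothetical.
    """
    return base_url + "?" + urlencode({"term": term})
```

The csv module quotes any path containing a comma, so identifiers and
paths stay aligned when the file is read back on the indexing side.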

A small server uses Lucene on the backend to do all the full-text 
indexing and lookup:

https://github.com/chapmanb/kwd-doc-find

This is meant to be easy to set up and run, but a default Galaxy-only
installation could also implement the index and search itself to provide
much simpler functionality that searches against filenames or
descriptions of library items. For a pure Galaxy default, Whoosh 
looks promising:

https://bitbucket.org/mchaput/whoosh/wiki/Home
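
To give a feel for the simpler name/description search, here is a
minimal in-memory sketch in plain Python. It does not use Whoosh or any
Galaxy code; the (id, name, description) tuples and the function names
are made up for the example, and a real implementation would persist
the index and handle stemming, ranking and so on.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(items):
    """Map each token to the set of item ids whose name or
    description contains it.

    items: iterable of (item_id, name, description) tuples.
    """
    index = defaultdict(set)
    for item_id, name, description in items:
        for token in tokenize(name + " " + description):
            index[token].add(item_id)
    return index


def find(index, query):
    """Return the ids of items matching every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result
```

Requiring all query tokens to match (an AND search) is a design choice
for the sketch; an OR search would union the sets instead.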

On the Galaxy side, there are two patches. The first is a script that
prepares a file for indexing and submits it to the index URL. This would
be run from a cronjob to keep the indexes fresh:

https://bitbucket.org/chapmanb/galaxy-central/changeset/b47d1bfa52da
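
For readers who want the general shape of such a submission step
without reading the changeset, a hedged sketch follows. It is not the
script from the patch: sending the CSV as a raw POST body with a
text/csv content type is an assumption, as are the function names, and
the real script may upload the file differently.

```python
import urllib.request


def make_index_request(csv_text, index_url="http://localhost:8090/index"):
    """Build (but do not send) a POST request carrying the CSV.

    Raw-body upload is an assumption made for this example.
    """
    return urllib.request.Request(
        index_url,
        data=csv_text.encode("utf-8"),
        headers={"Content-Type": "text/csv"},
        method="POST",
    )


def submit(request, timeout=30):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping request construction separate from sending makes the cron
script easy to dry-run against a list of library items before pointing
it at a live index server.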

The second uses the search box in the top-level Data Library grid to 
do full searching of library items. It reuses all the display and
permissions machinery, making adjustments to handle displaying
a set of collected search result files:

https://bitbucket.org/chapmanb/galaxy-central/changeset/c038fd24cf48

This is working well here and scales nicely to the ~1000 items in our
current data library. I have several ideas for enhancements after this
initial version, but thought I would first discuss with the Galaxy team
to see if this is of interest and takes a reasonable approach.

If so, the easiest working strategy would be for me to submit patches to
the bug report that y'all could check and approve so I could stay in
sync with galaxy-central as much as possible. The two above should
apply cleanly now (with a couple of stray nglims configuration lines in
the first; sorry) and we could build off of that.

Happy to hear any thoughts or feedback. Thanks,
Brad
_______________________________________________
galaxy-dev mailing list
galaxy-dev@lists.bx.psu.edu
http://lists.bx.psu.edu/listinfo/galaxy-dev
