On Thu, Dec 8, 2011 at 11:01 AM, Jay Luker <lb...@reallywow.com> wrote:
> Hi,
>
> I am trying to provide a means to search our corpus of nearly 2
> million fulltext astronomy and physics articles using regular
> expressions. A small percentage of our users need to be able to
> locate, for example, certain types of identifiers that are present
> within the fulltext (grant numbers, dataset identifiers, etc.).
>
> My straightforward attempts to do this using RegexQuery have been
> successful only in the sense that I get the results I'm looking for.
> The performance, however, is pretty terrible, with most queries taking
> five minutes or longer. Is this the performance I should expect
> considering the size of my index and the massive number of terms? Are
> there any alternative approaches I could try?
>
> Things I've already tried:
>  * reducing the sheer number of terms by adding a LengthFilter,
> min=6, to my index analysis chain
>  * swapping in the JakartaRegexpCapabilities
>
> Things I intend to try if no one has any better suggestions:
>  * chunk up the index and search concurrently, either by sharding or
> using a RangeQuery based on document id
>
> Any suggestions appreciated.
>

In my opinion this RegexQuery is not really scalable: its cost is
always linear in the number of terms, except in the very rare cases
where it can compute a "common prefix" (and it is slow to boot).
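
To illustrate the point about term count and the "common prefix", here
is a toy sketch in plain Java. It is not Lucene code: the TreeSet only
stands in for a sorted term dictionary, and the class TermScanSketch,
its methods, and the grant-number-like terms are all made up for the
example. The prefix extraction is deliberately conservative and is an
assumption of mine, not how RegexQuery actually does it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Toy model only: a sorted set of strings stands in for the term dictionary.
class TermScanSketch {
    // Number of terms tested against the regex by the most recent scan.
    static int termsVisited;

    // Conservative literal prefix: the leading run of letters/digits.
    // Returns "" whenever a prefix cannot be trusted (alternation, leading metacharacter).
    static String literalPrefix(String regex) {
        if (regex.indexOf('|') >= 0) {
            return "";
        }
        int i = 0;
        while (i < regex.length() && Character.isLetterOrDigit(regex.charAt(i))) {
            i++;
        }
        // A quantifier applies to the last character of the run, so drop that character.
        if (i > 0 && i < regex.length() && "*?+{".indexOf(regex.charAt(i)) >= 0) {
            i--;
        }
        return regex.substring(0, i);
    }

    // No usable prefix: every term in the dictionary is tested (linear in the term count).
    static List<String> fullScan(TreeSet<String> terms, String regex) {
        Pattern p = Pattern.compile(regex);
        termsVisited = 0;
        List<String> hits = new ArrayList<>();
        for (String t : terms) {
            termsVisited++;
            if (p.matcher(t).matches()) {
                hits.add(t);
            }
        }
        return hits;
    }

    // With a literal prefix: seek to it and stop once terms no longer start with it.
    static List<String> prefixScan(TreeSet<String> terms, String regex) {
        Pattern p = Pattern.compile(regex);
        String prefix = literalPrefix(regex);
        termsVisited = 0;
        List<String> hits = new ArrayList<>();
        for (String t : terms.tailSet(prefix)) {
            if (!t.startsWith(prefix)) {
                break;
            }
            termsVisited++;
            if (p.matcher(t).matches()) {
                hits.add(t);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical terms, invented for the example.
        TreeSet<String> terms = new TreeSet<>(Arrays.asList(
            "galaxy", "grant", "nag512345", "nasa", "nebula",
            "nnx09ab12g", "nnx10ac34h", "quasar"));
        String regex = "nnx[0-9]{2}[a-z]{2}[0-9]{2}[a-z]";

        System.out.println(fullScan(terms, regex) + " visited " + termsVisited);
        System.out.println(prefixScan(terms, regex) + " visited " + termsVisited);
        // A leading .* leaves no literal prefix, so this falls back to visiting everything.
        System.out.println(prefixScan(terms, ".*x10.*") + " visited " + termsVisited);
    }
}
```

With only a handful of terms the difference is trivial, but the visited
count is what grows with the size of the term dictionary when no prefix
is available.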

You can try svn trunk's RegexpQuery (don't forget the "p") from lucene
core instead. It also works from the queryparser: /[ab]foo/,
myfield:/bar/, etc.

It is faster, but keep in mind that it's only as good as the regular
expression you give it: if the expression looks like /.*foo.*/, then
it's just as slow as a wildcard query for *foo*.

-- 
lucidimagination.com
