Re: RegexQuery performance

2011-12-12 Thread Jay Luker
On Sat, Dec 10, 2011 at 9:25 PM, Erick Erickson erickerick...@gmail.com wrote:
 My off-the-top-of-my-head notion is you implement a
 Filter whose job is to emit some special tokens when
 you find strings like this that allow you to search without
 regexes. For instance, in the example you give, you could
 index something like...oh... I don't know, ###VER### as
 well as the normal text of IRAS-A-FPA-3-RDR-IMPS-V6.0.
 Now, when searching for docs with the pattern you used
 as an example, you look for ###VER### instead. I guess
 it all depends on how many regexes you need to allow.
 This wouldn't work at all if you allow users to put in arbitrary
 regexes, but if you have a small enough number of patterns
 you'll allow, something like this could work.

This is a great suggestion. I think the number of users that need this
feature, as well as the variety of regexes that would be used, is small
enough that it could definitely work. That turns it into a problem of
collecting the necessary regexes, plus the UI details.

Thanks!
--jay


Re: RegexQuery performance

2011-12-10 Thread Jay Luker
Hi Erick,

On Fri, Dec 9, 2011 at 12:37 PM, Erick Erickson erickerick...@gmail.com wrote:
 Could you show us some examples of the kinds of things
 you're using regex for? I.e. the raw text and the regex you
 use to match the example?

Sure!

An example identifier would be IRAS-A-FPA-3-RDR-IMPS-V6.0, which
identifies a particular Planetary Data System data set. Another
example is ULY-J-GWE-8-NULL-RESULTS-V1.0. These kinds of strings
frequently appear in the references section of the articles, so the
context looks something like,

 ... rvey. IRAS-A-FPA-3-RDR-IMPS-V6.0, NASA Planetary Data System
Tholen, D. J. 1989, in Asteroids II, ed ... 

The simple and straightforward regex I've been using is
/[A-Z0-9:\-]+V\d+\.\d+/. There may be a smarter regex approach but I
haven't put my mind to it because I assumed the primary performance
issue was elsewhere.
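Just to make that concrete, here is how the pattern behaves in plain Python's re module (a quick illustration of the matching, not the actual Lucene code path):

```python
import re

# The identifier pattern from above: uppercase letters, digits, colons,
# and hyphens, ending in a version suffix like "V6.0".
PATTERN = re.compile(r"[A-Z0-9:\-]+V\d+\.\d+")

# The reference-section context from the example above.
text = ("... rvey. IRAS-A-FPA-3-RDR-IMPS-V6.0, NASA Planetary Data System "
        "Tholen, D. J. 1989, in Asteroids II, ed ...")

# Pulls the full identifier out of the surrounding citation text.
print(PATTERN.findall(text))
```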

 The reason I ask is that perhaps there are other approaches,
 especially thinking about some clever analyzing at index time.

 For instance, perhaps NGrams are an option. Perhaps
 just making WordDelimiterFilterFactory do its tricks. Perhaps.

WordDelimiter does help in the sense that if you search for a specific
identifier you will usually find fairly accurate results, even for
cases where the hyphens resulted in the term being broken up. But I'm
not sure how WordDelimiter can help if I want to search for a pattern.

I tried a few tweaks to the index, like putting a minimum character
count for terms, making sure WordDelimiter's preserveOriginal is
turned on, indexing without lowercasing so that I don't have to use
Pattern.CASE_INSENSITIVE. Performance was not improved significantly.
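(For reference, those tweaks correspond to an index analyzer along these lines; the field type name and tokenizer here are illustrative, not our exact schema:)

```xml
<fieldType name="text_ids" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- split on hyphens etc., but also keep the unsplit original token -->
    <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"/>
    <!-- note: no LowerCaseFilterFactory, so case-sensitive regexes still match -->
  </analyzer>
</fieldType>
```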

The new RegexpQuery mentioned by R. Muir looks promising, but I
haven't built an instance of trunk yet to try it out. Any other
suggestions appreciated.

Thanks!
--jay


 In other words, this could be an XY problem

 Best
 Erick

 On Thu, Dec 8, 2011 at 11:14 AM, Robert Muir rcm...@gmail.com wrote:
 On Thu, Dec 8, 2011 at 11:01 AM, Jay Luker lb...@reallywow.com wrote:
 Hi,

 I am trying to provide a means to search our corpus of nearly 2
 million fulltext astronomy and physics articles using regular
 expressions. A small percentage of our users need to be able to
 locate, for example, certain types of identifiers that are present
 within the fulltext (grant numbers, dataset identifiers, etc.).

 My straightforward attempts to do this using RegexQuery have been
 successful only in the sense that I get the results I'm looking for.
 The performance, however, is pretty terrible, with most queries taking
 five minutes or longer. Is this the performance I should expect
 considering the size of my index and the massive number of terms? Are
 there any alternative approaches I could try?

 Things I've already tried:
  * reducing the sheer number of terms by adding a LengthFilter,
 min=6, to my index analysis chain
  * swapping in the JakartaRegexpCapabilities

 Things I intend to try if no one has any better suggestions:
  * chunk up the index and search concurrently, either by sharding or
 using a RangeQuery based on document id

 Any suggestions appreciated.


 This RegexQuery is not really scalable in my opinion: it's always
 linear in the number of terms (and slow to boot), except in the
 super-rare circumstances where it can compute a common prefix.

 You can try svn trunk's RegexpQuery -- don't forget the p -- from
 lucene core (it works from the queryparser: /[ab]foo/, myfield:/bar/,
 etc.)

 The performance is faster, but keep in mind it's only as good as the
 regular expression: if the regex is something like /.*foo.*/, then
 it's just as slow as a wildcard search for *foo*.

 --
 lucidimagination.com


Re: RegexQuery performance

2011-12-10 Thread Erick Erickson
Hmmm, I don't know all that much about the universe
you're searching (I'm *really* sorry about that, but I
couldn't resist) but I wonder if you can't turn the problem
on its head and do your regex stuff at index time instead.

My off-the-top-of-my-head notion is you implement a
Filter whose job is to emit some special tokens when
you find strings like this that allow you to search without
regexes. For instance, in the example you give, you could
index something like...oh... I don't know, ###VER### as
well as the normal text of IRAS-A-FPA-3-RDR-IMPS-V6.0.
Now, when searching for docs with the pattern you used
as an example, you look for ###VER### instead. I guess
it all depends on how many regexes you need to allow.
This wouldn't work at all if you allow users to put in arbitrary
regexes, but if you have a small enough number of patterns
you'll allow, something like this could work.

The Filter I'm thinking of might behave something like a
SynonymFilter and emit multiple tokens at the same position.
You'd have to take some care that the *query* part of the
analyzer chain didn't undo whatever special symbols you used,
but that's all do-able.

I guess the idea here is that if you can map out all the kinds
of regex patterns you want to apply at query time and apply
them at index time instead it might work. Then you have to
work out how to allow the users to pick the special patterns,
but that's a UI problem...
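A rough sketch of that filter idea, written as stand-in Python rather than an actual Lucene TokenFilter (the ###VER### marker and the identifier pattern come from the example above; the function and names are made up for illustration):

```python
import re

# Identifiers like IRAS-A-FPA-3-RDR-IMPS-V6.0 (see the example above).
VERSIONED_ID = re.compile(r"[A-Z0-9:\-]+V\d+\.\d+$")

def tag_tokens(tokens):
    """Emit (token, position) pairs; when a token matches the identifier
    pattern, also emit the ###VER### marker token at the SAME position,
    the way a SynonymFilter stacks synonyms."""
    out = []
    for pos, tok in enumerate(tokens):
        out.append((tok, pos))
        if VERSIONED_ID.match(tok):
            out.append(("###VER###", pos))
    return out

# A query for ###VER### now finds every document containing such an
# identifier, with no regex evaluated at search time.
print(tag_tokens(["survey", "IRAS-A-FPA-3-RDR-IMPS-V6.0", "NASA"]))
```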

From a fortune cookie:
A programmer had a problem that he tried to solve with
regular expressions. Now he has two problems. <g>

Best
Erick

On Sat, Dec 10, 2011 at 9:20 AM, Jay Luker lb...@reallywow.com wrote:
 Hi Erick,

 On Fri, Dec 9, 2011 at 12:37 PM, Erick Erickson erickerick...@gmail.com 
 wrote:
 Could you show us some examples of the kinds of things
 you're using regex for? I.e. the raw text and the regex you
 use to match the example?

 Sure!

 An example identifier would be IRAS-A-FPA-3-RDR-IMPS-V6.0, which
 identifies a particular Planetary Data System data set. Another
 example is ULY-J-GWE-8-NULL-RESULTS-V1.0. These kinds of strings
 frequently appear in the references section of the articles, so the
 context looks something like,

  ... rvey. IRAS-A-FPA-3-RDR-IMPS-V6.0, NASA Planetary Data System
 Tholen, D. J. 1989, in Asteroids II, ed ... 

 The simple and straightforward regex I've been using is
 /[A-Z0-9:\-]+V\d+\.\d+/. There may be a smarter regex approach but I
 haven't put my mind to it because I assumed the primary performance
 issue was elsewhere.

 The reason I ask is that perhaps there are other approaches,
 especially thinking about some clever analyzing at index time.

 For instance, perhaps NGrams are an option. Perhaps
 just making WordDelimiterFilterFactory do its tricks. Perhaps.

 WordDelimiter does help in the sense that if you search for a specific
 identifier you will usually find fairly accurate results, even for
 cases where the hyphens resulted in the term being broken up. But I'm
 not sure how WordDelimiter can help if I want to search for a pattern.

 I tried a few tweaks to the index, like putting a minimum character
 count for terms, making sure WordDelimiter's preserveOriginal is
 turned on, indexing without lowercasing so that I don't have to use
 Pattern.CASE_INSENSITIVE. Performance was not improved significantly.

 The new RegexpQuery mentioned by R. Muir looks promising, but I
 haven't built an instance of trunk yet to try it out. Any other
 suggestions appreciated.

 Thanks!
 --jay


 In other words, this could be an XY problem

 Best
 Erick

 On Thu, Dec 8, 2011 at 11:14 AM, Robert Muir rcm...@gmail.com wrote:
 On Thu, Dec 8, 2011 at 11:01 AM, Jay Luker lb...@reallywow.com wrote:
 Hi,

 I am trying to provide a means to search our corpus of nearly 2
 million fulltext astronomy and physics articles using regular
 expressions. A small percentage of our users need to be able to
 locate, for example, certain types of identifiers that are present
 within the fulltext (grant numbers, dataset identifiers, etc.).

 My straightforward attempts to do this using RegexQuery have been
 successful only in the sense that I get the results I'm looking for.
 The performance, however, is pretty terrible, with most queries taking
 five minutes or longer. Is this the performance I should expect
 considering the size of my index and the massive number of terms? Are
 there any alternative approaches I could try?

 Things I've already tried:
  * reducing the sheer number of terms by adding a LengthFilter,
 min=6, to my index analysis chain
  * swapping in the JakartaRegexpCapabilities

 Things I intend to try if no one has any better suggestions:
  * chunk up the index and search concurrently, either by sharding or
 using a RangeQuery based on document id

 Any suggestions appreciated.


 This RegexQuery is not really scalable in my opinion: it's always
 linear in the number of terms (and slow to boot), except in the
 super-rare circumstances where it can compute a common prefix.

 

Re: RegexQuery performance

2011-12-09 Thread Erick Erickson
Could you show us some examples of the kinds of things
you're using regex for? I.e. the raw text and the regex you
use to match the example?

The reason I ask is that perhaps there are other approaches,
especially thinking about some clever analyzing at index time.

For instance, perhaps NGrams are an option. Perhaps
just making WordDelimiterFilterFactory do its tricks. Perhaps.

In other words, this could be an XY problem

Best
Erick

On Thu, Dec 8, 2011 at 11:14 AM, Robert Muir rcm...@gmail.com wrote:
 On Thu, Dec 8, 2011 at 11:01 AM, Jay Luker lb...@reallywow.com wrote:
 Hi,

 I am trying to provide a means to search our corpus of nearly 2
 million fulltext astronomy and physics articles using regular
 expressions. A small percentage of our users need to be able to
 locate, for example, certain types of identifiers that are present
 within the fulltext (grant numbers, dataset identifiers, etc.).

 My straightforward attempts to do this using RegexQuery have been
 successful only in the sense that I get the results I'm looking for.
 The performance, however, is pretty terrible, with most queries taking
 five minutes or longer. Is this the performance I should expect
 considering the size of my index and the massive number of terms? Are
 there any alternative approaches I could try?

 Things I've already tried:
  * reducing the sheer number of terms by adding a LengthFilter,
 min=6, to my index analysis chain
  * swapping in the JakartaRegexpCapabilities

 Things I intend to try if no one has any better suggestions:
  * chunk up the index and search concurrently, either by sharding or
 using a RangeQuery based on document id

 Any suggestions appreciated.


 This RegexQuery is not really scalable in my opinion: it's always
 linear in the number of terms (and slow to boot), except in the
 super-rare circumstances where it can compute a common prefix.

 You can try svn trunk's RegexpQuery -- don't forget the p -- from
 lucene core (it works from the queryparser: /[ab]foo/, myfield:/bar/,
 etc.)

 The performance is faster, but keep in mind it's only as good as the
 regular expression: if the regex is something like /.*foo.*/, then
 it's just as slow as a wildcard search for *foo*.

 --
 lucidimagination.com


RegexQuery performance

2011-12-08 Thread Jay Luker
Hi,

I am trying to provide a means to search our corpus of nearly 2
million fulltext astronomy and physics articles using regular
expressions. A small percentage of our users need to be able to
locate, for example, certain types of identifiers that are present
within the fulltext (grant numbers, dataset identifiers, etc.).

My straightforward attempts to do this using RegexQuery have been
successful only in the sense that I get the results I'm looking for.
The performance, however, is pretty terrible, with most queries taking
five minutes or longer. Is this the performance I should expect
considering the size of my index and the massive number of terms? Are
there any alternative approaches I could try?

Things I've already tried:
  * reducing the sheer number of terms by adding a LengthFilter,
min=6, to my index analysis chain
  * swapping in the JakartaRegexpCapabilities
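(For reference, that LengthFilter tweak is a one-liner in the Solr analysis chain; the max value here is just an illustrative placeholder, since the factory requires one:)

```xml
<filter class="solr.LengthFilterFactory" min="6" max="512"/>
```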

Things I intend to try if no one has any better suggestions:
  * chunk up the index and search concurrently, either by sharding or
using a RangeQuery based on document id

Any suggestions appreciated.

Thanks,
--jay


Re: RegexQuery performance

2011-12-08 Thread Robert Muir
On Thu, Dec 8, 2011 at 11:01 AM, Jay Luker lb...@reallywow.com wrote:
 Hi,

 I am trying to provide a means to search our corpus of nearly 2
 million fulltext astronomy and physics articles using regular
 expressions. A small percentage of our users need to be able to
 locate, for example, certain types of identifiers that are present
 within the fulltext (grant numbers, dataset identifiers, etc.).

 My straightforward attempts to do this using RegexQuery have been
 successful only in the sense that I get the results I'm looking for.
 The performance, however, is pretty terrible, with most queries taking
 five minutes or longer. Is this the performance I should expect
 considering the size of my index and the massive number of terms? Are
 there any alternative approaches I could try?

 Things I've already tried:
  * reducing the sheer number of terms by adding a LengthFilter,
 min=6, to my index analysis chain
  * swapping in the JakartaRegexpCapabilities

 Things I intend to try if no one has any better suggestions:
  * chunk up the index and search concurrently, either by sharding or
 using a RangeQuery based on document id

 Any suggestions appreciated.


This RegexQuery is not really scalable in my opinion: it's always
linear in the number of terms (and slow to boot), except in the
super-rare circumstances where it can compute a common prefix.

You can try svn trunk's RegexpQuery -- don't forget the p -- from
lucene core (it works from the queryparser: /[ab]foo/, myfield:/bar/,
etc.)

The performance is faster, but keep in mind it's only as good as the
regular expression: if the regex is something like /.*foo.*/, then
it's just as slow as a wildcard search for *foo*.
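To make the "linear in the number of terms" point concrete, here is a toy model of a sorted term dictionary in plain Python (nothing Lucene-specific): an unanchored pattern like /.*foo.*/ has to test every term, while a pattern with a literal prefix can seek straight to that prefix and scan only the terms under it.

```python
import bisect
import re

terms = sorted(["apple", "bfoo", "foobar", "myfoo", "zfoo"])

# Unanchored pattern: every term in the dictionary must be tested.
unanchored = re.compile(r".*foo.*")
slow_hits = [t for t in terms if unanchored.fullmatch(t)]

# Pattern with a literal prefix ("foo..."): binary-search to the prefix,
# then scan only the contiguous run of terms that share it.
prefix = "foo"
lo = bisect.bisect_left(terms, prefix)
fast_hits = []
for t in terms[lo:]:
    if not t.startswith(prefix):
        break  # past the prefix range; nothing further can match
    fast_hits.append(t)

print(slow_hits)  # terms containing "foo" anywhere
print(fast_hits)  # only terms starting with "foo"
```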

-- 
lucidimagination.com