Re: [lucy-user] Lucy::Search::RegexQuery ????

Peter Karman Wed, 18 Dec 2013 19:58:05 -0800

On 12/18/13 3:57 PM, Nick D. wrote:

I was wondering if there is a way to query a Lucy index using regular
expressions.


For example: The command `grep -i -P '65\d\s+Security' | grep -v -i -P
'(?:654|656|650|652)\s+Security'` will search for "65" followed by 1 digit
followed by any number of spaces followed by "Security" ignoring "654" "656"
"650" "652". So potential results are something like this:

"stuff here 651 Security and more stuff"
"stuff here 653 Security and more stuff"
"stuff here 655                        Security and more stuff"

but it will not return any of the below:

"stuff here 651 not followed by Security and more stuff"
"stuff here 653 not followed by Security and more stuff"

Another example is searching for an IP address with `(?:\d{1,3}\.){3}\d{1,3}`

Is there any way to accomplish this with the existing API?
Are there any plans to support this?
If not fully supported, what is supported?
If not supported at all, what approach should I take to create something like
this? (create something that converts regex to a bunch of ORQueries etc?)



Hi Nick,

There is no RegexQuery class in core as you describe it.

The closest thing on CPAN is LucyX::Search::WildcardQuery, which was inspired by the PrefixQuery example in the Lucy docs, among other things.

There have been IRC discussions in years(!) past about porting the pure-Perl regex code in WildcardQuery to C and making it part of core, but nobody has had the tuits for that.

The one qualifier to your examples vs WildcardQuery is that your examples assume un-tokenized field values (e.g. Lucy::Plan::StringType), which means you'd have to think carefully about how to plan out your index schema to accommodate a regex against a phrase as well as a single term. The WildcardQuery algorithm is to open each internal Lexicon and examine each term in it for matches against a regex.

Internally, the WildcardQuery class creates an ORQuery using all the terms in the Lexicon that match the query terms, so yes, that is one way to approach this. If you're looking for examples of creating your own query classes, you might look at prior art in LucyX::Search::NullTermQuery as well as WildcardQuery, both on CPAN. I also started a project here:


https://github.com/karpet/lucyx-search-delegatequery

to make this kind of thing easier, but haven't returned to it yet to make sure it is CPAN-ready.
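
For what it's worth, the Lexicon-scan-into-ORQuery idea described above boils down to something like the following. This is a rough sketch and NOT TESTED: regex_to_orquery() is a made-up name, it is not the actual WildcardQuery code, and it assumes $searcher is a Lucy::Search::IndexSearcher and that the field is un-tokenized (so each Lexicon term is a whole field value).

```perl
use strict;
use warnings;
use Lucy;

# Hypothetical helper, NOT TESTED: collect every term in $field that
# matches $regex and OR the matching terms together.
sub regex_to_orquery {
    my ( $searcher, $field, $regex ) = @_;
    my %seen;

    # each segment of the index has its own Lexicon
    for my $seg_reader ( @{ $searcher->get_reader->seg_readers } ) {
        my $lex_reader = $seg_reader->obtain('Lucy::Index::LexiconReader');
        my $lexicon    = $lex_reader->lexicon( field => $field );
        next unless $lexicon;    # field has no terms in this segment

        # walk every term in the field, keeping the ones that match
        while ( $lexicon->next ) {
            my $term = $lexicon->get_term;
            $seen{$term} = 1 if $term =~ $regex;
        }
    }

    # one TermQuery per distinct matching term, OR'd together
    my @children;
    for my $term ( sort keys %seen ) {
        push @children,
            Lucy::Search::TermQuery->new( field => $field, term => $term );
    }
    return Lucy::Search::ORQuery->new( children => \@children );
}

# usage, assuming an 'ipaddr' field of type Lucy::Plan::StringType:
# my $q    = regex_to_orquery( $searcher, 'ipaddr', qr/^(?:\d{1,3}\.){3}\d{1,3}$/ );
# my $hits = $searcher->hits( query => $q );
```

Note that this touches every term in the field on every search, which is a big part of why the approach gets slow on large indexes.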

All that said, having created all those Query extensions myself, I recommend avoiding that approach if you can. Pure Perl Query extensions are much slower than the native C classes, and they can be awkward to develop/debug because of the unholy trinity of Query/Compiler/Matcher (much discussion about that in the lucy-dev archives).

I personally would look at a combination of LucyX::Search::ProximityQuery and query expansion instead, using Search::Query::Dialect::Lucy and Search::Query::Parser. That way you can leverage the performance of the native Lucy query classes and still get the flexibility you need for matching patterns.


Example (NOT TESTED):

# setup relevant field schema
my $searcher  = get_lucy_searcher();
my $schema    = $searcher->get_schema();
my @fieldnames = qw(
    ipaddr
    body
);
my %fields = ();

for my $f (@fieldnames) {
    $fields{$f} = {
        type     => $schema->fetch_type($f),
        analyzer => $schema->fetch_analyzer($f),
    };
}

# create query parser
my $qp = Search::Query::Parser->new(
    dialect          => 'Lucy',
    fields           => \%fields,
    croak_on_error   => 0,          # strict mode off
    sloppy           => 1,          # forgiving parser
    fixup            => 1,          # even more forgiveness
    null_term        => 'NULL',
    query_class_opts => {
        default_field => [
            qw( body )
        ],
    },
    term_expander => sub {
        my ( $term, $field ) = @_;
        return ($term) if ref $term;    # skip ranges
        if ( $field eq 'body' ) {

            # mangle a regex into an actual query
            # e.g.
            # '(?:654|656|650|652)\s+Security'
            # the array returned gets OR'd together
            return (
                qq/"654 security"/,
                qq/"656 security"/,
                qq/"650 security"/,
                qq/"652 security"/,
            );
        }
        return ($term);
    },
);

# run it
# q{} rather than qq// so the backslash in \s+ is not eaten by interpolation
my $query = $qp->parse( q{body:'(?:654|656|650|652)\s+Security'} );
my $lucy_query = $query->as_lucy_query();
my $hits = $searcher->hits( query => $lucy_query );
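
The hard-coded phrase list in term_expander could also be generated rather than typed out. A minimal sketch in plain Perl, using the original `65\d\s+Security` example minus the four excluded numbers: expand_phrases() is a hypothetical helper, and it assumes the body field is tokenized and case-folded so that a quoted phrase stands in for `\s+`.

```perl
use strict;
use warnings;

# Hypothetical helper: build quoted phrases for "<prefix><digit> <word>",
# one per digit 0..9, skipping any number listed in @exclude.
sub expand_phrases {
    my ( $prefix, $word, @exclude ) = @_;
    my %skip = map { ( $_ => 1 ) } @exclude;
    my @phrases;
    for my $digit ( 0 .. 9 ) {
        my $number = $prefix . $digit;
        next if $skip{$number};
        push @phrases, qq/"$number $word"/;
    }
    return @phrases;
}

my @phrases = expand_phrases( '65', 'security', qw( 654 656 650 652 ) );
print join( ' OR ', @phrases ), "\n";
```

The list returned by expand_phrases() is what term_expander would hand back, so the parser ORs the phrases together and the search still runs on the native query classes.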


--
Peter Karman  .  http://peknet.com/  .  [email protected]
