On 12/18/13 3:57 PM, Nick D. wrote:
I was wondering if there is a way to query a Lucy index using regular
expressions.

For example: The command `grep -i -P '65\d\s+Security' | grep -v -i -P
'(?:654|656|650|652)\s+Security'` will search for "65" followed by 1 digit
followed by any number of spaces followed by "Security" ignoring "654" "656"
"650" "652". So potential results are something like this:

"stuff here 651 Security and more stuff"
"stuff here 653 Security and more stuff"
"stuff here 655                        Security and more stuff"

but it will not return any of the below:

"stuff here 651 not followed by Security and more stuff"
"stuff here 653 not followed by Security and more stuff"

Another example is searching for an IP address with `(?:\d{1,3}\.){3}\d{1,3}`.

Is there any way to accomplish this with the existing API?
Are there any plans to support this?
If it is not fully supported, what is supported?
If it is not supported at all, what approach should I take to create something like
this? (Create something that converts a regex to a bunch of ORQueries, etc.?)



Hi Nick,

There is no RegexQuery class in core as you describe it.

The closest thing on CPAN is LucyX::Search::WildcardQuery, which was inspired by the PrefixQuery example in the Lucy docs, among other things.

There have been IRC discussions in years(!) past about porting the pure Perl regex code in WildcardQuery to C and making it part of core, but nobody has had the tuits for that.

The one qualifier to your examples vs WildcardQuery is that your examples assume un-tokenized field values (e.g. Lucy::Plan::StringType), which means you'd have to think carefully about how to plan out your index schema to accommodate a regex against a phrase as well as a single term. The WildcardQuery algorithm is to open each internal Lexicon and examine each term in it for matches against a regex.
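
As a rough illustration of that lexicon scan (a sketch only, not the actual WildcardQuery code): the array below is a hypothetical stand-in for the terms of an un-tokenized field, where each term is the whole field value, and matching_terms() is a helper name invented for this example. It applies the two regexes from your grep pipeline to each term.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the terms in one field's Lexicon.
# With an un-tokenized field, each term is the entire field value.
my @lexicon_terms = (
    'stuff here 651 Security and more stuff',
    'stuff here 654 Security and more stuff',
    'stuff here 655                        Security and more stuff',
    'stuff here 651 not followed by Security and more stuff',
);

# Return every term that matches $include and does not match $exclude.
sub matching_terms {
    my ( $terms, $include, $exclude ) = @_;
    return grep { $_ =~ $include && $_ !~ $exclude } @$terms;
}

my @matches = matching_terms(
    \@lexicon_terms,
    qr/65\d\s+Security/i,
    qr/(?:654|656|650|652)\s+Security/i,
);

print "$_\n" for @matches;
```

In a real index the terms would come from the Lexicon for the field rather than from an array, and the cost grows with the number of unique terms, since every one of them is tested against the regex.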

Internally, the WildcardQuery class creates an ORQuery using all the terms in the Lexicon that match the query terms, so yes, that is one way to approach this. If you're looking for examples of creating your own query classes, you might look at prior art in LucyX::Search::NullTermQuery as well as WildcardQuery, both on CPAN. I also started a project here:

https://github.com/karpet/lucyx-search-delegatequery

to make this kind of thing easier, but haven't returned to it yet to make sure it is CPAN-ready.

All that said, having created all those Query extensions myself, I recommend avoiding that approach if you can. Pure Perl Query extensions are much slower than the native C classes, and they can be awkward to develop/debug because of the unholy trinity of Query/Compiler/Matcher (much discussion about that in the lucy-dev archives).

I personally would look at a combination of LucyX::Search::ProximityQuery and query expansion instead, using Search::Query::Dialect::Lucy and Search::Query::Parser. That way you can leverage the performance of the native Lucy query classes and still get the flexibility you need for matching patterns.

Example (NOT TESTED):

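use strict;
use warnings;
use Search::Query::Parser;

# setup relevant field schema
# (get_lucy_searcher() is a placeholder for however you open your searcher)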
my $searcher  = get_lucy_searcher();
my $schema    = $searcher->get_schema();
my @fieldnames = qw(
    ipaddr
    body
);
my %fields = ();

for my $f (@fieldnames) {
    $fields{$f} = {
        type     => $schema->fetch_type($f),
        analyzer => $schema->fetch_analyzer($f),
    };
}

# create query parser
my $qp = Search::Query::Parser->new(
    dialect          => 'Lucy',
    fields           => \%fields,
    croak_on_error   => 0,          # strict mode off
    sloppy           => 1,          # forgiving parser
    fixup            => 1,          # even more forgiveness
    null_term        => 'NULL',
    query_class_opts => {
        default_field => [
            qw( body )
        ],
    },
    term_expander => sub {
        my ( $term, $field ) = @_;
        return ($term) if ref $term;    # skip ranges
        if ( $field eq 'body' ) {

           # mangle a regex into an actual query
           # e.g.
           # '(?:654|656|650|652)\s+Security'
           # the array returned gets OR'd together
           return (
               qq/"654 security"/,
               qq/"656 security"/,
               qq/"650 security"/,
               qq/"652 security"/,
           );
        }
        return ($term);
    },
);

# run it
# (single-quoting with q// keeps the backslash in \s intact)
my $query = $qp->parse( q/body:'(?:654|656|650|652)\s+Security'/ );
my $lucy_query = $query->as_lucy_query();
my $hits = $searcher->hits( query => $lucy_query );
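
The "mangle a regex into an actual query" step is the part you would have to write yourself. Here is a minimal sketch of one way to do it for the `65\d` case from the original question. The helper name expand_digit_phrases() is made up for this example, and it assumes the body field's analyzer lowercases terms and splits on whitespace, so that `\s+` collapses into an ordinary phrase boundary. It is not a general regex-to-query converter.

```perl
use strict;
use warnings;

# Hypothetical helper: expand "<prefix><one digit> <word>" into quoted
# phrase strings, skipping any numbers listed in @exclude.
sub expand_digit_phrases {
    my ( $prefix, $word, @exclude ) = @_;
    my %skip = map { $_ => 1 } @exclude;
    my @phrases;
    for my $digit ( 0 .. 9 ) {
        my $number = $prefix . $digit;
        next if $skip{$number};

        # Lowercased on the assumption that the analyzer case-folds terms.
        push @phrases, sprintf( '"%s %s"', $number, lc($word) );
    }
    return @phrases;
}

my @phrases = expand_digit_phrases( '65', 'Security', qw( 654 656 650 652 ) );
print "$_\n" for @phrases;
```

A list like this could be returned from the term_expander callback in place of the hard-coded phrases, so the exclusions are applied before the query is built rather than as a second filtering pass.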


--
Peter Karman  .  http://peknet.com/  .  [email protected]
