I'm creating indexes on multiple subfolders under one parent folder. Each
folder gets its own index because files are being created in parallel and I
want to avoid lock contention between multiple indexers writing segments to
the same index.
One of my applications creates the directory structure, with lots of log files
spread across different subfolders.
I index all those files in parallel as they are created.
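For context, each per-folder indexer is set up roughly like this (a minimal
sketch; the field name 'title' matches my search code, but $subdir and
$log_line are illustrative):

use Lucy::Plan::Schema;
use Lucy::Plan::FullTextType;
use Lucy::Analysis::EasyAnalyzer;
use Lucy::Index::Indexer;

# One schema shared by all the per-folder indexes.
my $schema   = Lucy::Plan::Schema->new;
my $analyzer = Lucy::Analysis::EasyAnalyzer->new( language => 'en' );
$schema->spec_field(
    name => 'title',
    type => Lucy::Plan::FullTextType->new( analyzer => $analyzer ),
);

# Each subfolder gets its own index under .lucyindexer/1, so parallel
# indexers never contend for the same write lock.
my $indexer = Lucy::Index::Indexer->new(
    schema => $schema,
    index  => "$subdir/.lucyindexer/1",
    create => 1,
);
$indexer->add_doc( { title => $log_line } );
$indexer->commit;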
The directory structure looks like this:
TopDir/00_log.log
      /01_log2.log
      /.lucyindexer/1/seg_1
                     /seg_2
      /03_log3.log
      /03_log3/log31.log
              /log32.log
              /.lucyindexer/1/seg_1
                             /seg_2
              /log32/log321.log
                    /log322.log
                    /.lucyindexer/1/seg_1
                                   /seg_2
                                 /2/seg_1
This works fine, and while my application is running all the log files get
indexed as well.
Search is a separate application, which does the following:
1. Scan through all the directories down to .lucyindexer/1 and build a list of
all such index folders. I use File::Find<https://metacpan.org/pod/File::Find>
for this (a sketch follows the list).
2. Create a searcher for each index with
Lucy::Search::IndexSearcher<https://metacpan.org/pod/Lucy::Search::IndexSearcher>
in a loop and add all the searchers to a
Lucy::Search::PolySearcher<https://metacpan.org/pod/Lucy::Search::PolySearcher>
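A minimal sketch of that step-1 scan, assuming the top-level directory is in
$top_dir:

use File::Find;

my @all_dirs;
find(
    sub {
        # Collect every ".lucyindexer/1" directory as an index to search.
        push @all_dirs, $File::Find::name
            if -d && $_ eq '1' && $File::Find::dir =~ /\.lucyindexer$/;
    },
    $top_dir,
);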
My code for step 2 looks like this:
my $schema;
my @searchers;
for my $index (@all_dirs) {
    chomp $index;
    my $searcher = Lucy::Search::IndexSearcher->new( index => $index );
    push @searchers, $searcher;
    $schema = $searcher->get_schema;
}

# PolySearcher is the only way I found to get all search results combined.
my $poly_searcher = Lucy::Search::PolySearcher->new(
    schema    => $schema,
    searchers => \@searchers,
);

my $query_parser = Lucy::Search::QueryParser->new(
    schema => $poly_searcher->get_schema,
    fields => ['title'],
);

# Build up a Query.
my $q     = "1 2 3 4 5 6 7 11 12 13 14 18";
my $query = $query_parser->parse($q);

# Execute the Query and get a Hits object.
my $hits = $poly_searcher->hits(
    query      => $query,
    num_wanted => -1,    # -1 is equivalent to all results
    # sort_spec => $sort_spec,
);

while ( my $hit = $hits->next ) {
    ## Do some operation
}
This runs and returns the expected results. However, performance is really bad
when the directory structure is deeply nested.
I profiled with Devel::NYTProf<https://metacpan.org/pod/Devel::NYTProf> and
found the two places where most of the time is spent:
1. Scanning the directory tree. (I plan to address this by generating the list
of index directories while the application is creating the indexes; see the
manifest sketch after this list.)
2. Creating the searchers with Lucy::Search::IndexSearcher. This takes the
most time when run in a loop over all the indexed directories.
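A sketch of that manifest idea for item #1, assuming the indexing application
can append to a shared file (the manifest name is made up):

# Indexer side: record each new index directory as soon as it is created.
open my $manifest, '>>', "$top_dir/index_manifest.txt" or die $!;
print {$manifest} "$subdir/.lucyindexer/1\n";
close $manifest;

# Searcher side: read the list instead of walking the whole tree.
open my $in, '<', "$top_dir/index_manifest.txt" or die $!;
chomp( my @all_dirs = <$in> );
close $in;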
To address item #2, I tried to create the Lucy::Search::IndexSearcher objects
for the different index folders in parallel using
Parallel::ForkManager<https://metacpan.org/pod/Parallel::ForkManager>, but I
got the following error:
The storable module was unable to store the child's data structure to the temp
file "/tmp/Parallel-ForkManager-27339-27366.txt": Storable serialization not
implemented for Lucy::Search::IndexSearcher at
/usr/software/lib/perl5/site_perl/5.14.0/x86_64-linux-thread-multi/Clownfish.pm
line 93
This was the code I used:
use Data::Dumper;
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new($max_procs);
$pm->run_on_finish(
    sub {
        my ( $pid, $exit_code, $ident, $exit_signal, $core_dump, $searcher_ref ) = @_;
        print Dumper $searcher_ref;
        push @searchers, $$searcher_ref;
    }
);
for my $index (@all_dirs) {
    chomp $index;
    $pm->start($index) and next;    # fork; parent continues the loop
    my $searcher = Lucy::Search::IndexSearcher->new( index => $index );
    $pm->finish( 0, \$searcher );   # serializing this is what fails
}
$pm->wait_all_children;
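As I understand it, the error occurs because Parallel::ForkManager hands the
child's return value to Storable, and an IndexSearcher wraps a C-level
Clownfish object that Storable cannot freeze; only plain Perl data can cross
back to the parent. One possible restructuring under that constraint (an
untested sketch that gives up PolySearcher's combined scoring; the num_wanted
value and returned fields are assumptions) would be to run the search itself
in each child and return plain hashes:

use Parallel::ForkManager;

my @results;
my $pm = Parallel::ForkManager->new($max_procs);
$pm->run_on_finish(
    sub {
        my ( $pid, $exit_code, $ident, $exit_signal, $core_dump, $hits_ref ) = @_;
        push @results, @{ $hits_ref || [] };
    }
);
for my $index (@all_dirs) {
    chomp $index;
    $pm->start and next;    # child
    my $searcher = Lucy::Search::IndexSearcher->new( index => $index );
    my $query    = Lucy::Search::QueryParser->new(
        schema => $searcher->get_schema,
        fields => ['title'],
    )->parse($q);
    my $hits = $searcher->hits( query => $query, num_wanted => 1000 );
    my @plain;
    while ( my $hit = $hits->next ) {
        # Plain hashes serialize fine, unlike Lucy objects.
        push @plain, { title => $hit->{title}, score => $hit->get_score };
    }
    $pm->finish( 0, \@plain );
}
$pm->wait_all_children;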
Altogether, the original process takes 60-120 seconds for a large log
directory. At the end I build a nested JSON object from all the search results
for display with jQuery.
I'm looking for ideas to improve the performance. Is there a way to create
multiple searchers using Parallel::ForkManager or some other method? Or any
other way to speed up the search?
Also, is there any way I can merge all the indexes into one place?
Thanks,
Rajiv Gupta