I'm creating indexes on multiple subfolders under one parent folder. Each
folder gets its own index because files are being created in parallel and I
want to avoid lock contention between multiple indexers writing segments to
the same index.
One of my applications creates the directory structure, with lots of log files
spread across different subfolders.
I index all those files in parallel as they are created.
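For context, each per-folder indexer is set up roughly like this (a minimal
sketch; the field name 'title' matches my search code, but $subdir and
$log_line are illustrative):

use Lucy::Plan::Schema;
use Lucy::Plan::FullTextType;
use Lucy::Analysis::EasyAnalyzer;
use Lucy::Index::Indexer;

# One schema shared by all the per-folder indexes.
my $schema   = Lucy::Plan::Schema->new;
my $analyzer = Lucy::Analysis::EasyAnalyzer->new( language => 'en' );
$schema->spec_field(
    name => 'title',
    type => Lucy::Plan::FullTextType->new( analyzer => $analyzer ),
);

# Each subfolder gets its own index under .lucyindexer/1, so parallel
# indexers never contend for the same write lock.
my $indexer = Lucy::Index::Indexer->new(
    schema => $schema,
    index  => "$subdir/.lucyindexer/1",
    create => 1,
);
$indexer->add_doc( { title => $log_line } );
$indexer->commit;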
The directory structure looks like this:
TopDir/00_log.log
      /01_log2.log
      /.lucyindexer/1/seg_1
                     /seg_2
      /03_log3.log
      /03_log3/log31.log
              /log32.log
              /.lucyindexer/1/seg_1
                             /seg_2
              /log32/log321.log
                    /log322.log
                    /.lucyindexer/1/seg_1
                                   /seg_2
                                 /2/seg_1
This works fine, and while my application is running all the log files get
indexed as well.
Search is a separate application, which does the following:
1. Scan through all the directories down to .lucyindexer/1 and build a list of
all such index folders. I use File::Find<https://metacpan.org/pod/File::Find>
for this (a sketch follows the list).
2. Create a searcher for each index with
Lucy::Search::IndexSearcher<https://metacpan.org/pod/Lucy::Search::IndexSearcher>
in a loop and add all the searchers to a
Lucy::Search::PolySearcher<https://metacpan.org/pod/Lucy::Search::PolySearcher>
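A minimal sketch of that step-1 scan, assuming the top-level directory is in
$top_dir:

use File::Find;

my @all_dirs;
find(
    sub {
        # Collect every ".lucyindexer/1" directory as an index to search.
        push @all_dirs, $File::Find::name
            if -d && $_ eq '1' && $File::Find::dir =~ /\.lucyindexer$/;
    },
    $top_dir,
);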
My code for step 2 looks like this:
my $schema;
my @searchers;
for my $index (@all_dirs) {
    chomp $index;
    my $searcher = Lucy::Search::IndexSearcher->new( index => $index );
    push @searchers, $searcher;
    $schema = $searcher->get_schema;
}

# PolySearcher is the only way I found to get all search results combined.
my $poly_searcher = Lucy::Search::PolySearcher->new(
    schema    => $schema,
    searchers => \@searchers,
);

my $query_parser = Lucy::Search::QueryParser->new(
    schema => $poly_searcher->get_schema,
    fields => ['title'],
);

# Build up a Query.
my $q     = "1 2 3 4 5 6 7 11 12 13 14 18";
my $query = $query_parser->parse($q);

# Execute the Query and get a Hits object.
my $hits = $poly_searcher->hits(
    query      => $query,
    num_wanted => -1,    # -1 is equivalent to all results
    # sort_spec => $sort_spec,
);

while ( my $hit = $hits->next ) {
    ## Do some operation
}
This runs and returns the expected results. However, performance is really bad
when the directory structure is deeply nested.
I profiled with Devel::NYTProf<https://metacpan.org/pod/Devel::NYTProf> and
found the two places where most of the time is spent:
1. Scanning the directory tree. (I plan to address this by generating the list
of index directories while the application is creating the indexes; see the
manifest sketch after this list.)
2. Creating the searchers with Lucy::Search::IndexSearcher. This takes the
most time when run in a loop over all the indexed directories.
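A sketch of that manifest idea for item #1, assuming the indexing application
can append to a shared file (the manifest name is made up):

# Indexer side: record each new index directory as soon as it is created.
open my $manifest, '>>', "$top_dir/index_manifest.txt" or die $!;
print {$manifest} "$subdir/.lucyindexer/1\n";
close $manifest;

# Searcher side: read the list instead of walking the whole tree.
open my $in, '<', "$top_dir/index_manifest.txt" or die $!;
chomp( my @all_dirs = <$in> );
close $in;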
To address item #2, I tried to create the Lucy::Search::IndexSearcher objects
for the different index folders in parallel using
Parallel::ForkManager<https://metacpan.org/pod/Parallel::ForkManager>, but I
got the following error:
The storable module was unable to store the child's data structure to the temp
file "/tmp/Parallel-ForkManager-27339-27366.txt": Storable serialization not
implemented for Lucy::Search::IndexSearcher at
/usr/software/lib/perl5/site_perl/5.14.0/x86_64-linux-thread-multi/Clownfish.pm
line 93
This was the code I used:
use Data::Dumper;
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new($max_procs);
$pm->run_on_finish(
    sub {
        my ( $pid, $exit_code, $ident, $exit_signal, $core_dump, $searcher_ref ) = @_;
        print Dumper $searcher_ref;
        push @searchers, $$searcher_ref;
    }
);
for my $index (@all_dirs) {
    chomp $index;
    $pm->start($index) and next;    # fork; parent continues the loop
    my $searcher = Lucy::Search::IndexSearcher->new( index => $index );
    $pm->finish( 0, \$searcher );   # serializing this is what fails
}
$pm->wait_all_children;
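As I understand it, the error occurs because Parallel::ForkManager hands the
child's return value to Storable, and an IndexSearcher wraps a C-level
Clownfish object that Storable cannot freeze; only plain Perl data can cross
back to the parent. One possible restructuring under that constraint (an
untested sketch that gives up PolySearcher's combined scoring; the num_wanted
value and returned fields are assumptions) would be to run the search itself
in each child and return plain hashes:

use Parallel::ForkManager;

my @results;
my $pm = Parallel::ForkManager->new($max_procs);
$pm->run_on_finish(
    sub {
        my ( $pid, $exit_code, $ident, $exit_signal, $core_dump, $hits_ref ) = @_;
        push @results, @{ $hits_ref || [] };
    }
);
for my $index (@all_dirs) {
    chomp $index;
    $pm->start and next;    # child
    my $searcher = Lucy::Search::IndexSearcher->new( index => $index );
    my $query    = Lucy::Search::QueryParser->new(
        schema => $searcher->get_schema,
        fields => ['title'],
    )->parse($q);
    my $hits = $searcher->hits( query => $query, num_wanted => 1000 );
    my @plain;
    while ( my $hit = $hits->next ) {
        # Plain hashes serialize fine, unlike Lucy objects.
        push @plain, { title => $hit->{title}, score => $hit->get_score };
    }
    $pm->finish( 0, \@plain );
}
$pm->wait_all_children;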
Altogether, the original process takes 60-120 seconds for a large log
directory. At the end I build a nested JSON object from all the search results
for display with jQuery.
I'm looking for ideas to improve the performance. Is there a way to create
multiple searchers using Parallel::ForkManager or some other method? Or any
other way to speed up the search?
Also, is there any way I can merge all the indexes into one place?
Thanks,
Rajiv Gupta