[lucy-user] 32 bit CentOS Indexing Question

Nick D. Tue, 28 Jan 2014 11:28:01 -0800

Hi all,

I am having trouble indexing large files. The file I'm indexing is a
syslog-formatted file that is pretty large, around 4.4 GB. During indexing
I only add docs to the index, one doc per line of the log file, and commit
once at the very end. While this runs, the index grows to a relatively
enormous size (around 14 GB), and (I'm guessing) during the commit it uses
huge amounts of RAM, slowing the computer down to a crawl. On a 64-bit
system, the index shrinks to 4.1 GB once the commit is done. On a 32-bit
system, I get a malloc error saying it can't allocate more space. Both boxes
have the same amount of RAM and run the same OS; the only difference is that
one is 32-bit and the other is 64-bit.
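
For scale, here is a rough sketch of the arithmetic behind the 32-bit question. A single 32-bit process can address at most 2**32 bytes, and on Linux part of that range is usually reserved for the kernel. The sketch assumes the sizes quoted above are decimal gigabytes, and it says nothing about how Lucy actually allocates or maps memory; whether this limit is what triggers the malloc error is still the open question.

```perl
use strict;
use warnings;

# Upper bound on what one 32-bit process can address: 2**32 bytes.
my $address_space = 2**32;

# Sizes quoted in the post, read as decimal gigabytes (an assumption).
my %sizes = (
    'syslog input'        => 4.4e9,
    'index before commit' => 14e9,
);

my %exceeds;
for my $name ( sort keys %sizes ) {
    $exceeds{$name} = $sizes{$name} > $address_space ? 1 : 0;
    printf "%-20s %5.1f GB  %s\n", $name, $sizes{$name} / 1e9,
        $exceeds{$name}
        ? 'larger than 2**32 bytes'
        : 'fits in 2**32 bytes';
}
```

Both figures are larger than the whole 32-bit address space, so anything that needs either one resident or mapped at once cannot succeed on the 32-bit box.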


Questions:

Why do I get a memory allocation error on a 32 bit OS and not a 64 bit OS?
Are there any 32 bit limitations of Lucy?
Why does the index grow so large and then shrink after the commit is done?
Should I commit more often?
Would committing often slow down the indexing process?
Would committing often make the over growth of the index go away?
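
For concreteness, committing more often could look like the sketch below. It is not the code from this post: `index_in_batches`, `$batch_size`, and the `$make_indexer` factory are hypothetical names, and the doc fields are simplified to `line` and `offset`. It also assumes, as I understand the Lucy docs, that an Indexer cannot be reused after `commit`, so each batch gets a fresh one. Whether this actually reduces peak memory or the temporary index growth is exactly what the questions above ask.

```perl
use strict;
use warnings;

# $make_indexer is a hypothetical factory supplied by the caller, e.g.
#   sub { Lucy::Index::Indexer->new(
#             index => $path, schema => $schema, create => 1 ) }
sub index_in_batches {
    my ( $fh, $batch_size, $make_indexer ) = @_;
    my $indexer;
    my $pending = 0;    # docs added since the last commit
    my $commits = 0;    # number of commits performed
    my $offset  = tell($fh);

    while ( my $line = <$fh> ) {
        # Lazily create an indexer for the current batch.
        $indexer ||= $make_indexer->();
        $indexer->add_doc( { line => $line, offset => $offset } );
        $offset = tell($fh);

        # Commit and drop the indexer once the batch is full.
        if ( ++$pending >= $batch_size ) {
            $indexer->commit;
            undef $indexer;
            $pending = 0;
            $commits++;
        }
    }

    # Commit whatever is left in a final, partial batch.
    if ($indexer) {
        $indexer->commit;
        $commits++;
    }
    return $commits;
}
```

Passing the indexer in as a factory keeps the batching logic separate from how the index is opened, so the batch size can be tuned without touching the schema code.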

Any help would be greatly appreciated,

Nick D.


Code Snippet:
# Create Schema.
my $schema       = Lucy::Plan::Schema->new;
my $case_folder  = Lucy::Analysis::CaseFolder->new;
# Purposely leave out the stemmer.
my $tokenizer    = Lucy::Analysis::RegexTokenizer->new;
my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new(
    analyzers => [ $case_folder, $tokenizer ],
);
my $unstored_full_text_type = Lucy::Plan::FullTextType->new(
    analyzer => $polyanalyzer,
    stored   => 0,
);
my $unindexed_int_type = Lucy::Plan::Int64Type->new(
    indexed  => 0,
    sortable => 1,
);
my $unindexed_string_type = Lucy::Plan::StringType->new(
    indexed  => 0,
    sortable => 1,
);

$schema->spec_field( name => 'line',     type => $unstored_full_text_type );
$schema->spec_field( name => 'offset',   type => $unindexed_int_type );
$schema->spec_field( name => 'time_sec', type => $unindexed_string_type );

.........................

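open( my $fh, '<', $filename ) or die "Can't open '$filename': $!";
my $offset = 0;    # byte offset of the start of the current line
my $time   = 0;    # seconds since midnight, from the line's timestamp
while ( my $line = <$fh> ) {

    # Only update $time when the line has a syslog timestamp; after a
    # failed match $1..$3 are not reliable, so keep the previous value.
    if ( $line =~ /^\w+\s+\d+\s+(\d+):(\d+):(\d+)/ ) {
        $time = ( $1 * 60 * 60 ) + ( $2 * 60 ) + $3;
    }

    my %doc = (
        line     => $line,
        offset   => $offset,
        time_sec => sprintf( "%05d", $time ),
    );

    #print Dumper(\%doc);
    $indexer->add_doc( \%doc );    # ta-da!
    $offset = tell($fh);
}

$indexer->commit;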
------------------------------ end of snippet ------------------------------

Example format of file to be indexed

Mar 12 12:27:00 server3 named[32172]: lame server resolving 'jakarta5.wasantara.net.id' (in 'wasantara.net.id'?): 202.159.65.171#53
Mar 12 12:27:03 server3 named[32173]: lame server resolving 'jakarta5.wasantara.net.id' (in 'wasantara.net.id'?): 202.159.65.171#
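
As a worked example of the `time_sec` field, the sketch below runs the snippet's timestamp regex on the start of the first sample line. The helper name `time_sec_for_line` is hypothetical and not part of the original script. The value is zero-padded to five characters because the field is a sortable StringType, so fixed-width strings are needed for lexical order to match numeric order.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): returns seconds since
# midnight as a zero-padded 5-character string, or nothing if the line
# has no "Mon DD HH:MM:SS" prefix.
sub time_sec_for_line {
    my ($line) = @_;
    return unless $line =~ /^\w+\s+\d+\s+(\d+):(\d+):(\d+)/;
    return sprintf( "%05d", $1 * 3600 + $2 * 60 + $3 );
}

my $sample = "Mar 12 12:27:00 server3 named[32172]: lame server resolving";
my $result = time_sec_for_line($sample);
print "$result\n";    # 12*3600 + 27*60 + 0 = 44820
```

A timestamp early in the day, such as 00:00:05, comes out as "00005", which sorts before "44820" as a string.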



