Hi Jay, thanks, great idea. In the next few days we'll try to implement something along the lines you described.
best, Rode.

---
Rode González
Libnova, SL
Paseo de la Castellana, 153 - Madrid
[t] 91 449 08 94  [f] 91 141 21 21
www.libnova.es

> -----Original Message-----
> From: Jaeger, Jay - DOT [mailto:jay.jae...@dot.wi.gov]
> Sent: Monday, August 15, 2011 14:54
> To: solr-user@lucene.apache.org
> Subject: RE: ideas for indexing large amount of pdf docs
>
> Note on i: Solr replication provides pretty good clustering support
> out of the box, including replication of multiple cores. Read the wiki
> page on replication (Google +solr +replication if you don't know where
> it is; a minimal config sketch also appears after the script below).
>
> In my experience, the expensive part of indexing PDFs is parsing the
> documents, which takes a lot of CPU on the client side, not on the
> Solr server side. So make sure you do that parsing on the client and
> not on the server (see the extraction sketch below).
>
> Avoiding iii:
>
> Suggest that you write yourself a multi-threaded performance test so
> that you aren't guessing what your performance will be.
>
> We wrote one in Perl. It handles an individual thread (we were testing
> inquiry), and we wrote a little batch file / shell script to start up
> the desired number of threads.
>
> The main statement in our batch file (the rest just sets the
> variables); a shell script would be even easier (a sketch of one
> follows the Perl script below):
>
> for /L %%i in (1,1,%THREADS%) DO start /B perl solrtest.pl -h
>     %SOLRHOST% -c %COUNT% -u %1 -p %2 -r %SOLRREALM%
>     -f %SOLRLOC%\firstsynonyms.txt -l %SOLRLOC%\lastsynonyms.txt
>     -z %FUZZ%
>
> The Perl script:
>
> #!/usr/bin/perl
>
> #
> # Perl program to run a thread of solr testing
> #
>
> use Getopt::Std;            # For options processing
> use POSIX;                  # For time formatting
> use XML::Simple;            # For processing of XML config file
> use Data::Dumper;           # For debugging XML config file
> use HTTP::Request::Common;  # For HTTP request to Solr
> use HTTP::Response;
> use LWP::UserAgent;         # For HTTP request to Solr
>
> $host = "YOURHOST:8983";
> $realm = "YOUR AUTHENTICATION REALM";
> $firstlist = "firstsynonyms.txt";
> $lastlist = "lastsynonyms.txt";
> $fuzzy = "";
>
> $me = $0;
>
> sub usage() {
>     print "perl $me -c iterations [-d] [-h host:port] [-u user [-p password]]\n";
>     print "\t\t[-f firstnamefile] [-l lastnamefile] [-z fuzzy] [-r realm]\n";
>     exit(8);
> }
>
> #
> # Process the command line options, and open the output file.
> #
>
> getopts('dc:u:p:f:l:h:r:z:') || usage();
>
> if(!$opt_c) {
>     usage();
> }
>
> $count = $opt_c;
>
> if($opt_u) {
>     $user = $opt_u;
> }
>
> if($opt_p) {
>     $password = $opt_p;
> }
>
> if($opt_h) {
>     $host = $opt_h;
> }
>
> if($opt_f) {
>     $firstlist = $opt_f;
> }
>
> if($opt_l) {
>     $lastlist = $opt_l;
> }
>
> if($opt_r) {
>     $realm = $opt_r;
> }
>
> if($opt_z) {
>     $fuzzy = "~" . $opt_z;
> }
>
> $debug = $opt_d;
>
> #
> # If the host string does not include a :, add :80
> #
>
> if($host !~ /:/) {
>     $host = $host . ":80";
> }
>
> #
> # Read the lists of first and last names
> #
>
> open(SYNFILE,"<$firstlist") || die "Can't open first name list $firstlist\n";
> while(<SYNFILE>) {
>     @newwords = split /,/;
>     for($i=0; $i <= $#newwords; ++$i) {
>         $newwords[$i] =~ s/^\s+//;
>         $newwords[$i] =~ s/\s+$//;
>         $newwords[$i] = lc($newwords[$i]);
>     }
>     push @firstnames, @newwords;
> }
> close(SYNFILE);
>
> open(SYNFILE,"<$lastlist") || die "Can't open last name list $lastlist\n";
> while(<SYNFILE>) {
>     @newwords = split /,/;
>     for($i=0; $i <= $#newwords; ++$i) {
>         $newwords[$i] =~ s/^\s+//;
>         $newwords[$i] =~ s/\s+$//;
>         $newwords[$i] = lc($newwords[$i]);
>     }
>     push @lastnames, @newwords;
> }
> close(SYNFILE);
>
> print "$#firstnames First Names, $#lastnames Last Names\n";
> print "User: $user\n";
>
> my $userAgent = LWP::UserAgent->new(agent => 'solrtest.pl');
> $userAgent->credentials("$host",$realm,$user,$password);
>
> $uri = "http://$host/solr/select";
>
> $starttime = time();
>
> for($c=0; $c < $count; ++$c) {
>     $fname = $firstnames[rand $#firstnames];
>     $lname = $lastnames[rand $#lastnames];
>     $response = $userAgent->request(
>         POST $uri,
>         [
>             q    => "lnamesyn:$lname AND fnamesyn:$fname$fuzzy",
>             rows => "25"
>         ]);
>
>     if($debug) {
>         print "Query: lnamesyn:$lname AND fnamesyn:$fname$fuzzy";
>         print $response->content();
>     }
>     print "POST for $fname $lname completed, HTTP status=" .
>         $response->code . "\n";
> }
>
> $elapsed = time() - $starttime;
> $average = $elapsed / $count;
>
> print "Time: $elapsed s ($average/request)\n";
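Since a shell version of the launcher is mentioned above, here is a minimal POSIX sh sketch of the same loop. It assumes the same environment variables as the batch file and takes user/password as $1/$2; untested, adjust paths as needed:

    #!/bin/sh
    # Sketch of a shell equivalent of the batch launcher above. Assumes
    # THREADS, SOLRHOST, COUNT, SOLRREALM, SOLRLOC and FUZZ are already
    # set, and that user/password arrive as $1/$2 like the batch version.
    i=1
    while [ "$i" -le "$THREADS" ]; do
        perl solrtest.pl -h "$SOLRHOST" -c "$COUNT" -u "$1" -p "$2" \
             -r "$SOLRREALM" -f "$SOLRLOC/firstsynonyms.txt" \
             -l "$SOLRLOC/lastsynonyms.txt" -z "$FUZZ" &
        i=$((i + 1))
    done
    wait   # block until every test thread has finished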
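On keeping the PDF parsing off the Solr server: one way is to run Apache Tika on the client and post only the extracted text. A rough Perl sketch under stated assumptions: tika-app.jar in the working directory, Solr on localhost, and a schema with "id" and "text" fields (all assumptions, not from the thread):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTTP::Request::Common;   # POST helper
    use LWP::UserAgent;

    # Extract text client-side with the Tika CLI so the Solr server never
    # parses the PDF itself. ASSUMES: tika-app.jar in the current
    # directory, Solr at localhost:8983, "id"/"text" fields in the schema.
    my $pdf  = shift or die "usage: $0 file.pdf\n";
    my $text = `java -jar tika-app.jar --text "$pdf"`;  # the CPU-heavy step
    die "Tika extraction failed for $pdf\n" if $?;

    # Minimal XML escaping for the extracted body.
    for ($text) { s/&/&amp;/g; s/</&lt;/g; s/>/&gt;/g; }

    my $ua  = LWP::UserAgent->new(agent => 'pdfpost.pl');
    my $doc = qq{<add><doc><field name="id">$pdf</field>}
            . qq{<field name="text">$text</field></doc></add>};

    my $res = $ua->request(POST 'http://localhost:8983/solr/update',
                           Content_Type => 'text/xml',
                           Content      => $doc);
    print $res->status_line, "\n";
    # Send <commit/> (or rely on autoCommit) before the text is searchable.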
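For reference on the replication note above, master/slave replication in Solr 3.x is configured per core in solrconfig.xml. A minimal sketch; the host name, core name, and poll interval are placeholders, and the wiki page has the authoritative details:

    <!-- master core's solrconfig.xml -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">commit</str>
        <str name="confFiles">schema.xml,stopwords.txt</str>
      </lst>
    </requestHandler>

    <!-- each slave core's solrconfig.xml -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://master-host:8983/solr/corename/replication</str>
        <str name="pollInterval">00:00:60</str>
      </lst>
    </requestHandler>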
":80"; > } > > # > # Read the lists of first and last names > # > > open(SYNFILE,"<$firstlist") || die "Can't open first name list > $firstlist\n"; > while(<SYNFILE>) { > @newwords = split /,/; > for($i=0; $i <= $#newwords; ++$i) { > $newwords[$i] =~ s/^\s+//; > $newwords[$i] =~ s/\s+$//; > $newwords[$i] = lc($newwords[$i]); > } > push @firstnames, @newwords; > } > close(SYNFILE); > > open(SYNFILE,"<$lastlist") || die "Can't open last name list > $lastlist\n"; > while(<SYNFILE>) { > @newwords = split /,/; > for($i=0; $i <= $#newwords; ++$i) { > $newwords[$i] =~ s/^\s+//; > $newwords[$i] =~ s/\s+$//; > $newwords[$i] = lc($newwords[$i]); > } > push @lastnames, @newwords; > } > close(SYNFILE); > > > print "$#firstnames First Names, $#lastnames Last Names\n"; > print "User: $user\n"; > > > my $userAgent = LWP::UserAgent->new(agent => 'solrtest.pl'); > $userAgent->credentials("$host",$realm,$user,$password); > > $uri = "http://$host/solr/select"; > > $starttime = time(); > > for($c=0; $c < $count; ++$c) { > $fname = $firstnames[rand $#firstnames]; > $lname = $lastnames[rand $#lastnames]; > $response = $userAgent->request( > POST $uri, > [ > q => "lnamesyn:$lname AND fnamesyn:$fname$fuzzy", > rows => "25" > ]); > > if($debug) { > print "Query: lnamesyn:$lname AND fnamesyn:$fname$fuzzy"; > print $response->content(); > } > print "POST for $fname $lname completed, HTTP status=" . > $response->code . "\n"; > } > > $elapsed = time() - $starttime; > $average = $elapsed / $count; > > print "Time: $elapsed s ($average/request)\n"; > > > -----Original Message----- > From: Rode Gonzalez (libnova) [mailto:r...@libnova.es] > Sent: Saturday, August 13, 2011 3:50 AM > To: solr-user@lucene.apache.org > Subject: ideas for indexing large amount of pdf docs > > Hi all, > > I want to ask about the best way to implement a solution for indexing a > large amount of pdf documents between 10-60 MB each one. 100 to 1000 > users > connected simultaneously. > > I actually have 1 core of solr 3.3.0 and it works fine for a few number > of > pdf docs but I'm afraid about the moment when we enter in production > time. > > some possibilities: > > i. clustering. I have no experience in this, so it will be a bad idea > to > venture into this. > > ii. multicore solution. make some kind of hash to choose one core at > each > query (exact queries) and thus reduce the size of the individual > indexes to > consult or to consult all the cores at same time (complex queries). > > iii. do nothing more and wait for the catastrophe in the response times > :P > > > Someone with experience can help a bit to decide? > > Thanks a lot in advance. > > ----- > No se encontraron virus en este mensaje. > Comprobado por AVG - www.avg.com > Versión: 10.0.1392 / Base de datos de virus: 1520/3836 - Fecha de > publicación: 08/15/11 ----- No se encontraron virus en este mensaje. Comprobado por AVG - www.avg.com Versión: 10.0.1392 / Base de datos de virus: 1520/3836 - Fecha de publicación: 08/15/11