Hi Jay, thanks, great idea. In the next few days we'll try to implement something along the lines you described.
best, Rode.

---
Rode González
Libnova, SL
Paseo de la Castellana, 153 - Madrid
[t] 91 449 08 94  [f] 91 141 21 21
www.libnova.es

> -----Original Message-----
> From: Jaeger, Jay - DOT [mailto:jay.jae...@dot.wi.gov]
> Sent: Monday, August 15, 2011 14:54
> To: solr-user@lucene.apache.org
> Subject: RE: ideas for indexing large amount of pdf docs
>
> Note on i: Solr replication provides pretty good clustering support
> out of the box, including replication of multiple cores. Read the wiki
> page on replication (Google +solr +replication if you don't know where
> it is; a minimal config sketch also appears after the script below).
>
> In my experience, the expensive part of indexing PDFs is parsing the
> documents, which takes a lot of CPU on the client side, not on the
> Solr server side. So make sure you do that parsing on the client and
> not on the server (see the extraction sketch below).
>
> Avoiding iii:
>
> Suggest that you write yourself a multi-threaded performance test so
> that you aren't guessing what your performance will be.
>
> We wrote one in Perl. It handles an individual thread (we were testing
> inquiry), and we wrote a little batch file / shell script to start up
> the desired number of threads.
>
> The main statement in our batch file (the rest just sets the
> variables); a shell script would be even easier (a sketch of one
> follows the Perl script below):
>
> for /L %%i in (1,1,%THREADS%) DO start /B perl solrtest.pl -h
>     %SOLRHOST% -c %COUNT% -u %1 -p %2 -r %SOLRREALM%
>     -f %SOLRLOC%\firstsynonyms.txt -l %SOLRLOC%\lastsynonyms.txt
>     -z %FUZZ%
>
> The Perl script:
>
> #!/usr/bin/perl
>
> #
> # Perl program to run a thread of solr testing
> #
>
> use Getopt::Std;            # For options processing
> use POSIX;                  # For time formatting
> use XML::Simple;            # For processing of XML config file
> use Data::Dumper;           # For debugging XML config file
> use HTTP::Request::Common;  # For HTTP request to Solr
> use HTTP::Response;
> use LWP::UserAgent;         # For HTTP request to Solr
>
> $host = "YOURHOST:8983";
> $realm = "YOUR AUTHENTICATION REALM";
> $firstlist = "firstsynonyms.txt";
> $lastlist = "lastsynonyms.txt";
> $fuzzy = "";
>
> $me = $0;
>
> sub usage() {
>     print "perl $me -c iterations [-d] [-h host:port] [-u user [-p password]]\n";
>     print "\t\t[-f firstnamefile] [-l lastnamefile] [-z fuzzy] [-r realm]\n";
>     exit(8);
> }
>
> #
> # Process the command line options, and open the output file.
> #
>
> getopts('dc:u:p:f:l:h:r:z:') || usage();
>
> if(!$opt_c) {
>     usage();
> }
>
> $count = $opt_c;
>
> if($opt_u) {
>     $user = $opt_u;
> }
>
> if($opt_p) {
>     $password = $opt_p;
> }
>
> if($opt_h) {
>     $host = $opt_h;
> }
>
> if($opt_f) {
>     $firstlist = $opt_f;
> }
>
> if($opt_l) {
>     $lastlist = $opt_l;
> }
>
> if($opt_r) {
>     $realm = $opt_r;
> }
>
> if($opt_z) {
>     $fuzzy = "~" . $opt_z;
> }
>
> $debug = $opt_d;
>
> #
> # If the host string does not include a :, add :80
> #
>
> if($host !~ /:/) {
>     $host = $host . ":80";
> }
>
> #
> # Read the lists of first and last names
> #
>
> open(SYNFILE,"<$firstlist") || die "Can't open first name list $firstlist\n";
> while(<SYNFILE>) {
>     @newwords = split /,/;
>     for($i=0; $i <= $#newwords; ++$i) {
>         $newwords[$i] =~ s/^\s+//;
>         $newwords[$i] =~ s/\s+$//;
>         $newwords[$i] = lc($newwords[$i]);
>     }
>     push @firstnames, @newwords;
> }
> close(SYNFILE);
>
> open(SYNFILE,"<$lastlist") || die "Can't open last name list $lastlist\n";
> while(<SYNFILE>) {
>     @newwords = split /,/;
>     for($i=0; $i <= $#newwords; ++$i) {
>         $newwords[$i] =~ s/^\s+//;
>         $newwords[$i] =~ s/\s+$//;
>         $newwords[$i] = lc($newwords[$i]);
>     }
>     push @lastnames, @newwords;
> }
> close(SYNFILE);
>
> print "$#firstnames First Names, $#lastnames Last Names\n";
> print "User: $user\n";
>
> my $userAgent = LWP::UserAgent->new(agent => 'solrtest.pl');
> $userAgent->credentials("$host",$realm,$user,$password);
>
> $uri = "http://$host/solr/select";
>
> $starttime = time();
>
> for($c=0; $c < $count; ++$c) {
>     $fname = $firstnames[rand $#firstnames];
>     $lname = $lastnames[rand $#lastnames];
>     $response = $userAgent->request(
>         POST $uri,
>         [
>             q    => "lnamesyn:$lname AND fnamesyn:$fname$fuzzy",
>             rows => "25"
>         ]);
>
>     if($debug) {
>         print "Query: lnamesyn:$lname AND fnamesyn:$fname$fuzzy";
>         print $response->content();
>     }
>     print "POST for $fname $lname completed, HTTP status=" .
>         $response->code . "\n";
> }
>
> $elapsed = time() - $starttime;
> $average = $elapsed / $count;
>
> print "Time: $elapsed s ($average/request)\n";
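Since a shell version of the launcher is mentioned above, here is a minimal POSIX sh sketch of the same loop. It assumes the same environment variables as the batch file and takes user/password as $1/$2; untested, adjust paths as needed:

    #!/bin/sh
    # Sketch of a shell equivalent of the batch launcher above. Assumes
    # THREADS, SOLRHOST, COUNT, SOLRREALM, SOLRLOC and FUZZ are already
    # set, and that user/password arrive as $1/$2 like the batch version.
    i=1
    while [ "$i" -le "$THREADS" ]; do
        perl solrtest.pl -h "$SOLRHOST" -c "$COUNT" -u "$1" -p "$2" \
             -r "$SOLRREALM" -f "$SOLRLOC/firstsynonyms.txt" \
             -l "$SOLRLOC/lastsynonyms.txt" -z "$FUZZ" &
        i=$((i + 1))
    done
    wait   # block until every test thread has finished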
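On keeping the PDF parsing off the Solr server: one way is to run Apache Tika on the client and post only the extracted text. A rough Perl sketch under stated assumptions: tika-app.jar in the working directory, Solr on localhost, and a schema with "id" and "text" fields (all assumptions, not from the thread):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTTP::Request::Common;   # POST helper
    use LWP::UserAgent;

    # Extract text client-side with the Tika CLI so the Solr server never
    # parses the PDF itself. ASSUMES: tika-app.jar in the current
    # directory, Solr at localhost:8983, "id"/"text" fields in the schema.
    my $pdf  = shift or die "usage: $0 file.pdf\n";
    my $text = `java -jar tika-app.jar --text "$pdf"`;  # the CPU-heavy step
    die "Tika extraction failed for $pdf\n" if $?;

    # Minimal XML escaping for the extracted body.
    for ($text) { s/&/&amp;/g; s/</&lt;/g; s/>/&gt;/g; }

    my $ua  = LWP::UserAgent->new(agent => 'pdfpost.pl');
    my $doc = qq{<add><doc><field name="id">$pdf</field>}
            . qq{<field name="text">$text</field></doc></add>};

    my $res = $ua->request(POST 'http://localhost:8983/solr/update',
                           Content_Type => 'text/xml',
                           Content      => $doc);
    print $res->status_line, "\n";
    # Send <commit/> (or rely on autoCommit) before the text is searchable.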
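For reference on the replication note above, master/slave replication in Solr 3.x is configured per core in solrconfig.xml. A minimal sketch; the host name, core name, and poll interval are placeholders, and the wiki page has the authoritative details:

    <!-- master core's solrconfig.xml -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">commit</str>
        <str name="confFiles">schema.xml,stopwords.txt</str>
      </lst>
    </requestHandler>

    <!-- each slave core's solrconfig.xml -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://master-host:8983/solr/corename/replication</str>
        <str name="pollInterval">00:00:60</str>
      </lst>
    </requestHandler>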
":80"; > } > > # > # Read the lists of first and last names > # > > open(SYNFILE,"<$firstlist") || die "Can't open first name list > $firstlist\n"; > while(<SYNFILE>) { > @newwords = split /,/; > for($i=0; $i <= $#newwords; ++$i) { > $newwords[$i] =~ s/^\s+//; > $newwords[$i] =~ s/\s+$//; > $newwords[$i] = lc($newwords[$i]); > } > push @firstnames, @newwords; > } > close(SYNFILE); > > open(SYNFILE,"<$lastlist") || die "Can't open last name list > $lastlist\n"; > while(<SYNFILE>) { > @newwords = split /,/; > for($i=0; $i <= $#newwords; ++$i) { > $newwords[$i] =~ s/^\s+//; > $newwords[$i] =~ s/\s+$//; > $newwords[$i] = lc($newwords[$i]); > } > push @lastnames, @newwords; > } > close(SYNFILE); > > > print "$#firstnames First Names, $#lastnames Last Names\n"; > print "User: $user\n"; > > > my $userAgent = LWP::UserAgent->new(agent => 'solrtest.pl'); > $userAgent->credentials("$host",$realm,$user,$password); > > $uri = "http://$host/solr/select"; > > $starttime = time(); > > for($c=0; $c < $count; ++$c) { > $fname = $firstnames[rand $#firstnames]; > $lname = $lastnames[rand $#lastnames]; > $response = $userAgent->request( > POST $uri, > [ > q => "lnamesyn:$lname AND fnamesyn:$fname$fuzzy", > rows => "25" > ]); > > if($debug) { > print "Query: lnamesyn:$lname AND fnamesyn:$fname$fuzzy"; > print $response->content(); > } > print "POST for $fname $lname completed, HTTP status=" . > $response->code . "\n"; > } > > $elapsed = time() - $starttime; > $average = $elapsed / $count; > > print "Time: $elapsed s ($average/request)\n"; > > > -----Original Message----- > From: Rode Gonzalez (libnova) [mailto:r...@libnova.es] > Sent: Saturday, August 13, 2011 3:50 AM > To: solr-user@lucene.apache.org > Subject: ideas for indexing large amount of pdf docs > > Hi all, > > I want to ask about the best way to implement a solution for indexing a > large amount of pdf documents between 10-60 MB each one. 100 to 1000 > users > connected simultaneously. > > I actually have 1 core of solr 3.3.0 and it works fine for a few number > of > pdf docs but I'm afraid about the moment when we enter in production > time. > > some possibilities: > > i. clustering. I have no experience in this, so it will be a bad idea > to > venture into this. > > ii. multicore solution. make some kind of hash to choose one core at > each > query (exact queries) and thus reduce the size of the individual > indexes to > consult or to consult all the cores at same time (complex queries). > > iii. do nothing more and wait for the catastrophe in the response times > :P > > > Someone with experience can help a bit to decide? > > Thanks a lot in advance. > > ----- > No se encontraron virus en este mensaje. > Comprobado por AVG - www.avg.com > Versión: 10.0.1392 / Base de datos de virus: 1520/3836 - Fecha de > publicación: 08/15/11 ----- No se encontraron virus en este mensaje. Comprobado por AVG - www.avg.com Versión: 10.0.1392 / Base de datos de virus: 1520/3836 - Fecha de publicación: 08/15/11