Thanks Erick, Bill. Your answers tell me that we're on the right track ;) I
will study the master/slave architecture with many slaves; perhaps we will
need it in the future =)

Best regards,
Rode.


-----Original Message-----
From: Erick Erickson <erickerick...@gmail.com>
To: solr-user@lucene.apache.org
Date: Sat, 13 Aug 2011 15:34:19 -0400
Subject: Re: ideas for indexing large amount of pdf docs

Ahhh, ok, my reply was irrelevant <G>...

Here's a good write-up on this problem:
http://www.lucidimagination.com/content/scaling-lucene-and-solr

But Solr handles millions of documents on a single server in many cases,
so waiting until the search app falls over is actually feasible.

In general, if you can get an adequate query response time from a single
machine, you just set up a master/slave architecture and add as many slaves
as you need to handle your maximum load. So scaling wide is a very quick
process. Don't go to sharding unless and until your machine can't give
adequate response times at all...
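(For illustration, a minimal sketch of what that master/slave setup looks
like in Solr 3.x replication config; the host name, poll interval, and conf
file list below are assumptions to adapt, not a drop-in config.)

<!-- solrconfig.xml on the master -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <!-- make a new index version available to slaves after every commit -->
    <str name="replicateAfter">commit</str>
    <!-- also replicate config files so slaves stay in sync -->
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- solrconfig.xml on each slave -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <!-- assumed master host/core; point this at your real master -->
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <!-- how often the slave polls the master for a newer index -->
    <str name="pollInterval">00:05:00</str>
  </lst>
</requestHandler>

A load balancer in front of the slaves then spreads query traffic, and the
master only handles indexing.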

Mark's paper outlines this very well.

Best
Erick

On Sat, Aug 13, 2011 at 2:13 PM, Rode Gonzalez (libnova) <r...@libnova.es> wrote:

> Hi Erick,
>
> Our app inserts the pdfs from a back-office site, and people can
> search/consult them through a front-end site. Both are written in PHP.
> I've installed a Tomcat exclusively for Solr.
>
> The pdf docs are indexed and not stored, using the standard
> solr.extraction.ExtractingRequestHandler (solr-cell.jar and the other jars
> included in the contrib/extraction dir, you know) in an offline mode
> (summarizing: the internal users submit the docs; these docs are saved on
> the server; a task takes the docs and puts them into the indexer through a
> curl utility; when the task finishes, the doc is available to the front
> end; once more, we use curl utilities to make queries to Solr).
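(For illustration, the kind of curl call that indexing step typically makes
against the ExtractingRequestHandler; the host, id, and file path are
placeholders, and literal.id assumes "id" is the schema's unique key field.)

# send one pdf to Solr Cell for extraction + indexing, then commit
curl "http://localhost:8983/solr/update/extract?literal.id=book-0001&commit=true" \
     -F "myfile=@/data/docs/book-0001.pdf"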

>
> The problem isn't the process of indexing. The max ingestion rate can be
> 1-60 docs at a time. The number of pdf docs can be 1000, 2000, 10,000,... I
> don't know exactly... but a lot of them, like the books in a library.
>
> But no problem with that; this part of the process runs offline: take a
> doc, index a doc; take another doc, index another doc, ...
>
> The problem is the response time when the number of pdf's grows and grows...
> What is the best way, the fantastic idea, to minimize this time as much as
> possible once we go into production?
>
> Best,
>
> Rode.

> -----Original Message-----
> From: Erick Erickson <erickerick...@gmail.com>
> To: solr-user@lucene.apache.org
> Date: Sat, 13 Aug 2011 12:13:27 -0400
> Subject: Re: ideas for indexing large amount of pdf docs
>

> Yeah, parsing PDF files can be pretty resource-intensive, so one solution
> is to offload it somewhere else. You can use the Tika libraries in SolrJ
> to parse the PDFs on as many clients as you want, just transmitting the
> results to Solr for indexing.
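(A minimal sketch of that client-side approach with SolrJ 3.x and Tika; the
Solr URL and the id/title/text field names are assumptions about your
schema, not a definitive implementation.)

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class PdfIndexer {
    public static void main(String[] args) throws Exception {
        // assumed Solr location; the heavy Tika work happens on this client
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        File pdf = new File(args[0]);
        BodyContentHandler text = new BodyContentHandler(-1); // -1 = no write limit
        Metadata meta = new Metadata();

        // parse the pdf locally so the Solr server never sees the raw file
        InputStream in = new FileInputStream(pdf);
        try {
            new AutoDetectParser().parse(in, text, meta, new ParseContext());
        } finally {
            in.close();
        }

        // ship only the extracted text + metadata to Solr for indexing
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", pdf.getName());        // assumed unique key
        doc.addField("title", meta.get("title")); // assumed schema field
        doc.addField("text", text.toString());    // assumed indexed-only field
        solr.add(doc);
        solr.commit();
    }
}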

>
> How are all these docs being submitted? Is this some kind of
> on-the-fly indexing/searching or what? I'm mostly curious what
> your projected max ingestion rate is...

>
> Best
> Erick
>
> On Sat, Aug 13, 2011 at 4:49 AM, Rode Gonzalez (libnova) <r...@libnova.es> wrote:

>> Hi all,
>>
>> I want to ask about the best way to implement a solution for indexing a
>> large amount of pdf documents, between 10-60 MB each one, with 100 to
>> 1000 users connected simultaneously.
>>
>> I actually have 1 core of Solr 3.3.0 and it works fine for a small
>> number of pdf docs, but I'm worried about the moment when we go into
>> production.
>>
>> Some possibilities:
>>
>> i. clustering. I have no experience with this, so it could be risky to
>> venture into it.
>>
>> ii. multicore solution. Make some kind of hash to choose one core for
>> each query (exact queries) and thus reduce the size of the individual
>> indexes to consult, or consult all the cores at the same time (complex
>> queries). (See the sketch after this list.)
>>
>> iii. do nothing more and wait for the catastrophe in the response times :P
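(To make option ii concrete, a hypothetical sketch of hash-based core
routing; the core count, core naming scheme, and shards list are invented
for illustration.)

public class CoreRouter {
    // pick one of N cores by hashing the document id (assumed scheme)
    public static String coreUrlFor(String docId, int numCores) {
        int n = (docId.hashCode() & 0x7fffffff) % numCores; // mask keeps it non-negative
        return "http://localhost:8983/solr/core" + n;       // assumed core naming
    }

    public static void main(String[] args) {
        // exact (known-id) lookups hit exactly one small core
        System.out.println(coreUrlFor("book-12345", 4));
        // complex queries can instead fan out over all cores with Solr's
        // distributed search, e.g.
        // q=...&shards=localhost:8983/solr/core0,localhost:8983/solr/core1,...
    }
}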

>>
>> Someone with experience can help a bit to decide?
>>
>> Thanks a lot in advance.
