Thanks Erick, Bill. Your answers tell me that we're on the right track ;) I 
will study the master/slave architecture with many slaves. We may need it 
in the future =)

Best regards,

-----Original Message-----

From: Erick Erickson <>


Date: Sat, 13 Aug 2011 15:34:19 -0400

Subject: Re: ideas for indexing large amount of pdf docs

Ahhh, ok, my reply was irrelevant <G>...

Here's a good write-up on this problem: 

But Solr handles millions of documents on a single server in many cases,

so waiting until the search app falls over is actually feasible.

In general, if you can get an adequate query response time from a single

machine, you just set up a master/slave architecture and add as many slaves

as you need to handle your maximum load. So scaling wide is a very quick

process. Don't go to sharding unless and until your machine can't give 

response times at all...

Mark's paper outlines this very well.
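
To make the "add slaves to handle load" idea concrete, here is a minimal sketch of spreading queries over several slaves in round-robin order. The host names are hypothetical, and in practice a load balancer in front of the slaves would usually do this job; indexing would still go to the master only.

```python
import itertools
from urllib.parse import urlencode

# Hypothetical slave hosts; each one replicates the master's index.
SLAVES = [
    "http://slave1.example.com:8080/solr",
    "http://slave2.example.com:8080/solr",
]

# Endless iterator that hands out the slaves one after another.
_next_slave = itertools.cycle(SLAVES)

def build_query_url(query, rows=10):
    """Return a select URL on the next slave in round-robin order."""
    base = next(_next_slave)
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base}/select?{params}"
```

Each call to build_query_url() targets the next slave, so adding a machine only means adding one entry to the list.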



On Sat, Aug 13, 2011 at 2:13 PM, Rode Gonzalez (libnova)

<> wrote:

> Hi Erick,


> Our app inserts the PDFs from a back-office site, and people can

> search/consult through a front-end site. Both are written in PHP. I've

> installed a Tomcat exclusively for Solr.


> the pdf docs are indexed and not stored using the standard

> solr.extraction.ExtractingRequestHandler (solr-cell.jar and the other jars

> included in contrib/extraction dir, you know) in an offline mode

> (summarizing: the internal users submit the docs; these docs are saved on

> the server; there is a task that takes the docs and puts them into the 

> through a curl utility; when the task finishes, the doc is available to the

> frontend; once more, we use curl utilities to make queries to solr).


> The problem isn't the process of indexing. The max injection rate can be

> 1-60 docs at a time. The number of PDF docs can be 1000, 2000, 10,000,... I

> don't know exactly... but a lot of them, like the books in a library.


> But no problem about this, this part of the process runs offline. take a

> doc, index a doc; take another doc, index another doc, ...


> The problem is the response time when the number of PDFs grows and grows...

> What is the best way, the fantastic idea, to minimize the response

> time as much as possible once we go into production?


> Best,


> Rode.



> -----Original Message-----


> From: Erick Erickson <>


> To:


> Date: Sat, 13 Aug 2011 12:13:27 -0400


> Subject: Re: ideas for indexing large amount of pdf docs





> Yeah, parsing PDF files can be pretty resource-intensive, so one solution


> is to offload it somewhere else. You can use the Tika libraries in SolrJ


> to parse the PDFs on as many clients as you want, just transmitting the


> results to Solr for indexing.
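
The suggestion above is Tika inside a SolrJ (Java) client. As a rough illustration of the same division of work, this Python sketch covers only the last step: once a client has extracted the text of a PDF by whatever parser it uses, it wraps that text in Solr's XML update message and sends only that to the server. The field names "id" and "text" are assumptions about the schema, and the extraction step itself is not shown.

```python
import xml.etree.ElementTree as ET

def build_add_message(doc_id, text):
    """Build a Solr XML <add> message from text already parsed on the client.

    The field names "id" and "text" are assumed; use the ones in your schema.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    id_field = ET.SubElement(doc, "field", name="id")
    id_field.text = doc_id
    text_field = ET.SubElement(doc, "field", name="text")
    text_field.text = text
    # ElementTree escapes characters such as & and < in the field values.
    return ET.tostring(add, encoding="unicode")
```

The resulting string would be POSTed to the /update handler with a text/xml content type, so the Solr server never has to open the PDF itself.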




> How are all these docs being submitted? Is this some kind of


> on-the-fly indexing/searching or what? I'm mostly curious what


> your projected max ingestion rate is...




> Best


> Erick




> On Sat, Aug 13, 2011 at 4:49 AM, Rode Gonzalez (libnova)


> <> wrote:


>> Hi all,




>> I want to ask about the best way to implement a solution for indexing a


>> large amount of pdf documents between 10-60 MB each one. 100 to 1000 


>> connected simultaneously.




>> I actually have 1 core of solr 3.3.0 and it works fine for a few number 


>> pdf docs but I'm afraid about the moment when we enter in production 




>> some possibilities:




>> i. clustering. I have no experience in this, so it will be a bad idea to


>> venture into this.




>> ii. multicore solution. make some kind of hash to choose one core at each


>> query (exact queries) and thus reduce the size of the individual indexes to


>> consult or to consult all the cores at the same time (complex queries).
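
Option ii could look roughly like the sketch below, which is only an illustration of the idea and not a recommendation. The core names and the host are hypothetical. A stable hash of the document id picks one core for indexing and exact lookups, and a comma-separated list of all cores can be passed in Solr's "shards" request parameter for queries that must consult every core.

```python
import zlib

# Hypothetical core names; the real ones come from solr.xml.
CORES = ["core0", "core1", "core2", "core3"]

def core_for(doc_id, cores=CORES):
    """Pick the core for a document id.

    zlib.crc32 is used because it gives the same value on every run,
    unlike Python's built-in hash() for strings.
    """
    return cores[zlib.crc32(doc_id.encode("utf-8")) % len(cores)]

def shards_param(host, cores=CORES):
    """Build the value of the 'shards' parameter covering every core."""
    return ",".join(f"{host}/solr/{core}" for core in cores)
```

Note that changing the number of cores changes where most ids map, so documents would need to be reindexed after such a change.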




>> iii. do nothing more and wait for the catastrophe in the response times 






>> Someone with experience can help a bit to decide?




>> Thanks a lot in advance.



