RE: ideas for indexing large amount of pdf docs

2011-08-16 Thread Rode González
Sent: Monday, August 15, 2011 14:54 To: solr-user@lucene.apache.org Subject: RE: ideas for indexing large amount of pdf docs Note on i: Solr replication provides pretty good clustering support out-of-the-box, including replication of multiple cores. Read the Wiki on replication

RE: ideas for indexing large amount of pdf docs

2011-08-15 Thread Jaeger, Jay - DOT
Note on i: Solr replication provides pretty good clustering support out-of-the-box, including replication of multiple cores. Read the Wiki on replication (Google +solr +replication if you don't know where it is). In my experience, the problem with indexing PDFs is it takes a lot of CPU on

Re: ideas for indexing large amount of pdf docs

2011-08-13 Thread Erick Erickson
Yeah, parsing PDF files can be pretty resource-intensive, so one solution is to offload it somewhere else. You can use the Tika libraries in SolrJ to parse the PDFs on as many clients as you want, just transmitting the results to Solr for indexing. How are all these docs being submitted? Is this
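
A rough sketch of the client-side approach described above, in Python rather than SolrJ so it runs with the standard library alone. Everything named here is an assumption for illustration: `fake_extract` stands in for a real Tika call, and the field names `id` and `text` are hypothetical and would have to match your schema. The point is only that the CPU-heavy parse happens on the client and Solr receives plain text.

```python
import json
from typing import Callable, Dict, Iterable, List


def build_solr_docs(paths: Iterable[str],
                    extract_text: Callable[[str], str]) -> List[Dict[str, str]]:
    """Parse each PDF on the client and keep only the fields to index."""
    docs = []
    for path in paths:
        # The expensive step: runs on the client machine, not on the Solr server.
        text = extract_text(path)
        # "id" and "text" are hypothetical field names, not taken from the thread.
        docs.append({"id": path, "text": text})
    return docs


def to_update_body(docs: List[Dict[str, str]]) -> bytes:
    """Serialize the documents as a JSON array, ready to send to an update handler."""
    return json.dumps(docs).encode("utf-8")


def fake_extract(path: str) -> str:
    """Placeholder for a real PDF-to-text call such as Tika (assumption)."""
    return "extracted text of " + path


docs = build_solr_docs(["a.pdf", "b.pdf"], fake_extract)
body = to_update_body(docs)
```

The resulting `body` would then be sent to Solr's update handler over HTTP. The exact URL and accepted JSON shape depend on the Solr version, so check the update handler documentation for your release before relying on this format.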

Re: ideas for indexing large amount of pdf docs

2011-08-13 Thread Rode Gonzalez (libnova)
this time all as possible when we entering in production time. Best, Rode. -Original Message- From: Erick Erickson erickerick...@gmail.com To: solr-user@lucene.apache.org Date: Sat, 13 Aug 2011 12:13:27 -0400 Subject: Re: ideas for indexing large amount of pdf docs Yeah, parsing PDF

Re: ideas for indexing large amount of pdf docs

2011-08-13 Thread Bill Bell
You could send PDF for processing using a queue solution like Amazon SQS. Kick off Amazon instances to process the queue. Once you process with Tika to text just send the update to Solr. Bill Bell Sent from mobile On Aug 13, 2011, at 10:13 AM, Erick Erickson erickerick...@gmail.com wrote:
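
The queue idea above can be sketched locally as follows. This is not Amazon SQS code: an in-process `queue.Queue` and threads stand in for the hosted queue and the worker instances, and `extract_text` is a hypothetical placeholder for the Tika step. In a real deployment the workers would be separate machines or processes, since PDF parsing is CPU-bound.

```python
import queue
import threading


def extract_text(path):
    """Placeholder for the Tika PDF-to-text step (assumption)."""
    return "text of " + path


def worker(jobs, results, lock):
    """Pull paths off the queue until a None sentinel arrives."""
    while True:
        path = jobs.get()
        if path is None:
            break
        doc = {"id": path, "text": extract_text(path)}
        with lock:
            # In the real setup this is where the update would be sent to Solr.
            results.append(doc)


def process_all(paths, num_workers=4):
    """Fan the paths out to num_workers workers and collect the parsed docs."""
    jobs = queue.Queue()
    results = []
    lock = threading.Lock()
    threads = [threading.Thread(target=worker, args=(jobs, results, lock))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for p in paths:
        jobs.put(p)
    for _ in threads:
        jobs.put(None)  # one shutdown sentinel per worker
    for t in threads:
        t.join()
    return results


results = process_all(["a.pdf", "b.pdf", "c.pdf"], num_workers=2)
```

Results arrive in whatever order the workers finish, so nothing downstream should depend on ordering. Adding capacity is then a matter of starting more workers against the same queue.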

Re: ideas for indexing large amount of pdf docs

2011-08-13 Thread Erick Erickson
To: solr-user@lucene.apache.org Date: Sat, 13 Aug 2011 12:13:27 -0400 Subject: Re: ideas for indexing large amount of pdf docs Yeah, parsing PDF files can be pretty resource-intensive, so one solution is to offload it somewhere else. You can use the Tika libraries in SolrJ to parse

Re: ideas for indexing large amount of pdf docs

2011-08-13 Thread Rode Gonzalez (libnova)
@lucene.apache.org Date: Sat, 13 Aug 2011 15:34:19 -0400 Subject: Re: ideas for indexing large amount of pdf docs Ahhh, ok, my reply was irrelevant G... Here's a good write-up on this problem: http://www.lucidimagination.com/content/scaling-lucene-and-solr