Re: how to avoid OutOfMemoryError while indexing ?

2013-01-27 Thread Michael McCandless
You should set your RAMBufferSizeMB to something smaller than the full heap size of your JVM. Mike McCandless http://blog.mikemccandless.com On Sat, Jan 26, 2013 at 11:39 PM, wgggfiy wuqiu.m...@qq.com wrote: I found it is very easy to come into OutOfMemoryError. My idea is that lucene could
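Mike's advice maps to a single setting on `IndexWriterConfig`. A minimal sketch against the Lucene 4.0 API (the index path and the 64 MB figure are illustrative assumptions, not values from the thread):

```java
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexerSetup {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/tmp/index"));
        IndexWriterConfig config =
            new IndexWriterConfig(Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
        // Keep the indexing buffer well below the JVM heap: e.g. with -Xmx512m,
        // a 64 MB buffer leaves headroom for segment merges and the rest of the app.
        config.setRAMBufferSizeMB(64.0);
        try (IndexWriter writer = new IndexWriter(dir, config)) {
            // add documents here...
        }
    }
}
```

Lucene flushes the in-memory buffer to a new segment whenever it fills, so indexing proceeds within a bounded footprint regardless of how many documents you add.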

Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content

2013-01-27 Thread Adrien Grand
Have you tried using the PDFParser [1] and the OfficeParser [2] classes from Tika? This question seems to be more appropriate for the Tika user mailing list [3]? [1] http://tika.apache.org/1.3/api/org/apache/tika/parser/pdf/PDFParser.html#parse(java.io.InputStream, org.xml.sax.ContentHandler,
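For reference, calling the `PDFParser` from [1] directly looks roughly like this (a sketch against the Tika 1.3 `Parser` interface; the file path is a placeholder):

```java
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;

public class PdfTextExtractor {
    public static String extract(String path) throws Exception {
        PDFParser parser = new PDFParser();
        // -1 disables the default 100k-character write limit
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata(); // receives title, author, etc.
        try (InputStream in = new FileInputStream(path)) {
            parser.parse(in, handler, metadata, new ParseContext());
        }
        return handler.toString(); // plain text of the document body
    }
}
```

`OfficeParser` [2] exposes the same four-argument `parse` method, so the code is identical for Word and Excel files apart from the parser class.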

Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content

2013-01-27 Thread Jack Krupansky
You may be able to use Tika directly without needing to choose the specific classes, although the latter may give you the specific data you need without the extra overhead. You could take a look at the Solr Extracting Request Handler source for an example:
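Using Tika "directly without needing to choose the specific classes" usually means `AutoDetectParser`, which sniffs the content type and dispatches to `PDFParser`, `OfficeParser`, and so on. A hedged sketch (the path is a placeholder):

```java
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class AnyDocExtractor {
    public static String extract(String path) throws Exception {
        // Detects PDF, Word, Excel, ... and delegates to the matching parser
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // no character limit
        Metadata metadata = new Metadata();
        try (InputStream in = new FileInputStream(path)) {
            parser.parse(in, handler, metadata, new ParseContext());
        }
        return handler.toString();
    }
}
```

The trade-off Jack mentions: auto-detection is simpler, while the format-specific parsers let you configure format-specific options without the detection step.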

Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content

2013-01-27 Thread saisantoshi
We are not using Solr; we are using just the Lucene core 4.0 engine. I am trying to see if we can use the Tika library to extract textual information from PDF/Word/Excel documents. I am mainly interested in reading the contents of the documents and indexing them using Lucene. My question here is, is Tika
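Putting the two halves together for the use case above: extract text with the `Tika` facade, then index it with a Lucene 4.0 `IndexWriter`. A sketch only; the field names and the in-memory `RAMDirectory` are illustrative assumptions:

```java
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.apache.tika.Tika;

public class TikaLuceneIndexer {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika(); // facade: type detection + parsing in one call
        IndexWriterConfig config =
            new IndexWriterConfig(Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
        try (IndexWriter writer = new IndexWriter(new RAMDirectory(), config)) {
            for (String path : args) {
                String text = tika.parseToString(new File(path)); // pdf/doc/xls alike
                Document doc = new Document();
                doc.add(new StringField("path", path, Field.Store.YES)); // exact-match key
                doc.add(new TextField("contents", text, Field.Store.NO)); // analyzed body
                writer.addDocument(doc);
            }
        }
    }
}
```

Tika only depends on Lucene at the packaging level Solr provides, so this pairing works fine with Lucene core alone.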

Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content

2013-01-27 Thread Jack Krupansky
Re-read my last message, and then take a look at that Solr source code, which will give you an idea of how to use Tika even though you are using Lucene only. If you have specific questions, please be specific. To answer your latest question: yes, Tika is good enough. Solr /update/extract uses