Re: how to avoid OutOfMemoryError while indexing ?
You should set your RAMBufferSizeMB to something smaller than the full heap size of your JVM.

Mike McCandless
http://blog.mikemccandless.com

On Sat, Jan 26, 2013 at 11:39 PM, wgggfiy wuqiu.m...@qq.com wrote:

I find it very easy to run into an OutOfMemoryError. My idea was that Lucene could set the RAM usage automatically, but I couldn't find an API for that. My code:

    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40, analyzer);
    int mb = 1024 * 1024;
    double ram = Runtime.getRuntime().maxMemory() / mb;
    iwc.setRAMBufferSizeMB(ram);

But I still get an OutOfMemoryError. Can anyone help me? Thanks.

--
Email: wuqiu.m...@qq.com

View this message in context: http://lucene.472066.n3.nabble.com/how-to-avoid-OutOfMemoryError-while-indexing-tp4036484.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
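To illustrate the advice in the reply above, here is a minimal sketch of sizing the buffer as a fraction of the heap instead of the whole heap. The class name `RamBufferSizing`, the 25% fraction and the 256 MB cap are assumptions chosen for illustration, not Lucene recommendations (Lucene's own default buffer is 16 MB). The sketch uses only the JDK so it runs without Lucene; the one Lucene call, `setRAMBufferSizeMB`, appears in a comment.

```java
final class RamBufferSizing {
    private static final double BYTES_PER_MB = 1024.0 * 1024.0;

    /**
     * Returns a RAM buffer size in MB: the given fraction of the heap,
     * but never more than capMB. Both knobs are illustrative assumptions.
     */
    static double chooseRamBufferMB(long maxHeapBytes, double heapFraction, double capMB) {
        if (maxHeapBytes <= 0 || capMB <= 0.0
                || heapFraction <= 0.0 || heapFraction >= 1.0) {
            throw new IllegalArgumentException(
                "need a positive heap, a positive cap, and 0 < heapFraction < 1");
        }
        // Divide by a double so the MB value is not truncated.
        double heapMB = maxHeapBytes / BYTES_PER_MB;
        return Math.min(heapMB * heapFraction, capMB);
    }

    public static void main(String[] args) {
        // maxMemory() may be Long.MAX_VALUE when no limit is set; the cap handles that.
        long maxHeap = Runtime.getRuntime().maxMemory();
        double ramMB = chooseRamBufferMB(maxHeap, 0.25, 256.0);
        System.out.println(String.format("RAM buffer: %.1f MB", ramMB));

        // With Lucene on the classpath, the value would be applied like this:
        // iwc.setRAMBufferSizeMB(ramMB);
    }
}
```

The rest of the heap stays available for the analyzer, the documents being built, and segment merges, which is why the buffer should be well below the value returned by `maxMemory()`.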
Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content
Have you tried using the PDFParser [1] and the OfficeParser [2] classes from Tika? This question seems more appropriate for the Tika user mailing list [3].

[1] http://tika.apache.org/1.3/api/org/apache/tika/parser/pdf/PDFParser.html#parse(java.io.InputStream, org.xml.sax.ContentHandler, org.apache.tika.metadata.Metadata, org.apache.tika.parser.ParseContext)
[2] http://tika.apache.org/1.3/api/org/apache/tika/parser/microsoft/OfficeParser.html#parse(java.io.InputStream, org.xml.sax.ContentHandler, org.apache.tika.metadata.Metadata, org.apache.tika.parser.ParseContext)
[3] http://tika.apache.org/mail-lists.html

--
Adrien
Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content
You may be able to use Tika directly without needing to choose the specific parser classes, although choosing them yourself may give you the specific data you need without the extra overhead.

You could take a look at the Solr Extracting Request Handler source for an example:
http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_4_1_0/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/

Basically, Tika extracts a bunch of metadata, and then you have to add the metadata you select to your Lucene documents. "content" is the main document body text.

You could try Solr itself to see how it works:
http://wiki.apache.org/solr/ExtractingRequestHandler

-- Jack Krupansky

-----Original Message-----
From: Adrien Grand
Sent: Sunday, January 27, 2013 12:53 PM
To: java-user@lucene.apache.org
Subject: Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content

Have you tried using the PDFParser [1] and the OfficeParser [2] classes from Tika? This question seems more appropriate for the Tika user mailing list [3].

[1] http://tika.apache.org/1.3/api/org/apache/tika/parser/pdf/PDFParser.html#parse(java.io.InputStream, org.xml.sax.ContentHandler, org.apache.tika.metadata.Metadata, org.apache.tika.parser.ParseContext)
[2] http://tika.apache.org/1.3/api/org/apache/tika/parser/microsoft/OfficeParser.html#parse(java.io.InputStream, org.xml.sax.ContentHandler, org.apache.tika.metadata.Metadata, org.apache.tika.parser.ParseContext)
[3] http://tika.apache.org/mail-lists.html

--
Adrien
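The selection step described in the reply above (keep the body text as "content", then copy only chosen metadata) can be sketched as follows. This is an illustration under stated assumptions, not the Solr handler's code: plain maps stand in for Tika's metadata and for the Lucene document so that the sketch runs with the JDK alone, and the class name `ExtractedFieldSelector` and the metadata keys are hypothetical. In a real indexer, each resulting entry would become a field on a Lucene document.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ExtractedFieldSelector {
    /**
     * Builds the field set for one document: the body text under "content",
     * plus each wanted metadata key that has a non-blank value.
     */
    static Map<String, String> selectFields(String bodyText,
                                            Map<String, String> metadata,
                                            List<String> wantedKeys) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("content", bodyText == null ? "" : bodyText);
        for (String key : wantedKeys) {
            if ("content".equals(key)) {
                continue; // never let metadata overwrite the body text
            }
            String value = metadata.get(key);
            if (value != null && !value.trim().isEmpty()) {
                fields.put(key, value.trim());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical values standing in for what an extractor might return.
        Map<String, String> metadata = new HashMap<>();
        metadata.put("title", "Quarterly report");
        metadata.put("author", " Sai ");
        metadata.put("producer", "some PDF tool"); // extracted but not wanted

        Map<String, String> fields = selectFields(
            "Body text of the document", metadata, Arrays.asList("title", "author"));
        System.out.println(fields.keySet());
    }
}
```

Keeping the wanted keys in an explicit list means metadata that the extractor returns but the index does not need is simply ignored.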
Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content
We are not using Solr, just the Lucene core 4.0 engine. I am trying to see whether we can use the Tika library to extract textual information from PDF/Word/Excel documents. I am mainly interested in reading the contents of the documents and indexing them with Lucene.

My question is: is the Tika framework good enough, or is there a better library? Has anyone had issues or other experiences with using Tika?

Thanks,
Sai.

--
View this message in context: http://lucene.472066.n3.nabble.com/Readers-for-extracting-textual-info-from-pd-doc-excel-for-indexing-the-actual-content-tp4036379p4036557.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content
Re-read my last message, and then take a look at that Solr source code, which will give you an idea of how to use Tika even though you are using Lucene only. If you have specific questions, please be specific.

To answer your latest question: yes, Tika is good enough. Solr's /update/extract handler uses it, and Solr is based on Lucene.

-- Jack Krupansky

-----Original Message-----
From: saisantoshi
Sent: Sunday, January 27, 2013 2:09 PM
To: java-user@lucene.apache.org
Subject: Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content

We are not using Solr, just the Lucene core 4.0 engine. I am trying to see whether we can use the Tika library to extract textual information from PDF/Word/Excel documents. I am mainly interested in reading the contents of the documents and indexing them with Lucene.

My question is: is the Tika framework good enough, or is there a better library? Has anyone had issues or other experiences with using Tika?

Thanks,
Sai.