Re: Java heap space error

2014-07-24 Thread Marcello Lorenzi

Hi,
Did you set a Garbage collection strategy on your JVM ?

Marcello

On 07/24/2014 03:32 PM, Ameya Aware wrote:

Hi

I am in the process of indexing around 200,000 documents.

I have increased the Java heap space to 4 GB using the command below:

java -Xmx4096M -Xms4096M -jar start.jar

Still, after indexing around 15,000 documents, it gives the Java heap space
error again.


Any fix for this?

Thanks,
Ameya





Re: Java heap space error

2014-07-24 Thread Marcello Lorenzi
I think that with a large heap it is advisable to monitor the garbage 
collection behavior and choose a strategy suited to your workload. 
In my production environment, with a 6 GB heap, I set these 
parameters (server with 8 cores):


-server -Xms6144m -Xmx6144m -XX:MaxPermSize=512m 
-Dcom.sun.management.jmxremote -XX:+UseParNewGC -XX:+UseConcMarkSweepGC 
-XX:+CMSIncrementalMode -XX:+CMSParallelRemarkEnabled 
-XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 
-XX:ConcGCThreads=6 -XX:ParallelGCThreads=6
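
To make the "monitor the garbage collection behavior" advice concrete, here is a minimal sketch that reads the collector and memory-pool statistics of the JVM it runs in through the standard `java.lang.management` beans. The class name `GcReport` is only illustrative and is not part of Solr; the collector and pool names it prints depend on the JVM version and on the GC flags in use.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Illustrative helper (not part of Solr): reports GC activity and memory-pool
// usage for the current JVM using the platform MXBeans.
class GcReport {

    // Builds a plain-text report, one line per collector and per memory pool.
    static String build() {
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Count and time are -1 when the collector does not report them.
            sb.append(String.format("collector=%s count=%d timeMs=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime()));
        }
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            // getMax() is -1 when the pool has no defined maximum.
            long maxMb = usage.getMax() < 0 ? -1 : usage.getMax() >> 20;
            sb.append(String.format("pool=%s usedMB=%d maxMB=%d%n",
                    pool.getName(), usage.getUsed() >> 20, maxMb));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(build());
    }
}
```

The same beans can be browsed remotely with a JMX client such as JConsole when the JVM is started with the JMX agent enabled. Watching how the old-generation pool grows between collections is usually more informative than a single snapshot.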


Marcello

On 07/24/2014 03:53 PM, Ameya Aware wrote:
I did not make any other change than this.. rest of the settings are 
default.


Do i need to set garbage collection strategy?


On Thu, Jul 24, 2014 at 9:49 AM, Marcello Lorenzi mlore...@sorint.it wrote:


Hi,
Did you set a Garbage collection strategy on your JVM ?

Marcello


On 07/24/2014 03:32 PM, Ameya Aware wrote:

Hi

I am in the process of indexing around 200,000 documents.

I have increased the Java heap space to 4 GB using the command below:

java -Xmx4096M -Xms4096M -jar start.jar

Still, after indexing around 15,000 documents, it gives the Java heap
space error again.


Any fix for this?

Thanks,
Ameya







Heap size and Solr 4.3

2013-12-16 Thread Marcello Lorenzi

Hi All,
we have deployed a new Solr 4.3 instance (2 nodes with SolrCloud) in our 
production environment, but this morning one node ran out of memory, and 
we noticed that the JVM uses a lot of Old Gen space during normal 
operation.


Which factors drive this high heap usage, and how can we reduce it?

Thanks,
Marcello


SolR vs large PDF

2013-11-27 Thread Marcello Lorenzi

Hi All,
in our test environment we have implemented a new search engine based on 
Solr 4.3, with 2 instances hosted on different servers and 1 shard 
on each servlet container.


During some stress tests we noticed a bottleneck when crawling large 
PDF files, which blocks the serving of query results from the collections.


Is it possible to reduce or mitigate the overhead created by PDFBox 
during crawling?


Thanks,
Marcello


Re: SolR vs large PDF

2013-11-27 Thread Marcello Lorenzi

Hi Erick,
In our architecture we use Apache ManifoldCF: the scheduling is invoked 
from Manifold-web, and the Manifold-agent takes the PDF files from the 
filesystem to the Solr instances. Is it possible to redirect the 
Manifold scheduling to the SolrJ instance for specific schedules?


Thanks,
Marcello

On 11/27/2013 06:14 PM, Erick Erickson wrote:

I'm assuming you're using the ExtractingRequestHandler. Offloading
the entire work onto your Solr box that is also serving queries
and indexing is not going to scale well.

Consider using Tika/SolrJ (Tika is what the ERH uses anyway) to
offload the PDF parsing amongst as many clients as you can afford.
Here's a way to get started:

http://searchhub.org/2012/02/14/indexing-with-solrj/
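
The idea of parsing on the client side can be sketched with the JDK alone. In the sketch below the `extractor` function stands in for a Tika parse call and the `batchSink` consumer stands in for a SolrJ add; both are placeholders chosen for illustration, not real Tika or SolrJ APIs, and the class name `OffloadedIndexer` is hypothetical. Documents are parsed in a worker pool outside Solr, and only the extracted text is handed over in batches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical sketch: extraction runs in client-side worker threads, so the
// search server only receives already-extracted text.
class OffloadedIndexer {

    // Parses every path with `extractor` and sends the results to `batchSink`
    // in batches of `batchSize`. Returns the number of documents sent.
    static int index(List<String> paths,
                     Function<String, String> extractor,   // placeholder for a Tika parse
                     Consumer<List<String>> batchSink,     // placeholder for a SolrJ add
                     int workers, int batchSize) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String path : paths) {
                Callable<String> task = () -> extractor.apply(path);
                pending.add(pool.submit(task));
            }
            List<String> batch = new ArrayList<>();
            int sent = 0;
            for (Future<String> future : pending) {
                batch.add(future.get()); // waits for the worker; keeps submission order
                if (batch.size() == batchSize) {
                    batchSink.accept(new ArrayList<>(batch));
                    sent += batch.size();
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                batchSink.accept(new ArrayList<>(batch));
                sent += batch.size();
            }
            return sent;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while extracting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("extraction failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Demo with stand-ins: "extraction" just upper-cases the file name.
        int sent = index(Arrays.asList("a.pdf", "b.pdf", "c.pdf"),
                String::toUpperCase,
                batch -> System.out.println("indexing batch " + batch),
                2, 2);
        System.out.println("documents sent: " + sent);
    }
}
```

With this split, a slow or very large PDF ties up a client worker thread rather than a request thread on the node that is also serving queries.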

Best,
Erick


On Wed, Nov 27, 2013 at 10:00 AM, Marcello Lorenzi mlore...@sorint.it wrote:


Hi All,
in our test environment we have implemented a new search engine based on
Solr 4.3, with 2 instances hosted on different servers and 1 shard on each
servlet container.

During some stress tests we noticed a bottleneck when crawling large PDF
files, which blocks the serving of query results from the collections.

Is it possible to reduce or mitigate the overhead created by PDFBox during
crawling?

Thanks,
Marcello





Re: PDF indexing issues

2013-11-18 Thread Marcello Lorenzi

Hi,
I have checked the PDFBox Jira issue, but it does not offer a solution, 
because some users experienced the same problem with different CMAP 
entries. Would it be possible to update the PDFBox library in the Solr 
installation?


Thanks,
Marcello

On 11/15/2013 06:27 PM, Furkan KAMACI wrote:

You should check the Apache PDFBox project. A similar question:
https://issues.apache.org/jira/browse/PDFBOX-940


2013/11/15 Marcello Lorenzi mlore...@sorint.it


Hi,
during our testing of Apache Solr 4.3, we noticed some errors
during PDF indexing:

ERROR - 2013-11-15 15:14:26.248; org.apache.pdfbox.pdmodel.font.PDCIDFont;
Error: Could not parse predefined CMAP file for 'PDFXC30-Indentity0-UCS2'
ERROR - 2013-11-15 15:14:36.108; org.apache.pdfbox.pdmodel.font.PDCIDFont;
Error: Could not parse predefined CMAP file for '--UCS2'

and

ERROR - 2013-11-15 15:12:18.928; org.apache.pdfbox.filter.FlateFilter;
FlateFilter: stop reading corrupt stream due to a DataFormatException

Could these errors be related to the format of the PDF files?

Thanks,
Marcello





Re: Solr xml img parsing exception

2013-11-15 Thread Marcello Lorenzi

Hi Jack,
we have analyzed the issue and found duplicate Tika JARs in the Tomcat 
classpath. After removing the duplicate library, the search engine now 
works as expected.
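
For anyone hitting the same problem, a quick way to spot duplicate libraries is to ask the class loader for every location that provides a given class. This is only a sketch: the class name `ClasspathDuplicates` is illustrative, and `org.apache.tika.Tika` is used as an example default. To inspect a web application, the check has to run inside that webapp so that it sees the webapp class loader.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

// Illustrative helper: lists every classpath location that provides a class.
// More than one location usually means the library is deployed twice.
class ClasspathDuplicates {

    static List<URL> locationsOf(String className, ClassLoader loader) {
        String resource = className.replace('.', '/') + ".class";
        List<URL> found = new ArrayList<>();
        try {
            Enumeration<URL> urls = loader.getResources(resource);
            while (urls.hasMoreElements()) {
                found.add(urls.nextElement());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        // Example default only; pass any fully qualified class name instead.
        String className = args.length > 0 ? args[0] : "org.apache.tika.Tika";
        List<URL> locations = locationsOf(className, ClasspathDuplicates.class.getClassLoader());
        System.out.println(className + " found in " + locations.size() + " location(s)");
        for (URL url : locations) {
            System.out.println("  " + url);
        }
    }
}
```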


Thanks for the support,
Marcello

On 11/14/2013 05:24 PM, Jack Krupansky wrote:

The actual error appears to be:

Caused by: org.xml.sax.SAXParseException; lineNumber: 91; columnNumber:
105; The element type "img" must be terminated by the matching end-tag
"</img>".

So, check the input document at line 91, column 105. There should be 
an <img> tag there, but SAX is complaining that there is no matching 
</img>.
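
The behavior is easy to reproduce with the SAX parser that ships with the JDK. The sketch below (the class name `ImgTagCheck` is illustrative) parses a string as XML and reports the position of the first well-formedness error. An HTML-style `<img ...>` without a closing tag fails, while the self-closing form `<img ... />` parses cleanly.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative check: returns null for well-formed XML, otherwise
// "line:column message" for the first fatal parse error.
class ImgTagCheck {

    static String firstError(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return e.getLineNumber() + ":" + e.getColumnNumber() + " " + e.getMessage();
        } catch (Exception e) {
            // Parser configuration or I/O problems are not expected for in-memory input.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstError("<p><img src=\"a.png\"></p>"));  // HTML style: not well-formed XML
        System.out.println(firstError("<p><img src=\"a.png\"/></p>")); // self-closed: well-formed
    }
}
```

This suggests the crawled page is HTML being handed to an XML parser, which would fit content that is detected or declared as XML when it is not strictly well-formed.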


-- Jack Krupansky

-Original Message- From: Marcello Lorenzi
Sent: Thursday, November 14, 2013 9:26 AM
To: solr-user@lucene.apache.org
Subject: Solr xml img parsing exception

Hi,
I have installed a Solr 4.3 instance and we have configured ManifoldCF
to pass web content to the shard collection, but during crawling we
have noticed many occurrences of this exception:

ERROR - 2013-11-14 15:13:57.954; org.apache.solr.common.SolrException;
org.apache.solr.common.SolrException:
org.apache.tika.exception.TikaException: XML parse error
at com.lsegroup.solr.handler.CwsExtractingDocumentLoader.load(CwsExtractingDocumentLoader.java:150)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:242)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:221)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:107)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:76)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:934)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:90)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:515)
at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1012)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:642)
at org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:223)
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1597)
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1555)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.tika.exception.TikaException: XML parse error
at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at com.lsegroup.solr.handler.CwsExtractingDocumentLoader.load(CwsExtractingDocumentLoader.java:147)
... 24 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 91; columnNumber: 105; The element type "img" must be terminated by the matching end-tag "</img>".
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:198)
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:441)
at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:368)
at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1388)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1753

PDF indexing issues

2013-11-15 Thread Marcello Lorenzi

Hi,
during our testing of Apache Solr 4.3, we noticed some errors 
during PDF indexing:


ERROR - 2013-11-15 15:14:26.248; org.apache.pdfbox.pdmodel.font.PDCIDFont; Error: Could not parse predefined CMAP file for 'PDFXC30-Indentity0-UCS2'
ERROR - 2013-11-15 15:14:36.108; org.apache.pdfbox.pdmodel.font.PDCIDFont; Error: Could not parse predefined CMAP file for '--UCS2'


and

ERROR - 2013-11-15 15:12:18.928; org.apache.pdfbox.filter.FlateFilter; 
FlateFilter: stop reading corrupt stream due to a DataFormatException


Could these errors be related to the format of the PDF files?
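
As background for the second error: a `DataFormatException` is what the JDK's `java.util.zip.Inflater` throws when a Flate (zlib/deflate) stream is not valid, so that message points at damaged or non-standard compressed data inside the PDF rather than at Solr itself. The sketch below only demonstrates the JDK behavior; it is not PDFBox code, and the class name `FlateCheck` is illustrative.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustrative demo of how a corrupt Flate stream surfaces in the JDK.
class FlateCheck {

    // Compresses data into a zlib-wrapped deflate stream.
    static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Returns true only if the whole stream can be decompressed.
    static boolean inflates(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    return false; // truncated stream or missing preset dictionary
                }
            }
            return true;
        } catch (DataFormatException e) {
            return false; // invalid compressed data
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] good = deflate("stream content".getBytes(StandardCharsets.UTF_8));
        byte[] bad = good.clone();
        bad[0] ^= (byte) 0xFF; // damage the zlib header
        System.out.println("valid stream inflates:   " + inflates(good));
        System.out.println("corrupt stream inflates: " + inflates(bad));
    }
}
```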

Thanks,
Marcello


Solr xml img parsing exception

2013-11-14 Thread Marcello Lorenzi

Hi,
I have installed a Solr 4.3 instance and we have configured ManifoldCF 
to pass web content to the shard collection, but during crawling we 
have noticed many occurrences of this exception:


ERROR - 2013-11-14 15:13:57.954; org.apache.solr.common.SolrException; 
org.apache.solr.common.SolrException: 
org.apache.tika.exception.TikaException: XML parse error
at com.lsegroup.solr.handler.CwsExtractingDocumentLoader.load(CwsExtractingDocumentLoader.java:150)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:242)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:221)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:107)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:76)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:934)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:90)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:515)
at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1012)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:642)
at org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:223)
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1597)
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1555)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.tika.exception.TikaException: XML parse error
at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at com.lsegroup.solr.handler.CwsExtractingDocumentLoader.load(CwsExtractingDocumentLoader.java:147)
... 24 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 91; columnNumber: 105; The element type "img" must be terminated by the matching end-tag "</img>".
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:198)
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:441)
at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:368)
at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1388)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1753)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2951)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:116)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:846)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:775)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:123)
at 

Re: Solr xml img parsing exception

2013-11-14 Thread Marcello Lorenzi

Hi Erik,
but in this case, does the custom loader receive an HTTP 500 error from Solr?

Thanks,
Marcello
On 11/14/2013 04:29 PM, Erik Hatcher wrote:

Also there's a custom loader here that is the culprit:  
com.lsegroup.solr.handler.CwsExtractingDocumentLoader

On Nov 14, 2013, at 10:20, Erick Erickson erickerick...@gmail.com wrote:


It looks like bad data. The XML you're sending to Solr looks malformed, so I
suspect this is completely outside of Solr's purview.

Best,
Erick


On Thu, Nov 14, 2013 at 9:26 AM, Marcello Lorenzi mlore...@sorint.it wrote:


Hi,
I have installed a Solr 4.3 instance and we have configured ManifoldCF to
pass web content to the shard collection, but during crawling we have
noticed many occurrences of this exception:

ERROR - 2013-11-14 15:13:57.954; org.apache.solr.common.SolrException;
org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException:
XML parse error
at com.lsegroup.solr.handler.CwsExtractingDocumentLoader.load(CwsExtractingDocumentLoader.java:150)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:242)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:221)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:107)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:76)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:934)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:90)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:515)
at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1012)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:642)
at org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:223)
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1597)
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1555)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.tika.exception.TikaException: XML parse error
at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at com.lsegroup.solr.handler.CwsExtractingDocumentLoader.load(CwsExtractingDocumentLoader.java:147)
... 24 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 91; columnNumber: 105; The element type "img" must be terminated by the matching end-tag "</img>".
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:198)
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:441)
at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:368)
at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1388)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1753)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2951)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
at com.sun.org.apache.xerces.internal.impl