Obviously your Allowed Documents filter is somehow causing all documents to be excluded. Since you have a custom repository connector, I would bet that a coding error in it is responsible.
Karl

On Mon, Aug 20, 2018 at 8:49 AM Nikita Ahuja <[email protected]> wrote:

> Hi Karl,
>
> Thanks for reply.
>
> I am using in the same sequence. The allowed document is added first and
> then the Tika Transformation.
>
> But nothing runs in that scenario. The job simply ends without returning
> anything in the output.
>
> On Mon, Aug 20, 2018 at 5:36 PM, Karl Wright <[email protected]> wrote:
>
>> Hi,
>>
>> You are running out of memory.
>> Tika's memory consumption is not well defined so you will need to limit
>> the size of documents that reach it. This is not the same as limiting the
>> size of documents *after* Tika extracts them.
>>
>> The Allowed Documents transformer therefore should be placed in the
>> pipeline before the Tika Extractor.
>>
>> "Also it is not compatible with the Allowed Documents and Metadata
>> Adjuster Connectors."
>>
>> This is a huge red flag. Why not?
>>
>> Karl
>>
>>
>> On Mon, Aug 20, 2018 at 6:47 AM Nikita Ahuja <[email protected]>
>> wrote:
>>
>>> Hi Karl,
>>>
>>> There is a custom job executing for Aconex in the ManifoldCF
>>> environment. But while executing it is not able to crawl complete set of
>>> documents. It crashes in the middle of the execution.
>>>
>>> Also it is not compatible with the Allowed Documents and Metadata
>>> Adjuster Connectors.
>>>
>>> The custom job created is similar to the existing Jira connector in the
>>> ManifoldCF.
>>>
>>> And it showing this type of error. Please suggest appropriate steps
>>> which needs to be followed to make it smoothly running.
>>>
>>> *Connect to uk1.aconex.co.uk:443 <http://uk1.aconex.co.uk:443> [uk1.aconex.co.uk/---.---.---.--- <http://uk1.aconex.co.uk/---.---.---.--->] failed: Read timed out*
>>> *agents process ran out of memory - shutting down*
>>> *agents process ran out of memory - shutting down*
>>> *agents process ran out of memory - shutting down*
>>> *agents process ran out of memory - shutting down*
>>> *java.lang.OutOfMemoryError: Java heap space*
>>> *java.lang.OutOfMemoryError: Java heap space*
>>> *java.lang.OutOfMemoryError: Java heap space*
>>> *    at org.apache.manifoldcf.core.database.Database.beginTransaction(Database.java:240)*
>>> *    at org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.beginTransaction(DBInterfaceHSQLDB.java:1361)*
>>> *    at org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.beginTransaction(DBInterfaceHSQLDB.java:1327)*
>>> *    at org.apache.manifoldcf.crawler.jobs.JobManager.assessMarkedJobs(JobManager.java:823)*
>>> *    at org.apache.manifoldcf.crawler.system.AssessmentThread.run(AssessmentThread.java:65)*
>>> *java.lang.OutOfMemoryError: Java heap space*
>>> *    at org.apache.pdfbox.pdmodel.graphics.state.PDGraphicsState.clone(PDGraphicsState.java:494)*
>>> *    at org.apache.pdfbox.contentstream.PDFStreamEngine.saveGraphicsState(PDFStreamEngine.java:898)*
>>> *    at org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:721)*
>>> *    at org.apache.pdfbox.contentstream.PDFStreamEngine.showTextString(PDFStreamEngine.java:587)*
>>> *    at org.apache.pdfbox.contentstream.operator.text.ShowText.process(ShowText.java:55)*
>>> *    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838)*
>>> *    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:495)*
>>> *    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:469)*
>>> *    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)*
>>> *    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)*
>>> *    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)*
>>> *    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)*
>>> *    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)*
>>> *    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)*
>>> *    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)*
>>> *    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:168)*
>>> *    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)*
>>> *    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)*
>>> *    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)*
>>> *    at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)*
>>> *    at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235)*
>>> *    at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226)*
>>> *    at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077)*
>>> *    at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708)*
>>> *    at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756)*
>>> *    at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583)*
>>> *    at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548)*
>>> *    at org.apache.manifoldcf.crawler.connectors.aconex.AconexSession.fetchAndIndexFile(AconexSession.java:720)*
>>> *    at org.apache.manifoldcf.crawler.connectors.aconex.AconexRepositoryConnector.processDocuments(AconexRepositoryConnector.java:1194)*
>>> *    at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)*
>>> *[Thread-431] INFO org.eclipse.jetty.server.ServerConnector - Stopped ServerConnector@2c0b4c83{HTTP/1.1}{0.0.0.0:8345 <http://0.0.0.0:8345>}*
>>> *[Thread-431] INFO org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.w.WebAppContext@4c03a37{/mcf-api-service,file:/C:/Users/smartshore/AppData/Local/Temp/jetty-0.0.0.0-8345-mcf-api-service.war-_mcf-api-service-any-3117653580650249372.dir/webapp/,UNAVAILABLE}{D:\Manifold\apache-manifoldcf-2.8.1\example\.\..\web\war\mcf-api-service.war}*
>>> *[Thread-431] INFO org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.w.WebAppContext@65ae095c{/mcf-authority-service,file:/C:/Users/smartshore/AppData/Local/Temp/jetty-0.0.0.0-8345-mcf-authority-service.war-_mcf-authority-service-any-8288503227579256193.dir/webapp/,UNAVAILABLE}{D:\Manifold\apache-manifoldcf-2.8.1\example\.\..\web\war\mcf-authority-service.war}*
>>> *Connect to uk1.aconex.co.uk:443 <http://uk1.aconex.co.uk:443> [uk1.aconex.co.uk/23.10.35.84 <http://uk1.aconex.co.uk/23.10.35.84>] failed: Read timed out*
>>> --
>>> Thanks and Regards,
>>> Nikita
>>> Email: [email protected]
>>> United Sources Service Pvt. Ltd.
>>> a "Smartshore" Company
>>> Mobile: +91 99 888 57720
>>> http://www.smartshore.nl
>>>
>>
>
> --
> Thanks and Regards,
> Nikita
> Email: [email protected]
> United Sources Service Pvt. Ltd.
> a "Smartshore" Company
> Mobile: +91 99 888 57720
> http://www.smartshore.nl
>
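
The ordering advice in the thread (limit document size before the extractor sees the document) can be illustrated with a minimal, self-contained sketch. This is not ManifoldCF code: `Doc`, `run`, and `maxBytes` are hypothetical names invented for illustration, and the "extractor" merely records which documents would have been parsed. It also shows one way the reported symptom could arise: if the filter rejects every document, for example because the limit is too low or the connector reports sizes incorrectly, the job finishes with no output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PipelineOrderSketch {

    /** Hypothetical stand-in for a crawled document; only name and size matter here. */
    public static final class Doc {
        final String name;
        final long sizeBytes;

        public Doc(String name, long sizeBytes) {
            this.name = name;
            this.sizeBytes = sizeBytes;
        }
    }

    /**
     * Applies the size filter first, then hands accepted documents to a
     * stand-in extractor. Returns the names of the documents that reached
     * the extractor. Oversized documents are skipped before any parsing,
     * which is the point of placing the filter ahead of the extractor.
     */
    public static List<String> run(List<Doc> docs, long maxBytes) {
        List<String> extracted = new ArrayList<>();
        for (Doc doc : docs) {
            if (doc.sizeBytes > maxBytes) {
                continue; // rejected by the filter; the extractor never sees it
            }
            extracted.add(doc.name); // stand-in for the expensive extraction step
        }
        return extracted;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
                new Doc("small.pdf", 1_000L),
                new Doc("huge.pdf", 500_000_000L));

        // A sensible limit lets the small document through and skips the huge one.
        System.out.println(run(docs, 10_000_000L)); // prints [small.pdf]

        // A limit that no document satisfies excludes everything: nothing is indexed.
        System.out.println(run(docs, 0L)); // prints []
    }
}
```

When every document is filtered out, as in the second call, checking what size and type values the custom connector actually hands to the pipeline is a reasonable first diagnostic step.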
