Hi Nikita,

This code loads the entire document into memory:

>>>>>>

          String document = getAconexBody(aconexFile);
          try
          {
            byte[] documentBytes = document.getBytes(StandardCharsets.UTF_8);
            long fileLength = documentBytes.length;

            if (!activities.checkLengthIndexable(fileLength))
            {
              errorCode = activities.EXCLUDED_LENGTH;
              errorDesc = "Excluded because of document length
("+fileLength+")";
              activities.noDocument(documentIdentifier, versionString);
              continue;
            }

<<<<<<


Obviously that will not work.  You will need to find another way.

Since it appears you already have the content in a file, why not just
stream from that file?  You are currently reading it as UTF-8, but you
could just stream it as binary without loading it into memory at all.
Open it as a FileInputStream and hand that stream to your
RepositoryDocument.  Make sure you have a try/finally that closes the
stream after the index method has been invoked.
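
For example, something along these lines (just a rough sketch, not
tested; it assumes aconexFile is a java.io.File that already exists on
disk, that documentURI is whatever URI your loop already computes, and
that documentIdentifier, versionString, and the surrounding
processDocuments() loop are the same as in your snippet above, with
error handling and your existing metadata code omitted):

>>>>>>

          // Length comes from the file itself, so nothing is loaded into memory.
          long fileLength = aconexFile.length();

          if (!activities.checkLengthIndexable(fileLength))
          {
            errorCode = activities.EXCLUDED_LENGTH;
            errorDesc = "Excluded because of document length ("+fileLength+")";
            activities.noDocument(documentIdentifier, versionString);
            continue;
          }

          RepositoryDocument rd = new RepositoryDocument();
          InputStream is = new FileInputStream(aconexFile);
          try
          {
            // Hand the open stream to the framework; the pipeline reads
            // from it as it indexes, so the whole document is never held
            // in memory at once.
            rd.setBinary(is, fileLength);
            activities.ingestDocumentWithException(documentIdentifier, versionString, documentURI, rd);
          }
          finally
          {
            // Only close once the index call has returned.
            is.close();
          }

<<<<<<

Fill in whatever metadata and ACLs you set today before the ingest call;
the important part is that the binary content goes in as a stream backed
by the file, not as a String or a byte array.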


Karl



On Wed, Aug 29, 2018 at 5:18 AM Nikita Ahuja <[email protected]> wrote:

> Hi Karl,
>
>
> Yes, the documents are being ingested into the output connector without
> any error. But after processing about 2k-3k documents the service
> crashes and displays an "Out Of Memory" message.
>
> The checkLengthIndexable() method is called first, before the document
> is ingested.
>
> Please have a look at the attachment for the methods which might be the
> problem area.
>
> On Wed, Aug 29, 2018 at 1:44 PM, Karl Wright <[email protected]> wrote:
>
>> So the Allowed Document transformer is now working, and your connector is
>> now skipping documents that are too large, correct?  But you are still
>> seeing out of memory errors?
>>
>> Does your connector load the entire document into memory before it calls
>> checkLengthIndexable()?  Because if it does, that will not work.  There is
>> a reason that connectors are constructed to stream data in MCF.
>>
>> It might be faster to diagnose your problem if you made the source code
>> available so that I could audit it.
>>
>> Karl
>>
>>
>> On Wed, Aug 29, 2018 at 2:42 AM Nikita Ahuja <[email protected]>
>> wrote:
>>
>>> Hi Karl,
>>>
>>> The result for both the length and the checkLengthIndexable() method is
>>> the same. And the Allowed Documents transformer is also working. But the
>>> main problem is that the service crashes and displays a memory leak error
>>> every time after crawling a few documents.
>>>
>>>
>>>
>>> On Tue, Aug 28, 2018 at 6:48 PM, Karl Wright <[email protected]> wrote:
>>>
>>>> Can you add logging messages to your connector to log (1) the length
>>>> that it sees, and (2) the result of checkLengthIndexable()?  And then,
>>>> please once again add the Allowed Documents transformer and set a
>>>> reasonable document length.  Run the job and see why it is rejecting your
>>>> documents.
>>>>
>>>> All of our shipping connectors use this logic and it does work, so I am
>>>> rather certain that the problem is in your connector.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>> On Tue, Aug 28, 2018 at 8:54 AM Nikita Ahuja <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Karl,
>>>>>
>>>>> Thank you for the valuable suggestion.
>>>>>
>>>>> The checkLengthIndexable() method is also used in the code and it is
>>>>> returning the exact document length.
>>>>>
>>>>> The garbage collector and thread disposal are also used.
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Aug 28, 2018 at 5:44 PM, Karl Wright <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I don't see checkLengthIndexable() in this list.  You need to add
>>>>>> that if you want your connector to be able to not try and index documents
>>>>>> that are too big.
>>>>>>
>>>>>> You said before that when you added the Allowed Documents transformer
>>>>>> to the chain it removed ALL documents, so I suspect it's there but you 
>>>>>> are
>>>>>> not sending in the actual document length.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 28, 2018 at 8:10 AM Nikita Ahuja <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Karl,
>>>>>>>
>>>>>>> These methods are already in use in the connector code where the
>>>>>>> file needs to be read and ingested into the output.
>>>>>>>
>>>>>>> (!activities.checkURLIndexable(fileUrl))
>>>>>>> (!activities.checkMimeTypeIndexable(contentType))
>>>>>>> (!activities.checkDateIndexable(modifiedDate))
>>>>>>>
>>>>>>>
>>>>>>> But the service crashes after crawling approximately 2000 documents.
>>>>>>>
>>>>>>> I think something else is hitting it and creating the problem.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Aug 24, 2018 at 8:33 PM, Karl Wright <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Nikita,
>>>>>>>>
>>>>>>>> Until you fix your connector, nothing can be done to address your
>>>>>>>> Out Of Memory problem.
>>>>>>>>
>>>>>>>> The problem is that you are not calling the following
>>>>>>>> IProcessActivity method:
>>>>>>>>
>>>>>>>>   /** Check whether a document of a specific length is indexable by
>>>>>>>> the currently specified output connector.
>>>>>>>>   *@param length is the document length.
>>>>>>>>   *@return true if the document is indexable.
>>>>>>>>   */
>>>>>>>>   public boolean checkLengthIndexable(long length)
>>>>>>>>     throws ManifoldCFException, ServiceInterruption;
>>>>>>>>
>>>>>>>> Your connector should call this and honor the response.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Aug 24, 2018 at 9:55 AM Nikita Ahuja <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Karl,
>>>>>>>>>
>>>>>>>>> I have checked for a coding error; there is nothing like that, as
>>>>>>>>> "Allowed Documents" is working fine for the same code on the other
>>>>>>>>> system.
>>>>>>>>>
>>>>>>>>> But the main issue now being faced is the shutting down of
>>>>>>>>> ManifoldCF, and it shows "java.lang.OutOfMemoryError: GC overhead
>>>>>>>>> limit exceeded" on the system.
>>>>>>>>>
>>>>>>>>> PostgreSQL is being used for ManifoldCF and the memory allotted to
>>>>>>>>> the system is generous, but this issue still occurs very frequently.
>>>>>>>>> Throttling (2) and a worker thread count of 45 have also been
>>>>>>>>> checked, and per the documentation different values have been tried.
>>>>>>>>>
>>>>>>>>> Please suggest the possible problem area and the steps to be taken.
>>>>>>>>>
>>>>>>>>> On Mon, Aug 20, 2018 at 7:30 PM, Karl Wright <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Obviously your Allowed Documents filter is somehow causing all
>>>>>>>>>> documents to be excluded.  Since you have a custom repository 
>>>>>>>>>> connector I
>>>>>>>>>> would bet there is a coding error in it that is responsible.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 20, 2018 at 8:49 AM Nikita Ahuja <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for reply.
>>>>>>>>>>>
>>>>>>>>>>> I am using them in the same sequence: the Allowed Documents
>>>>>>>>>>> transformer is added first and then the Tika transformation.
>>>>>>>>>>>
>>>>>>>>>>> But nothing runs in that scenario. The job simply ends without
>>>>>>>>>>> returning anything in the output.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 20, 2018 at 5:36 PM, Karl Wright <[email protected]
>>>>>>>>>>> > wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> You are running out of memory.
>>>>>>>>>>>> Tika's memory consumption is not well defined so you will need
>>>>>>>>>>>> to limit the size of documents that reach it.  This is not the 
>>>>>>>>>>>> same as
>>>>>>>>>>>> limiting the size of documents *after* Tika extracts them.
>>>>>>>>>>>>
>>>>>>>>>>>> The Allowed Documents transformer therefore should be placed in
>>>>>>>>>>>> the pipeline before the Tika Extractor.
>>>>>>>>>>>>
>>>>>>>>>>>> "Also it is not compatible with the Allowed Documents and
>>>>>>>>>>>> Metadata Adjuster Connectors."
>>>>>>>>>>>>
>>>>>>>>>>>> This is a huge red flag.  Why not?
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Aug 20, 2018 at 6:47 AM Nikita Ahuja <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>
>>>>>>>>>>>>> There is a custom job running for Aconex in the ManifoldCF
>>>>>>>>>>>>> environment, but while executing it is not able to crawl the
>>>>>>>>>>>>> complete set of documents; it crashes in the middle of the
>>>>>>>>>>>>> execution.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Also, it is not compatible with the Allowed Documents and
>>>>>>>>>>>>> Metadata Adjuster connectors.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The custom job is similar to the existing Jira connector in
>>>>>>>>>>>>> ManifoldCF.
>>>>>>>>>>>>>
>>>>>>>>>>>>> It is showing the error below. Please suggest the appropriate
>>>>>>>>>>>>> steps to follow to make it run smoothly.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Connect to uk1.aconex.co.uk:443 [uk1.aconex.co.uk/---.---.---.---] failed: Read timed out
>>>>>>>>>>>>> agents process ran out of memory - shutting down
>>>>>>>>>>>>> agents process ran out of memory - shutting down
>>>>>>>>>>>>> agents process ran out of memory - shutting down
>>>>>>>>>>>>> agents process ran out of memory - shutting down
>>>>>>>>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>>>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>>>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>>>>>>>>         at org.apache.manifoldcf.core.database.Database.beginTransaction(Database.java:240)
>>>>>>>>>>>>>         at org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.beginTransaction(DBInterfaceHSQLDB.java:1361)
>>>>>>>>>>>>>         at org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.beginTransaction(DBInterfaceHSQLDB.java:1327)
>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.jobs.JobManager.assessMarkedJobs(JobManager.java:823)
>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.system.AssessmentThread.run(AssessmentThread.java:65)
>>>>>>>>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>>>>>>>>         at org.apache.pdfbox.pdmodel.graphics.state.PDGraphicsState.clone(PDGraphicsState.java:494)
>>>>>>>>>>>>>         at org.apache.pdfbox.contentstream.PDFStreamEngine.saveGraphicsState(PDFStreamEngine.java:898)
>>>>>>>>>>>>>         at org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:721)
>>>>>>>>>>>>>         at org.apache.pdfbox.contentstream.PDFStreamEngine.showTextString(PDFStreamEngine.java:587)
>>>>>>>>>>>>>         at org.apache.pdfbox.contentstream.operator.text.ShowText.process(ShowText.java:55)
>>>>>>>>>>>>>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838)
>>>>>>>>>>>>>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:495)
>>>>>>>>>>>>>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:469)
>>>>>>>>>>>>>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
>>>>>>>>>>>>>         at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
>>>>>>>>>>>>>         at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
>>>>>>>>>>>>>         at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
>>>>>>>>>>>>>         at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
>>>>>>>>>>>>>         at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>>>>>>>>>>>>>         at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
>>>>>>>>>>>>>         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:168)
>>>>>>>>>>>>>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>>>>>>>>>>>>>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>>>>>>>>>>>>>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235)
>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226)
>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077)
>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708)
>>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756)
>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583)
>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548)
>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.aconex.AconexSession.fetchAndIndexFile(AconexSession.java:720)
>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.aconex.AconexRepositoryConnector.processDocuments(AconexRepositoryConnector.java:1194)
>>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>>>>>>>>>>>>> [Thread-431] INFO org.eclipse.jetty.server.ServerConnector - Stopped ServerConnector@2c0b4c83{HTTP/1.1}{0.0.0.0:8345}
>>>>>>>>>>>>> [Thread-431] INFO org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.w.WebAppContext@4c03a37{/mcf-api-service,file:/C:/Users/smartshore/AppData/Local/Temp/jetty-0.0.0.0-8345-mcf-api-service.war-_mcf-api-service-any-3117653580650249372.dir/webapp/,UNAVAILABLE}{D:\Manifold\apache-manifoldcf-2.8.1\example\.\..\web\war\mcf-api-service.war}
>>>>>>>>>>>>> [Thread-431] INFO org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.w.WebAppContext@65ae095c{/mcf-authority-service,file:/C:/Users/smartshore/AppData/Local/Temp/jetty-0.0.0.0-8345-mcf-authority-service.war-_mcf-authority-service-any-8288503227579256193.dir/webapp/,UNAVAILABLE}{D:\Manifold\apache-manifoldcf-2.8.1\example\.\..\web\war\mcf-authority-service.war}
>>>>>>>>>>>>> Connect to uk1.aconex.co.uk:443 [uk1.aconex.co.uk/23.10.35.84] failed: Read timed out
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Thanks and Regards,
>>>>>>>>>>>>> Nikita
>>>>>>>>>>>>> Email: [email protected]
>>>>>>>>>>>>> United Sources Service Pvt. Ltd.
>>>>>>>>>>>>> a "Smartshore" Company
>>>>>>>>>>>>> Mobile: +91 99 888 57720
>>>>>>>>>>>>> http://www.smartshore.nl
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Thanks and Regards,
>>>>>>>>>>> Nikita
>>>>>>>>>>> Email: [email protected]
>>>>>>>>>>> United Sources Service Pvt. Ltd.
>>>>>>>>>>> a "Smartshore" Company
>>>>>>>>>>> Mobile: +91 99 888 57720
>>>>>>>>>>> http://www.smartshore.nl
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Thanks and Regards,
>>>>>>>>> Nikita
>>>>>>>>> Email: [email protected]
>>>>>>>>> United Sources Service Pvt. Ltd.
>>>>>>>>> a "Smartshore" Company
>>>>>>>>> Mobile: +91 99 888 57720
>>>>>>>>> http://www.smartshore.nl
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Thanks and Regards,
>>>>>>> Nikita
>>>>>>> Email: [email protected]
>>>>>>> United Sources Service Pvt. Ltd.
>>>>>>> a "Smartshore" Company
>>>>>>> Mobile: +91 99 888 57720
>>>>>>> http://www.smartshore.nl
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks and Regards,
>>>>> Nikita
>>>>> Email: [email protected]
>>>>> United Sources Service Pvt. Ltd.
>>>>> a "Smartshore" Company
>>>>> Mobile: +91 99 888 57720
>>>>> http://www.smartshore.nl
>>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks and Regards,
>>> Nikita
>>> Email: [email protected]
>>> United Sources Service Pvt. Ltd.
>>> a "Smartshore" Company
>>> Mobile: +91 99 888 57720
>>> http://www.smartshore.nl
>>>
>>
>
>
> --
> Thanks and Regards,
> Nikita
> Email: [email protected]
> United Sources Service Pvt. Ltd.
> a "Smartshore" Company
> Mobile: +91 99 888 57720
> http://www.smartshore.nl
>
