I have created a ticket (CONNECTORS-1444) to track this issue, and attached a fix. I've also committed the fix to trunk.

The fix is not the code change you made. Instead, it introduces a new kind of DocumentumException: CORRUPTEDDOCUMENT. This will be thrown whenever permanent document corruption is detected, and will cause the document to be skipped and not indexed.

The "DM_SYSOBJECT_E_CONTENT_UNAVAILABLE_PARKED" error should cause the connector to retry the document at a later time, so if this is indeed not a permanent error, no special fix should be required.

Please let me know if the fix I have committed works for you.

Karl

On Fri, Jul 14, 2017 at 5:41 AM, Tamizh Kumaran Thamizharasan <[email protected]> wrote:

> Hi Karl,
>
> Sorry for not explaining the issue in detail.
>
> (1) Is it likely to go away or not on a retry;
>
> The DM_PLATFORM_E_INTEGER_CONVERSION_ERROR and
> DM_OBJECT_E_LOAD_INVALID_STRING_LEN errors are not likely to go away on
> an immediate retry.
>
> (2) Does it substantially impact the ability of ManifoldCF to properly
> process the document;
>
> The impact is that someone needs to monitor the indexing and, if it
> stops on these errors, use restart-minimal to start the indexing again.
>
> (3) Is it generally acceptable to skip ALL documents where the error
> occurs.
>
> Yes. These errors occur for a large number of documents, and it is
> difficult for the user to keep restarting the indexing.
>
> Total document count: 700000+
> DM_OBJECT_E_LOAD_INVALID_STRING_LEN: 11147
> DM_PLATFORM_E_INTEGER_CONVERSION_ERROR: 21708
>
> I'm not sure whether these errors are common in Documentum or are due
> to improper Documentum configuration/maintenance. We have encountered
> them on a couple of Documentum instances in lower environments (not
> validated on production).
>
> The Documentum repository errors DM_PLATFORM_E_INTEGER_CONVERSION_ERROR
> and DM_OBJECT_E_LOAD_INVALID_STRING_LEN are of type DfException, thrown
> from the getObjectByQualification method in
> org.apache.manifoldcf.crawler.common.DCTM.DocumentumImpl.
>
> We made a fix to print the error to the log (Documentum server process)
> and return null.
>
>     catch (DfException e)
>     {
>       e.printStackTrace();
>       return null;
>       //throw new DocumentumException("Documentum error: "+e.getMessage());
>     }
>
> In the run() method of the ProcessDocumentThread inner class in the
> org.apache.manifoldcf.crawler.connectors.DCTM.DCTM file, we added a null
> check to continue with the document processing.
>
>     try
>     {
>       IDocumentumObject object = session.getObjectByQualification("dm_document where i_chronicle_id='" + documentIdentifier +
>         "' and any r_version_label='CURRENT'");
>       if (object != null) {
>         …
>       }
>     }
>     catch (Throwable e)
>     {
>       this.exception = e;
>     }
>
> The [DM_SYSOBJECT_E_CONTENT_UNAVAILABLE_PARKED] error occurs very
> rarely. It happens because an uploaded document is parked in an interim
> BOCS and moved to the repository a short time later.
>
> If indexing happens during that gap, the properties are accessible but
> the document content is not available, which causes the error. The fix
> for this is not yet complete.
>
> The code that causes this error is shared below.
>
> The run() method of the ProcessDocumentThread inner class in
> org.apache.manifoldcf.crawler.connectors.DCTM.DCTM:
>
>     try
>     {
>       strFilePath = object.getFile(objFileTemp.getCanonicalPath());
>     }
>     catch (DocumentumException dfe)
>     {
>       // Fetch failed, so log it
>       activityStatus = "NOCONTENT";
>       activityMessage = dfe.getMessage();
>       if (dfe.getType() != DocumentumException.TYPE_NOTALLOWED)
>         throw dfe;
>       return;
>     }
>
> The getFile method in
> org.apache.manifoldcf.crawler.common.DCTM.DocumentumObjectImpl:
>
>     catch (DfException dfe)
>     {
>       // Can't decide what to do without looking at the exception text.
>       // This is crappy but it's the best we can manage, apparently.
>       String errorMessage = dfe.getMessage();
>       if (errorMessage.indexOf("[DM_CONTENT_E_CANT_START_PULL]") == -1)
>         // Treat it as transient, and retry
>         throw new DocumentumException(dfe.getMessage(), DocumentumException.TYPE_SERVICEINTERRUPTION);
>       // It's probably not a transient error. Report it as an access violation, even though it
>       // may well not be. We don't have much info as to what's happening.
>       throw new DocumentumException(dfe.getMessage(), DocumentumException.TYPE_NOTALLOWED);
>     }
>
> Discarding uncrawlable documents and continuing with the indexing
> process makes more sense than stalling it. If you feel it is good to
> include, kindly make the required coding exception.
>
> Regards,
> Tamizh Kumaran Thamizharasan
>
> From: Karl Wright [mailto:[email protected]]
> Sent: Friday, July 14, 2017 12:36 PM
> To: [email protected]
> Cc: Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
> Subject: Re: Documentum job stops on error
>
> Hi Tamizh,
>
> For any repository errors, ManifoldCF needs to know the following:
>
> (1) Is it likely to go away or not on a retry;
> (2) Does it substantially impact the ability of ManifoldCF to properly
> process the document;
> (3) Is it generally acceptable to skip ALL documents where the error
> occurs.
>
> In this case your underlying error seems quite worrying:
>
> [DM_SYSOBJECT_E_CONTENT_UNAVAILABLE_PARKED]error: "The content is
> temporarily parked on a BOCS server host. It will be available when it
> is moved to a permanent storage area."
>
> I could imagine that many or most documents are in fact in that state,
> in which case nothing can really be crawled?
>
> I'm happy to make coding exceptions in the Documentum connector for
> discarding uncrawlable documents, but only if it makes sense to do that.
> Here it is not clear at all that we'd want to change MCF to throw away
> all documents with this problem. It sounds instead like there's some
> significant Documentum configuration issue to me.
>
> Thanks,
> Karl
>
> On Fri, Jul 14, 2017 at 2:39 AM, Tamizh Kumaran Thamizharasan
> <[email protected]> wrote:
>
> Hi Team,
>
> The behavior below is observed when using the ManifoldCF Documentum
> connector.
>
> - On any Documentum-specific error, the application throws the error
>   and the job stops abruptly. Is there any specific reason for this
>   approach?
>
> Can we handle these errors by logging them, ignoring the document, and
> continuing the indexing?
>
> Please find below sample errors that cause the job to fail.
>
> Documentum error: [DM_PLATFORM_E_INTEGER_CONVERSION_ERROR]error: "The
> server was unable to convert the following string (String Unavailable)
> to an integer or long."
>
> Caused by: org.apache.manifoldcf.crawler.common.DCTM.DocumentumException:
> Documentum error: [DM_OBJECT_E_LOAD_INVALID_STRING_LEN]error: "Error
> loading object: invalid string length 0 found in input stream"
>
> Error: Repeated service interruptions - failure processing document:
> [DM_SYSOBJECT_E_CONTENT_UNAVAILABLE_PARKED]error: "The content is
> temporarily parked on a BOCS server host. It will be available when it
> is moved to a permanent storage area."
>
> Kindly provide your suggestion on this.
>
> Regards,
> Tamizh Kumaran Thamizharasan
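For illustration only, the distinction discussed in this thread (skip documents with permanent corruption, retry documents whose content is only temporarily unavailable) could be sketched roughly as below. This is not the code committed for CONNECTORS-1444. The class name DctmErrorClassifier, the Action enum, and the decision to stop the job on unrecognized errors are all hypothetical; only the three Documentum error codes come from the messages above.

```java
// Hypothetical sketch, not the actual ManifoldCF Documentum connector code.
final class DctmErrorClassifier {

    /** What a crawler might do with a document after a repository error (assumed names). */
    enum Action { SKIP_DOCUMENT, RETRY_LATER, ABORT_JOB }

    // Error codes reported in this thread as not going away on retry.
    private static final String[] PERMANENT_MARKERS = {
        "[DM_PLATFORM_E_INTEGER_CONVERSION_ERROR]",
        "[DM_OBJECT_E_LOAD_INVALID_STRING_LEN]"
    };

    // Error code reported in this thread as temporary (content parked on a BOCS host).
    private static final String[] TRANSIENT_MARKERS = {
        "[DM_SYSOBJECT_E_CONTENT_UNAVAILABLE_PARKED]"
    };

    /** Classifies a Documentum error message by the error code it contains. */
    static Action classify(String errorMessage) {
        if (errorMessage == null) {
            return Action.ABORT_JOB;
        }
        for (String marker : PERMANENT_MARKERS) {
            if (errorMessage.contains(marker)) {
                return Action.SKIP_DOCUMENT;   // permanent corruption: skip, do not index
            }
        }
        for (String marker : TRANSIENT_MARKERS) {
            if (errorMessage.contains(marker)) {
                return Action.RETRY_LATER;     // temporary condition: try again later
            }
        }
        // Assumption: anything unrecognized is still surfaced rather than silently dropped.
        return Action.ABORT_JOB;
    }
}
```

Matching on message text is fragile, as the comment in the quoted getFile code already notes, so a sketch like this is at best a starting point for deciding which errors deserve their own exception type.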
