OK. The main reason for putting it in the connector was to avoid repeating the 
operation for each of the 300 jobs in production.

 

Regards,

From: Karl Wright [mailto:[email protected]]
Sent: Tuesday, January 9, 2018 15:44
To: [email protected]
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

Since the Tika extractor essentially filters out the content mime type (other 
than presenting it as metadata), you need to put an "allowed documents" 
transformation connection into your job pipeline BEFORE the Tika connector:

 

https://manifoldcf.apache.org/release/release-2.9/en_US/end-user-documentation.html#alloweddocuments

 

In fact, mime type exclusion is actually disabled in the Solr output connector 
*unless* you are using the extracting update handler.  That should resolve the 
one problem for you.
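
To illustrate why the order of pipeline stages matters here, the following is a minimal sketch, not ManifoldCF code. The class `PipelineOrderSketch`, the `Doc` type and the stage functions are all hypothetical, and the extraction stage is a deliberate simplification that assumes the original mime type is simply no longer available to later stages.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.function.Function;

public class PipelineOrderSketch {

    /** Hypothetical stand-in for a document travelling through a job pipeline. */
    public static final class Doc {
        public final String mimeType; // null once the original type is no longer carried
        public final long length;

        public Doc(String mimeType, long length) {
            this.mimeType = mimeType;
            this.length = length;
        }
    }

    /** Hypothetical filter stage: rejects excluded mime types and oversized documents. */
    public static Function<Doc, Optional<Doc>> allowedDocuments(Set<String> excluded, long maxLength) {
        return doc -> {
            boolean excludedType = doc.mimeType != null && excluded.contains(doc.mimeType);
            boolean tooLong = doc.length > maxLength;
            return (excludedType || tooLong) ? Optional.<Doc>empty() : Optional.of(doc);
        };
    }

    /** Hypothetical extraction stage (assumption): passes the document on without its mime type. */
    public static Function<Doc, Optional<Doc>> extractText() {
        return doc -> Optional.of(new Doc(null, doc.length));
    }

    /** Runs the stages in order; an empty result means the document was dropped. */
    public static Optional<Doc> run(Doc doc, List<Function<Doc, Optional<Doc>>> stages) {
        Optional<Doc> current = Optional.of(doc);
        for (Function<Doc, Optional<Doc>> stage : stages) {
            current = current.flatMap(stage);
        }
        return current;
    }

    public static void main(String[] args) {
        Set<String> excluded = Collections.singleton("application/msword");
        Function<Doc, Optional<Doc>> filter = allowedDocuments(excluded, 10_000_000L);
        Doc word = new Doc("application/msword", 2048L);

        boolean keptFilterFirst = run(word, Arrays.asList(filter, extractText())).isPresent();
        boolean keptFilterLast = run(word, Arrays.asList(extractText(), filter)).isPresent();

        System.out.println("filter before extraction keeps document: " + keptFilterFirst);
        System.out.println("filter after extraction keeps document:  " + keptFilterLast);
    }
}
```

In this toy model the excluded document is dropped only when the filter runs before the extraction stage; placed afterwards, the filter no longer sees a mime type to match against.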

 

Thanks,

Karl

 

 

On Tue, Jan 9, 2018 at 9:35 AM, msaunier <[email protected]> wrote:

The documents that fail in Tika are:

· Microsoft Word 97-2003

· application/msword

I can't get more information: they are on SCO servers, and SCO does not have 
the ls -lisan or stat commands.

 

For the Solr connection, I believe I emptied the index (both ManifoldCF and 
Solr) before the last indexing run. I will do it again to be sure.

 

 

From: Karl Wright [mailto:[email protected]]
Sent: Tuesday, January 9, 2018 15:26
To: [email protected]
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

CONNECTORS-1482 is for the Solr connector filtering issue.  A question: When 
you changed these fields in the output connection, had you already indexed any 
documents?  Those would only get cleaned up if you did a subsequent full crawl, 
after you made the connection change.

 

Karl

 

 

 

On Tue, Jan 9, 2018 at 9:22 AM, Karl Wright <[email protected]> wrote:

If you let me know what kind of file they are (extension and what application 
created them) that is probably good enough.

Karl

 

On Tue, Jan 9, 2018 at 9:19 AM, msaunier <[email protected]> wrote:

Okay, good. I will see whether I can test Tika version 1.17.

I can't send you a document that triggers this error; they are private. Sorry.

 

If I encounter the error again on a non-private document, I'll come back to you.

 

 

 

From: Karl Wright [mailto:[email protected]]
Sent: Tuesday, January 9, 2018 15:12
To: [email protected]
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

CONNECTORS-1481 is the ticket for the Tika problem.

 

Karl

 

 

On Tue, Jan 9, 2018 at 8:34 AM, Karl Wright <[email protected]> wrote:

Ok, if you are in a position to build trunk, that's a newer version of Tika 
(1.17) which might (or might not) address this problem.

 

If you could create a ticket, I'd greatly appreciate it if you attached one 
document that causes the failure.

 

Thanks!

Karl

 

 

On Tue, Jan 9, 2018 at 8:02 AM, msaunier <[email protected]> wrote:

It's version 2.9.

I have a 2.8.1 installation on another server with the same job and the same 
documents. I will test on that server and report back.

 

Thanks for your help.

 

From: Karl Wright [mailto:[email protected]]
Sent: Tuesday, January 9, 2018 13:15
To: [email protected]
Subject: Re: Document connector excluding mime type and size - Tika Parser error

 

I looked at the history of this.  We had to release a patch (2.8.1) that put 
various poi jars at root level in order to work around a Tika problem.  That 
patch may not have been entirely correct in that it looks like it may have 
blocked access by one of the deeper jars to a higher level.

 

Release 2.9 should fix this if I am correct.

 

Karl

 

 

On Tue, Jan 9, 2018 at 6:39 AM, Karl Wright <[email protected]> wrote:

What version of MCF is this?  That's important to know since Tika has had 
problems with this kind of thing in the past and this looks like something 
similar.

 

The problem you are reporting is due to either a missing jar, or a bug in an 
internal tika classloader.  But I need to know whether this is a current bug or 
not, since we just went to a new Tika version.
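
One way to narrow down which of the two it is would be to check where the class is loaded from and whether the expected method is visible. The following is a minimal sketch using only standard Java reflection. The class name `ClassOriginCheck` and its helper methods are hypothetical, and the check is only meaningful if it runs with the same classpath and classloader setup as the failing process, which a standalone `main` does not guarantee.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;

public class ClassOriginCheck {

    /** Returns the jar or directory a class was loaded from, or a marker if none is recorded. */
    public static String originOf(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        return source == null ? "(bootstrap or unknown)" : String.valueOf(source.getLocation());
    }

    /** Returns the return type of a public no-argument method, or "missing" if it is absent. */
    public static String noArgMethodReturnType(Class<?> cls, String name) {
        try {
            Method method = cls.getMethod(name);
            return method.getReturnType().getName();
        } catch (NoSuchMethodException e) {
            return "missing";
        }
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Defaults use a JDK class so the sketch runs anywhere; pass the class and
        // method under suspicion as arguments when the relevant jars are on the classpath.
        String className = args.length > 0 ? args[0] : "java.lang.String";
        String methodName = args.length > 1 ? args[1] : "length";

        Class<?> cls = Class.forName(className);
        System.out.println(className + " loaded from: " + originOf(cls));
        System.out.println(methodName + "() return type: " + noArgMethodReturnType(cls, methodName));
    }
}
```

A `NoSuchMethodError` names the full method descriptor, including the return type, so a method that exists with a different return type in an older or newer jar would also show up as a mismatch in this output.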

 

Karl

 

 

On Tue, Jan 9, 2018 at 4:32 AM, msaunier <[email protected]> wrote:

Hello Karl,

I hope you are well today.

 

I have 2 problems with ManifoldCF.

 

-----------

In the *Output connections*, with the Solr connector: I have set a "Maximum 
document length" and excluded 5 mime types, but neither has any effect. I have 
attached a screenshot.

 

----------

Second, I have a *Tika exception* in ManifoldCF. 3 documents are blocked:

 

FATAL 2018-01-09T10:19:54,992 (Worker thread '5') - Error tossed: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
java.lang.NoSuchMethodError: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
        at org.apache.tika.parser.microsoft.WMFParser.parse(WMFParser.java:74) ~[?:?]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
        at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) ~[?:?]
        at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:375) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:260) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:205) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:142) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:142) ~[?:?]
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106) ~[?:?]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
        at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74) ~[?:?]
        at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235) ~[?:?]
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226) ~[mcf-agents.jar:?]
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077) ~[mcf-agents.jar:?]
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708) ~[mcf-agents.jar:?]
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756) ~[mcf-agents.jar:?]
        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583) ~[mcf-pull-agent.jar:?]
        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548) ~[mcf-pull-agent.jar:?]
        at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939) ~[?:?]
        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]

 

Do I need to create a ticket?

 

----------

 

Thanks for your help.

 

Regards,