Any news?

Karl

On Tue, Jan 9, 2018 at 1:10 PM, Karl Wright <[email protected]> wrote:
> Let me know what happens.
> If it works for you, I'll see if we can put together a patch release of
> 2.9 with the fix.
>
> Karl
>
> On Tue, Jan 9, 2018 at 11:07 AM, msaunier <[email protected]> wrote:
>
>> Test by checking out and building with POI 3.17 and Tika 1.17?
>>
>> It's possible.
>>
>> I am finishing a project, then I will test that.
>>
>> *From:* Karl Wright [mailto:[email protected]]
>> *Sent:* Tuesday, January 9, 2018 16:57
>> *To:* [email protected]
>> *Subject:* Re: Document connector excluding mime type and size - Tika
>> Parser error
>>
>> So here's the problem; we used POI 3.17 with Tika 1.16 in 2.9, in order
>> to deal with the classloader issue present in POI 3.15, and because POI
>> 3.16 has a severe security issue that made it impossible to ship with.
>>
>> Unfortunately that doesn't quite work; POI 3.17 is not backwards
>> compatible with 3.16 completely and therefore problems occur with this
>> combination.
>>
>> The probable solution is to check out and build trunk and see if that
>> works for you. It very well might. The question then is what to do next,
>> because we are not scheduled to release again until April. We might have
>> to do a point release to deal with this.
>>
>> Please give it a try and let me know what happens.
>>
>> Thanks,
>> Karl
>>
>> On Tue, Jan 9, 2018 at 10:29 AM, Karl Wright <[email protected]> wrote:
>>
>> Ok, never mind that last email. We patched it in part in 2.9 by
>> including the latest POI. So clearly it's still an existing problem in
>> POI. I'll have to open a ticket there and await a patch from them.
>>
>> Karl
>>
>> On Tue, Jan 9, 2018 at 10:27 AM, Karl Wright <[email protected]> wrote:
>>
>> This screenshot cannot be MCF 2.9 since the version of poi was not 3.17
>> for the 2.9 release.
>>
>> Karl
>>
>> On Tue, Jan 9, 2018 at 10:02 AM, msaunier <[email protected]> wrote:
>>
>> The two versions (2.8.1 and 2.9) of ManifoldCF are on two different
>> servers.
>>
>> *From:* Karl Wright [mailto:[email protected]]
>> *Sent:* Tuesday, January 9, 2018 15:54
>> *To:* [email protected]
>> *Subject:* Re: Document connector excluding mime type and size - Tika
>> Parser error
>>
>> As for the Tika issue, we explicitly tested documents of that type when
>> rolling out 2.8.1. When we updated 2.8.1 to a new Tika in 2.9 I believe we
>> also tested this.
>>
>> One of the potential issues is that if you are dropping down different
>> versions of ManifoldCF into the same directories you *might* have a poi*
>> jar in the wrong place because of the way we had to do the patch. Please
>> have a look at where the poi* jars are in your directory structure; they
>> should all be in one directory (connector-common-lib). If you see any
>> anywhere else, that's the cause of the issue.
>>
>> Karl
>>
>> On Tue, Jan 9, 2018 at 9:43 AM, Karl Wright <[email protected]> wrote:
>>
>> Since the Tika extractor essentially filters out the content mime type
>> (other than presenting it as metadata), you need to put an "allowed
>> documents" transformation connection into your job pipeline BEFORE the Tika
>> connector:
>>
>> https://manifoldcf.apache.org/release/release-2.9/en_US/end-user-documentation.html#alloweddocuments
>>
>> In fact, mime type exclusion is actually disabled in the Solr output
>> connector *unless* you are using the extracting update handler. That
>> should resolve the one problem for you.
>>
>> Thanks,
>> Karl
>>
>> On Tue, Jan 9, 2018 at 9:35 AM, msaunier <[email protected]> wrote:
>>
>> The documents that fail in Tika are:
>>
>> · Microsoft Word 97-2003
>> · Application/msword
>>
>> I can't get more information; they are on SCO servers, and SCO does not
>> have the ls -lisan or stat commands.
>>
>> For the Solr connection, I believe I emptied the index before the last
>> indexing run (ManifoldCF and Solr). I will do it again to be sure.
>>
>> *From:* Karl Wright [mailto:[email protected]]
>> *Sent:* Tuesday, January 9, 2018 15:26
>> *To:* [email protected]
>> *Subject:* Re: Document connector excluding mime type and size - Tika
>> Parser error
>>
>> CONNECTORS-1482 is for the Solr connector filtering issue. A question:
>> When you changed these fields in the output connection, had you already
>> indexed any documents? Those would only get cleaned up if you did a
>> subsequent full crawl, after you made the connection change.
>>
>> Karl
>>
>> On Tue, Jan 9, 2018 at 9:22 AM, Karl Wright <[email protected]> wrote:
>>
>> If you let me know what kind of file they are (extension and what
>> application created them) that is probably good enough.
>>
>> Karl
>>
>> On Tue, Jan 9, 2018 at 9:19 AM, msaunier <[email protected]> wrote:
>>
>> Okay, good. I will see if I can test Tika version 1.17.
>>
>> I can't send a document that triggers this error; they are private. Sorry.
>>
>> If I encounter the error again on a non-private document, I'll come back
>> to you.
>>
>> *From:* Karl Wright [mailto:[email protected]]
>> *Sent:* Tuesday, January 9, 2018 15:12
>> *To:* [email protected]
>> *Subject:* Re: Document connector excluding mime type and size - Tika
>> Parser error
>>
>> CONNECTORS-1481 is the ticket for the Tika problem.
>>
>> Karl
>>
>> On Tue, Jan 9, 2018 at 8:34 AM, Karl Wright <[email protected]> wrote:
>>
>> Ok, if you are in a position to build trunk, that's a newer version of
>> Tika (1.17) which might (or might not) address this problem.
>>
>> If you could create a ticket, I'd greatly appreciate it if you attached
>> one document to it that causes the failure.
>>
>> Thanks!
>> Karl
>>
>> On Tue, Jan 9, 2018 at 8:02 AM, msaunier <[email protected]> wrote:
>>
>> It's version 2.9.
>>
>> I have 2.8.1 on another server with the same job and the same documents. I
>> will test on that server and report back.
>>
>> Thanks for your help.
>>
>> *From:* Karl Wright [mailto:[email protected]]
>> *Sent:* Tuesday, January 9, 2018 13:15
>> *To:* [email protected]
>> *Subject:* Re: Document connector excluding mime type and size - Tika
>> Parser error
>>
>> I looked at the history of this. We had to release a patch (2.8.1) that
>> put various poi jars at root level in order to work around a Tika problem.
>> That patch may not have been entirely correct in that it looks like it may
>> have blocked access by one of the deeper jars to a higher level.
>>
>> Release 2.9 should fix this if I am correct.
>>
>> Karl
>>
>> On Tue, Jan 9, 2018 at 6:39 AM, Karl Wright <[email protected]> wrote:
>>
>> What version of MCF is this? That's important to know since Tika has had
>> problems with this kind of thing in the past and this looks like something
>> similar.
>>
>> The problem you are reporting is due to either a missing jar, or a bug in
>> an internal tika classloader. But I need to know whether this is a current
>> bug or not, since we just went to a new Tika version.
>>
>> Karl
>>
>> On Tue, Jan 9, 2018 at 4:32 AM, msaunier <[email protected]> wrote:
>>
>> Hello Karl,
>>
>> I hope you are well today.
>>
>> I have 2 problems with ManifoldCF.
>>
>> -----------
>>
>> In **Output connectors**, with the Solr connector: I have set a "Maximum
>> document length" and I have "Excluded 5 mime types", but it does not work.
>> I attach a screenshot.
>>
>> ----------
>>
>> Second, I have a **Tika exception** in ManifoldCF. 3 documents are
>> blocked:
>>
>> FATAL 2018-01-09T10:19:54,992 (Worker thread '5') - Error tossed: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
>> java.lang.NoSuchMethodError: org.apache.poi.hwmf.record.HwmfFont.getCharSet()Lorg/apache/poi/hwmf/record/HwmfFont$WmfCharset;
>>         at org.apache.tika.parser.microsoft.WMFParser.parse(WMFParser.java:74) ~[?:?]
>>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
>>         at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) ~[?:?]
>>         at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) ~[?:?]
>>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:375) ~[?:?]
>>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:260) ~[?:?]
>>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:205) ~[?:?]
>>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:142) ~[?:?]
>>         at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:142) ~[?:?]
>>         at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106) ~[?:?]
>>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
>>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[?:?]
>>         at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74) ~[?:?]
>>         at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235) ~[?:?]
>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226) ~[mcf-agents.jar:?]
>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077) ~[mcf-agents.jar:?]
>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708) ~[mcf-agents.jar:?]
>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756) ~[mcf-agents.jar:?]
>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583) ~[mcf-pull-agent.jar:?]
>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548) ~[mcf-pull-agent.jar:?]
>>         at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:939) ~[?:?]
>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]
>>
>> Do I need to create an incident ticket?
>>
>> ----------
>>
>> Thanks for your help.
>>
>> Regards,
>>
>> [image: msaunier]
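
The jar-placement check suggested earlier in the thread (all poi* jars should sit in connector-common-lib, and any copy elsewhere is suspect) could be scripted roughly as follows. This is only a minimal sketch: the function name `find_stray_poi_jars` is made up for illustration, and the install root you pass in is an assumption about where ManifoldCF is unpacked on your machine.

```python
from pathlib import Path


def find_stray_poi_jars(install_root, expected_dir="connector-common-lib"):
    """Return poi*.jar files under install_root that are NOT directly
    inside a directory named expected_dir.

    install_root is an assumption: point it at wherever ManifoldCF
    is unpacked. The function name is hypothetical, not part of MCF.
    """
    root = Path(install_root)
    stray = []
    # rglob walks the whole tree and matches on the file name only.
    for jar in sorted(root.rglob("poi*.jar")):
        if jar.parent.name != expected_dir:
            stray.append(jar)
    return stray
```

Calling `find_stray_poi_jars("/path/to/manifoldcf")` and printing the result lists every poi jar outside connector-common-lib. An empty list means the layout matches what the thread describes; it does not by itself rule out a POI/Tika version mismatch.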
