[jira] [Commented] (TIKA-1946) Add mime detection and parser for WordPerfect
[ https://issues.apache.org/jira/browse/TIKA-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15766083#comment-15766083 ] Pascal Essiembre commented on TIKA-1946: It now throws a TikaException as you suggest. For child mime-types, I am not sure what they would be. Given different QuattroPro formats seem to share the same mimetype, would we come up with some? I am not sure what's the general practice in this case. > Add mime detection and parser for WordPerfect > - > > Key: TIKA-1946 > URL: https://issues.apache.org/jira/browse/TIKA-1946 > Project: Tika > Issue Type: Improvement > Components: mime, parser >Reporter: Nick C > > I noticed some code on github for parsing WordPerfect files > (https://github.com/Norconex/importer) Also looks like the author > [~pascal.essiembre] has contributed to Tika before -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1788) message/rfc822 parser doesn't identify attachment filenames from Content-Disposition header
[ https://issues.apache.org/jira/browse/TIKA-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15766074#comment-15766074 ]

Derek Hardison edited comment on TIKA-1788 at 12/21/16 4:24 AM:

The * indicates that the name is wrapped, and there can be any number of continuations: filename*0, filename*1, etc. end up being concatenated into a single filename. There are also some rules on where the content-type etc. is located when that happens. Examples are fairly easy to find, e.g. in Exchange or Gmail e-mails. You can find some information about it here:
- https://tools.ietf.org/html/rfc2231#section-4
- https://tools.ietf.org/html/rfc5987#section-3.2.2

was (Author: derek.hardison):
The * indicates the name is wrapped and it can be any number of continuations... i.e. filename*0, filename*1, etc. You can find some information about it here;
- https://tools.ietf.org/html/rfc2231#section-4
- https://tools.ietf.org/html/rfc5987#section-3.2.2

> message/rfc822 parser doesn't identify attachment filenames from Content-Disposition header
> ---
>
> Key: TIKA-1788
> URL: https://issues.apache.org/jira/browse/TIKA-1788
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.11
> Reporter: Sergey Tsalkov
> Attachments: grep_content_disposition.zip
>
> rfc822 email files can contain attachments as subparts, and they'll
> generally specify the filename of the attachment in a manner like this:
> Content-Disposition: attachment;
>     filename*=utf-8''image001.jpg
> Tika doesn't seem to be grabbing that information at all!
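To make the continuation mechanism from the comment above concrete, here is a minimal sketch of how RFC 2231 parameters might be reassembled into one filename. It is not Tika or mime4j code: the class and method names are hypothetical, the parameters are assumed to be already split into a name/value map with quotes removed, and the default to UTF-8 for an empty charset is an assumption. Each encoded segment is decoded on its own, so a multi-byte character split across two segments is not handled.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Tika or mime4j.
final class Rfc2231FilenameSketch {

    /** Reassembles a filename from already-split Content-Disposition parameters. */
    static String decodeFilename(Map<String, String> params) {
        // Case 1: a single extended parameter, e.g. filename*=utf-8''image001.jpg
        String extended = params.get("filename*");
        if (extended != null) {
            String[] parts = extended.split("'", 3); // charset ' language ' value
            return parts.length == 3 ? percentDecode(parts[2], parts[0]) : extended;
        }
        // Case 2: numbered continuations filename*0, filename*1, ...
        // A trailing * (filename*0*) marks a percent-encoded segment.
        StringBuilder name = new StringBuilder();
        String charset = "";
        for (int i = 0; ; i++) {
            String plain = params.get("filename*" + i);
            String encoded = params.get("filename*" + i + "*");
            if (plain == null && encoded == null) {
                break; // no more segments
            }
            if (encoded == null) {
                name.append(plain);
                continue;
            }
            String value = encoded;
            if (i == 0) {
                // Only the first segment carries the charset'language' prefix.
                String[] parts = encoded.split("'", 3);
                if (parts.length == 3) {
                    charset = parts[0];
                    value = parts[2];
                }
            }
            name.append(percentDecode(value, charset));
        }
        // Case 3: fall back to the plain parameter (may be null).
        return name.length() > 0 ? name.toString() : params.get("filename");
    }

    private static String percentDecode(String value, String charset) {
        try {
            // Assumption: treat an empty charset as UTF-8.
            String cs = charset.isEmpty() ? "UTF-8" : charset;
            // URLDecoder turns '+' into a space, which RFC 2231 does not, so protect it first.
            return URLDecoder.decode(value.replace("+", "%2B"), cs);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }
}
```

With the header from the issue description, a map holding `filename*` = `utf-8''image001.jpg` would yield `image001.jpg`.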
[jira] [Commented] (TIKA-1788) message/rfc822 parser doesn't identify attachment filenames from Content-Disposition header
[ https://issues.apache.org/jira/browse/TIKA-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15766074#comment-15766074 ] Derek Hardison commented on TIKA-1788: -- The * indicates the name is wrapped and it can be any number of continuations... i.e. filename*0, filename*1, etc. You can find some information about it here; - https://tools.ietf.org/html/rfc2231#section-4 - https://tools.ietf.org/html/rfc5987#section-3.2.2 > message/rfc822 parser doesn't identify attachment filenames from > Content-Disposition header > --- > > Key: TIKA-1788 > URL: https://issues.apache.org/jira/browse/TIKA-1788 > Project: Tika > Issue Type: Bug >Affects Versions: 1.11 >Reporter: Sergey Tsalkov > Attachments: grep_content_disposition.zip > > > rfc822 email files can contain attachments as subparts, and they'll > generally specify the filename of the attachment in a manner like > this: > Content-Disposition: attachment; > filename*=utf-8''image001.jpg > Tika doesn't seem to be grabbing that information at all! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (TIKA-2094) Error parsing .doc file with visio embed
[ https://issues.apache.org/jira/browse/TIKA-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

wangruochan closed TIKA-2094.

verified as complete

> Error parsing .doc file with visio embed
>
> Key: TIKA-2094
> URL: https://issues.apache.org/jira/browse/TIKA-2094
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.13
> Environment: JDK7
> Reporter: wangruochan
> Attachments: testtika.doc, testtika.doc
>
> When I try to parse a .doc file with a Visio embed, an exception occurs.
> The stack trace is printed below:
> Exception in thread "main" java.lang.NoClassDefFoundError: com/microsoft/schemas/office/visio/x2012/main/ConnectsType
>     at com.microsoft.schemas.office.visio.x2012.main.impl.PageContentsTypeImpl.getConnects(Unknown Source)
>     at org.apache.poi.xdgf.usermodel.XDGFBaseContents.onDocumentRead(XDGFBaseContents.java:89)
>     at org.apache.poi.xdgf.usermodel.XDGFPageContents.onDocumentRead(XDGFPageContents.java:73)
>     at org.apache.poi.xdgf.usermodel.XDGFPages.onDocumentRead(XDGFPages.java:94)
>     at org.apache.poi.xdgf.usermodel.XmlVisioDocument.onDocumentRead(XmlVisioDocument.java:108)
>     at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
>     at org.apache.poi.xdgf.usermodel.XmlVisioDocument.<init>(XmlVisioDocument.java:79)
>     at org.apache.poi.xdgf.extractor.XDGFVisioExtractor.<init>(XDGFVisioExtractor.java:41)
>     at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:212)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>     at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
>     at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:140)
>     at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:164)
>     at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:208)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at test.apache.tika.Test.main(Test.java:29)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
> Caused by: java.lang.ClassNotFoundException: com.microsoft.schemas.office.visio.x2012.main.ConnectsType
>     at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>     at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>     ... 30 more
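The trace above ends in a ClassNotFoundException for the Visio schema class, which suggests the class was simply not visible to the application class loader. A small diagnostic along the following lines could confirm that before digging further. The class name `ClasspathCheckSketch` is hypothetical, and the thread does not say which jar is expected to provide `ConnectsType`, so this sketch only reports whether the class can be found.

```java
// Hypothetical diagnostic helper; it only checks visibility, it does not fix the classpath.
final class ClasspathCheckSketch {

    /** Returns true if the named class can be located without initializing it. */
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, ClasspathCheckSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name taken from the stack trace in TIKA-2094.
        String name = "com.microsoft.schemas.office.visio.x2012.main.ConnectsType";
        System.out.println(name + " present: " + isPresent(name));
    }
}
```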
[jira] [Commented] (TIKA-2190) Add "preserve_interword_spaces" option of tesseract
[ https://issues.apache.org/jira/browse/TIKA-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765820#comment-15765820 ]

Bipul Kumar commented on TIKA-2190:

Hi Tim,
If it is okay with you, may I take this up? I want to start contributing, and I can take this one.
Regards
Bipul

> Add "preserve_interword_spaces" option of tesseract
> ---
>
> Key: TIKA-2190
> URL: https://issues.apache.org/jira/browse/TIKA-2190
> Project: Tika
> Issue Type: Improvement
> Components: ocr
> Reporter: Bipul Kumar
> Assignee: Tim Allison
> Fix For: 2.0, 1.15
>
> This option will preserve the spaces for TXT output type so that the layout
> or context can be inferred while further parsing.
> to enable :: -c preserve_interword_spaces=1
> to disable :: -c preserve_interword_spaces=0 or simply don't mention
[jira] [Commented] (TIKA-1946) Add mime detection and parser for WordPerfect
[ https://issues.apache.org/jira/browse/TIKA-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765696#comment-15765696 ]

Luis Filipe Nassif commented on TIKA-1946:

Thank you, Pascal! I think it may be better to throw a TikaException when parsing unsupported files, so that client code knows about it and can take other action, e.g. run a fallback parser such as Latin1StringsParser. If the files have different magic, it would be better to split the mimetype into child types and configure the parser only for the supported child mimetype.

> Add mime detection and parser for WordPerfect
> ---
>
> Key: TIKA-1946
> URL: https://issues.apache.org/jira/browse/TIKA-1946
> Project: Tika
> Issue Type: Improvement
> Components: mime, parser
> Reporter: Nick C
>
> I noticed some code on github for parsing WordPerfect files
> (https://github.com/Norconex/importer) Also looks like the author
> [~pascal.essiembre] has contributed to Tika before
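The suggestion above (fail loudly so the caller can fall back) can be sketched roughly as follows. Everything here is a simplified stand-in so that the example compiles without Tika: `UnsupportedFormatException` plays the role of TikaException, `TextExtractor` is a hypothetical one-method contract rather than Tika's Parser interface, and the Latin-1 fallback is only a crude approximation of what a strings parser might do.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: all types below are hypothetical stand-ins, not Tika classes.
final class FallbackParsingSketch {

    /** Stand-in for TikaException so the sketch compiles without Tika on the classpath. */
    static class UnsupportedFormatException extends Exception {
        UnsupportedFormatException(String message) {
            super(message);
        }
    }

    /** Simplified parser contract: return extracted text or signal an unsupported format. */
    interface TextExtractor {
        String extract(byte[] data) throws UnsupportedFormatException;
    }

    /** Crude fallback: decode every byte as ISO-8859-1. */
    static final TextExtractor LATIN1_FALLBACK =
            data -> new String(data, StandardCharsets.ISO_8859_1);

    /** Tries the primary extractor and runs the fallback only if the primary rejects the input. */
    static String extractWithFallback(byte[] data, TextExtractor primary, TextExtractor fallback) {
        try {
            return primary.extract(data);
        } catch (UnsupportedFormatException primaryFailure) {
            try {
                return fallback.extract(data);
            } catch (UnsupportedFormatException fallbackFailure) {
                throw new IllegalStateException("No parser could handle the input", fallbackFailure);
            }
        }
    }
}
```

If the parser only logged a message instead of throwing, `extractWithFallback` would have no way of knowing that the primary result is empty because the format is unsupported.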
[jira] [Commented] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset
[ https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765514#comment-15765514 ] Hudson commented on TIKA-2219: -- SUCCESS: Integrated in Jenkins build tika-2.x #185 (See [https://builds.apache.org/job/tika-2.x/185/]) TIKA-2219 make sure to transmit charset name in detectAll via Pascal (tallison: rev 54154e0045066dfb50a10d158090262acaabaaba) * (edit) tika-parser-modules/tika-parser-text-module/src/test/java/org/apache/tika/parser/txt/CharsetDetectorTest.java > CharsetDetector no longer detects windows-1252 charset > -- > > Key: TIKA-2219 > URL: https://issues.apache.org/jira/browse/TIKA-2219 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.14 > Environment: Any. >Reporter: Pascal Essiembre >Priority: Minor > Fix For: 2.0, 1.15 > > > Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is > always detected instead. While not tested, this likely affects other > windows-125* encodings as well. > I tracked it down to a change in the > {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method. Now it always > returns "ISO-8859-1" whereas before it was: {{return haveC1Bytes ? > "windows-1252" : "ISO-8859-1";}} > Now that condition has been moved to the {{match(CharsetDetector det)}} > method so that the returned CharsetMatch has the proper name. The problem > with that is {{CharsetDetector#detectAll()}} method overwrites the correct > match with a new one that will return the value of {{#getName()}} from the > {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case). 
> There might be legitimate reasons why the {{CharsetMatch}} instances in > {{detectAll()}} method are replaced with new ones, but changing this code in > that method appears to work for me: > // Remove this: > //CharsetMatch m = new CharsetMatch(this, csr, > confidence); > //matches.add(m); > // Add this instead: > matches.add(charsetMatch); -- This message was sent by Atlassian JIRA (v6.3.4#6332)
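For readers unfamiliar with the `haveC1Bytes` condition quoted in the issue, the following is a minimal sketch of the idea as I understand it: bytes in the range 0x80 to 0x9F are C1 control codes in ISO-8859-1 but printable characters (curly quotes, dashes and so on) in windows-1252, so their presence hints at the latter. The class name is hypothetical and the check is an assumption about the recognizer's logic, not a copy of it.

```java
// Hypothetical illustration of the haveC1Bytes idea; not the ICU or Tika implementation.
final class C1ByteCheckSketch {

    /** Picks between the two names based on whether any byte falls in 0x80..0x9F. */
    static String guessName(byte[] input) {
        for (byte b : input) {
            int value = b & 0xFF; // treat the byte as unsigned
            if (value >= 0x80 && value <= 0x9F) {
                return "windows-1252";
            }
        }
        return "ISO-8859-1";
    }
}
```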
[jira] [Commented] (TIKA-2189) Default value mismatch for "enableImageProcessing" in TesseractOCRConfig.properties and TesseractOCRConfig.java
[ https://issues.apache.org/jira/browse/TIKA-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765501#comment-15765501 ] Hudson commented on TIKA-2189: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1164 (See [https://builds.apache.org/job/Tika-trunk/1164/]) [TIKA-2189] fix for Default value mismatch for "enableImageProcessing" (kumarbipuldas: rev 40401e51b8a135634f54c8c437dafc10378059be) * (edit) tika-parsers/src/main/resources/org/apache/tika/parser/ocr/TesseractOCRConfig.properties add comment on outputType and trigger close of TIKA-2189. This closes (tallison: rev aa2407a6ba9826684df81af60c58b476e193a830) * (edit) tika-parsers/src/main/resources/org/apache/tika/parser/ocr/TesseractOCRConfig.properties > Default value mismatch for "enableImageProcessing" in > TesseractOCRConfig.properties and TesseractOCRConfig.java > > > Key: TIKA-2189 > URL: https://issues.apache.org/jira/browse/TIKA-2189 > Project: Tika > Issue Type: Bug > Components: ocr, parser >Affects Versions: 1.14 >Reporter: Bipul Kumar >Priority: Minor > Labels: ocr > Fix For: 1.15 > > > Default value of "enableImageProcessing" should be "0" (image processing not > required by default) in TesseractOCRConfig.properties as same as > TesseractOCRConfig.java. > That value "1" in TesseractOCRConfig.properties is overriding the default at > runtime. As per Javadoc, it is optional. > /** >* Set the value to true if processing is to be enabled. >* Default value is false. >*/ > public void setEnableImageProcessing(int enableImageProcessing) { > this.enableImageProcessing = enableImageProcessing; > } -- This message was sent by Atlassian JIRA (v6.3.4#6332)
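The mismatch described in TIKA-2189 comes down to a bundled properties file silently overriding the default set in code. A minimal sketch of that effect, using only `java.util.Properties`, is shown below. The class `DefaultOverrideSketch` and its method are hypothetical; only the key name `enableImageProcessing` and the values 0 and 1 come from the issue.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical illustration; not the actual TesseractOCRConfig loading code.
final class DefaultOverrideSketch {

    /** Default as declared in code (image processing disabled). */
    static final int CODE_DEFAULT = 0;

    /** Returns the value that wins once the properties text has been applied. */
    static int effectiveValue(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String raw = props.getProperty("enableImageProcessing");
        // A value in the properties file overrides the code default.
        return raw == null ? CODE_DEFAULT : Integer.parseInt(raw.trim());
    }
}
```

With `enableImageProcessing=1` in the file the effective value is 1 even though the code default is 0, which is why the fix was made in the properties file.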
[jira] [Commented] (TIKA-2190) Add "preserve_interword_spaces" option of tesseract
[ https://issues.apache.org/jira/browse/TIKA-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765500#comment-15765500 ] Hudson commented on TIKA-2190: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1164 (See [https://builds.apache.org/job/Tika-trunk/1164/]) TIKA-2190 -- add configurability for preserve interword spacing (tallison: rev ae44b9e507dbb11b9b9f5c57cf342b47966ffb66) * (edit) tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java * (edit) tika-parsers/src/main/resources/org/apache/tika/parser/ocr/TesseractOCRConfig.properties * (edit) CHANGES.txt * (add) tika-parsers/src/test/resources/test-documents/testOCR_spacing.png * (edit) tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java * (edit) tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java > Add "preserve_interword_spaces" option of tesseract > --- > > Key: TIKA-2190 > URL: https://issues.apache.org/jira/browse/TIKA-2190 > Project: Tika > Issue Type: Improvement > Components: ocr >Reporter: Bipul Kumar >Assignee: Tim Allison > Fix For: 2.0, 1.15 > > > This option will preserve the spaces for TXT output type so that the layout > or context can be inferred while further parsing. > to enable :: -c preserve_interword_spaces=1 > to disable :: -c preserve_interword_spaces=0 or simply don't mention -- This message was sent by Atlassian JIRA (v6.3.4#6332)
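As a rough illustration of how the `-c preserve_interword_spaces=1` switch from the issue might be appended to a Tesseract command line, here is a small sketch. The class, method and parameter names are hypothetical and this is not the code committed to TesseractOCRParser; it only builds the argument list and does not launch a process.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical argument builder; not the committed TesseractOCRParser code.
final class TesseractArgsSketch {

    /** Builds "tesseract <image> <outputBase>" and adds the -c option only when enabled. */
    static List<String> buildCommand(String image, String outputBase, boolean preserveInterwordSpaces) {
        List<String> command = new ArrayList<>();
        command.add("tesseract");
        command.add(image);
        command.add(outputBase);
        if (preserveInterwordSpaces) {
            command.add("-c");
            command.add("preserve_interword_spaces=1");
        }
        // When disabled, the option is simply not mentioned, as the issue suggests.
        return command;
    }
}
```

The resulting list could be handed to `ProcessBuilder` by a caller that actually wants to run the OCR step.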
[jira] [Commented] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset
[ https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765412#comment-15765412 ] Hudson commented on TIKA-2219: -- FAILURE: Integrated in Jenkins build tika-2.x-windows #86 (See [https://builds.apache.org/job/tika-2.x-windows/86/]) TIKA-2219 make sure to transmit charset name in detectAll via Pascal (tallison: rev 54154e0045066dfb50a10d158090262acaabaaba) * (edit) tika-parser-modules/tika-parser-text-module/src/test/java/org/apache/tika/parser/txt/CharsetDetectorTest.java > CharsetDetector no longer detects windows-1252 charset > -- > > Key: TIKA-2219 > URL: https://issues.apache.org/jira/browse/TIKA-2219 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.14 > Environment: Any. >Reporter: Pascal Essiembre >Priority: Minor > Fix For: 2.0, 1.15 > > > Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is > always detected instead. While not tested, this likely affects other > windows-125* encodings as well. > I tracked it down to a change in the > {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method. Now it always > returns "ISO-8859-1" whereas before it was: {{return haveC1Bytes ? > "windows-1252" : "ISO-8859-1";}} > Now that condition has been moved to the {{match(CharsetDetector det)}} > method so that the returned CharsetMatch has the proper name. The problem > with that is {{CharsetDetector#detectAll()}} method overwrites the correct > match with a new one that will return the value of {{#getName()}} from the > {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case). 
> There might be legitimate reasons why the {{CharsetMatch}} instances in > {{detectAll()}} method are replaced with new ones, but changing this code in > that method appears to work for me: > // Remove this: > //CharsetMatch m = new CharsetMatch(this, csr, > confidence); > //matches.add(m); > // Add this instead: > matches.add(charsetMatch); -- This message was sent by Atlassian JIRA (v6.3.4#6332)
tika-2.x-windows - Build # 86 - Still Failing
The Apache Jenkins build system has built tika-2.x-windows (build #86) Status: Still Failing Check console output at https://builds.apache.org/job/tika-2.x-windows/86/ to view the results.
[jira] [Commented] (TIKA-1946) Add mime detection and parser for WordPerfect
[ https://issues.apache.org/jira/browse/TIKA-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765359#comment-15765359 ] Tim Allison commented on TIKA-1946: --- W00t! Christmas came early. I'll take a look tomorrow. Thank you! > Add mime detection and parser for WordPerfect > - > > Key: TIKA-1946 > URL: https://issues.apache.org/jira/browse/TIKA-1946 > Project: Tika > Issue Type: Improvement > Components: mime, parser >Reporter: Nick C > > I noticed some code on github for parsing WordPerfect files > (https://github.com/Norconex/importer) Also looks like the author > [~pascal.essiembre] has contributed to Tika before -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1946) Add mime detection and parser for WordPerfect
[ https://issues.apache.org/jira/browse/TIKA-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765348#comment-15765348 ]

Pascal Essiembre commented on TIKA-1946:

I finally had a bit of time to port the WordPerfect parser to the project. I also added a Quattro Pro parser (also from the WordPerfect Office suite). As this is the first time I have made a pull request for Tika, let me know if anything is not proper. The QuattroPro parser only supports *.qpw files, but since other QuattroPro formats share the same mime-type, the parser will be invoked for those formats as well (*.wb?). I added a check in the parser code that simply logs a message stating that the format is unsupported when one is encountered. If you have a better approach to suggest, let me know.

> Add mime detection and parser for WordPerfect
> ---
>
> Key: TIKA-1946
> URL: https://issues.apache.org/jira/browse/TIKA-1946
> Project: Tika
> Issue Type: Improvement
> Components: mime, parser
> Reporter: Nick C
>
> I noticed some code on github for parsing WordPerfect files
> (https://github.com/Norconex/importer) Also looks like the author
> [~pascal.essiembre] has contributed to Tika before
[jira] [Comment Edited] (TIKA-1946) Add mime detection and parser for WordPerfect
[ https://issues.apache.org/jira/browse/TIKA-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765348#comment-15765348 ]

Pascal Essiembre edited comment on TIKA-1946 at 12/20/16 9:51 PM:

I finally had a bit of time to port the WordPerfect parser to the project. I also added a Quattro Pro parser (also from the WordPerfect Office suite). As this is the first time I have made a pull request for Tika, let me know if anything is not proper. The QuattroPro parser only supports .qpw files, but since other QuattroPro formats share the same mime-type, the parser will be invoked for those formats as well (.wb?). I added a check in the parser code that simply logs a message stating that the format is unsupported when one is encountered. If you have a better approach to suggest, let me know.

was (Author: pascal.essiembre):
I finally had a bit of time to port the WordPerfect parser to the project. I also added a Quattro Pro parser (also from WordPerfect Office suite). As it is the first time I make a pull-request for Tika, let me know if anything is not proper. The QuattroPro parser only supports *.qpw files, but since other QuatroPro formats share the same mime-type, the parser will be invoked for other formats as well (*.wb?). I added a check in the parser code that will simply log a message stating the format is unsupported when encountered. If you have a better approach to suggest let me know.

> Add mime detection and parser for WordPerfect
> ---
>
> Key: TIKA-1946
> URL: https://issues.apache.org/jira/browse/TIKA-1946
> Project: Tika
> Issue Type: Improvement
> Components: mime, parser
> Reporter: Nick C
>
> I noticed some code on github for parsing WordPerfect files
> (https://github.com/Norconex/importer) Also looks like the author
> [~pascal.essiembre] has contributed to Tika before
[jira] [Commented] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset
[ https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765347#comment-15765347 ]

Tim Allison commented on TIKA-2219:

Looks like they aren't twiddling with the confidence scores any more [repo|http://bugs.icu-project.org/trac/browser/icu4j/tags/release-58-1/main/classes/core/src/com/ibm/icu/text/CharsetDetector.java]
{noformat}
for (int i = 0; i < ALL_CS_RECOGNIZERS.size(); i++) {
    CSRecognizerInfo rcinfo = ALL_CS_RECOGNIZERS.get(i);
    boolean active = (fEnabledRecognizers != null) ? fEnabledRecognizers[i] : rcinfo.isDefaultEnabled;
    if (active) {
        CharsetMatch m = rcinfo.recognizer.match(this);
        if (m != null) {
            matches.add(m);
        }
    }
}
{noformat}

> CharsetDetector no longer detects windows-1252 charset
> ---
>
> Key: TIKA-2219
> URL: https://issues.apache.org/jira/browse/TIKA-2219
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Environment: Any.
> Reporter: Pascal Essiembre
> Priority: Minor
> Fix For: 2.0, 1.15
>
> Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is
> always detected instead. While not tested, this likely affects other
> windows-125* encodings as well.
> I tracked it down to a change in the
> {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method. Now it always
> returns "ISO-8859-1" whereas before it was:
> {{return haveC1Bytes ? "windows-1252" : "ISO-8859-1";}}
> Now that condition has been moved to the {{match(CharsetDetector det)}}
> method so that the returned CharsetMatch has the proper name. The problem
> with that is {{CharsetDetector#detectAll()}} method overwrites the correct
> match with a new one that will return the value of {{#getName()}} from the
> {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case).
> There might be legitimate reasons why the {{CharsetMatch}} instances in
> {{detectAll()}} method are replaced with new ones, but changing this code in
> that method appears to work for me:
> // Remove this:
> //CharsetMatch m = new CharsetMatch(this, csr, confidence);
> //matches.add(m);
> // Add this instead:
> matches.add(charsetMatch);
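To illustrate why keeping the recognizer's own match matters, here is a toy model of the two `detectAll()` variants discussed in this issue. It is a deliberately simplified sketch: `Match`, `Recognizer` and both methods are hypothetical stand-ins for the real `CharsetMatch`, `CharsetRecognizer` and `CharsetDetector`, and the confidence value of 50 is arbitrary.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the behaviour described in TIKA-2219; not the ICU or Tika classes.
final class DetectAllSketch {

    static final class Match {
        final String name;
        final int confidence;

        Match(String name, int confidence) {
            this.name = name;
            this.confidence = confidence;
        }
    }

    interface Recognizer {
        String getName();          // static name of the recognizer
        Match match(byte[] input); // null when the input does not match
    }

    /** Recognizer whose match() refines the name while getName() stays fixed. */
    static final Recognizer LATIN1 = new Recognizer() {
        public String getName() {
            return "ISO-8859-1";
        }

        public Match match(byte[] input) {
            boolean haveC1Bytes = false;
            for (byte b : input) {
                int value = b & 0xFF;
                if (value >= 0x80 && value <= 0x9F) {
                    haveC1Bytes = true;
                }
            }
            return new Match(haveC1Bytes ? "windows-1252" : "ISO-8859-1", 50);
        }
    };

    /** Variant that rebuilds each match from getName(), losing the refined name. */
    static List<Match> detectAllRebuilding(byte[] input, List<Recognizer> recognizers) {
        List<Match> matches = new ArrayList<>();
        for (Recognizer r : recognizers) {
            Match m = r.match(input);
            if (m != null) {
                matches.add(new Match(r.getName(), m.confidence));
            }
        }
        return matches;
    }

    /** Variant that keeps the match returned by the recognizer. */
    static List<Match> detectAllKeeping(byte[] input, List<Recognizer> recognizers) {
        List<Match> matches = new ArrayList<>();
        for (Recognizer r : recognizers) {
            Match m = r.match(input);
            if (m != null) {
                matches.add(m);
            }
        }
        return matches;
    }
}
```

For input containing a byte such as 0x93, the rebuilding variant reports ISO-8859-1 while the keeping variant reports windows-1252, which mirrors the behaviour change reported here.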
[jira] [Comment Edited] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset
[ https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765347#comment-15765347 ] Tim Allison edited comment on TIKA-2219 at 12/20/16 9:51 PM: - Looks like they aren't twiddling with the confidence scores any more (see: [repo|http://bugs.icu-project.org/trac/browser/icu4j/tags/release-58-1/main/classes/core/src/com/ibm/icu/text/CharsetDetector.java]): {noformat} 193 for (int i = 0; i < ALL_CS_RECOGNIZERS.size(); i++) { 194 CSRecognizerInfo rcinfo = ALL_CS_RECOGNIZERS.get(i); 195 boolean active = (fEnabledRecognizers != null) ? fEnabledRecognizers[i] : rcinfo.isDefaultEnabled; 196 if (active) { 197 CharsetMatch m = rcinfo.recognizer.match(this); 198 if (m != null) { 199 matches.add(m); 200 } 201 } 202 } {noformat} was (Author: talli...@mitre.org): Looks like they aren't twiddling with the confidence scores any more [repo|http://bugs.icu-project.org/trac/browser/icu4j/tags/release-58-1/main/classes/core/src/com/ibm/icu/text/CharsetDetector.java] {noformat} 193 for (int i = 0; i < ALL_CS_RECOGNIZERS.size(); i++) { 194 CSRecognizerInfo rcinfo = ALL_CS_RECOGNIZERS.get(i); 195 boolean active = (fEnabledRecognizers != null) ? fEnabledRecognizers[i] : rcinfo.isDefaultEnabled; 196 if (active) { 197 CharsetMatch m = rcinfo.recognizer.match(this); 198 if (m != null) { 199 matches.add(m); 200 } 201 } 202 } {noformat} > CharsetDetector no longer detects windows-1252 charset > -- > > Key: TIKA-2219 > URL: https://issues.apache.org/jira/browse/TIKA-2219 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.14 > Environment: Any. >Reporter: Pascal Essiembre >Priority: Minor > Fix For: 2.0, 1.15 > > > Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is > always detected instead. While not tested, this likely affects other > windows-125* encodings as well. > I tracked it down to a change in the > {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method. 
Now it always > returns "ISO-8859-1" whereas before it was: {{return haveC1Bytes ? > "windows-1252" : "ISO-8859-1";}} > Now that condition has been moved to the {{match(CharsetDetector det)}} > method so that the returned CharsetMatch has the proper name. The problem > with that is {{CharsetDetector#detectAll()}} method overwrites the correct > match with a new one that will return the value of {{#getName()}} from the > {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case). > There might be legitimate reasons why the {{CharsetMatch}} instances in > {{detectAll()}} method are replaced with new ones, but changing this code in > that method appears to work for me: > // Remove this: > //CharsetMatch m = new CharsetMatch(this, csr, > confidence); > //matches.add(m); > // Add this instead: > matches.add(charsetMatch); -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset
[ https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765343#comment-15765343 ] Tim Allison commented on TIKA-2219: --- Great. Thank you! > CharsetDetector no longer detects windows-1252 charset > -- > > Key: TIKA-2219 > URL: https://issues.apache.org/jira/browse/TIKA-2219 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.14 > Environment: Any. >Reporter: Pascal Essiembre >Priority: Minor > Fix For: 2.0, 1.15 > > > Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is > always detected instead. While not tested, this likely affects other > windows-125* encodings as well. > I tracked it down to a change in the > {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method. Now it always > returns "ISO-8859-1" whereas before it was: {{return haveC1Bytes ? > "windows-1252" : "ISO-8859-1";}} > Now that condition has been moved to the {{match(CharsetDetector det)}} > method so that the returned CharsetMatch has the proper name. The problem > with that is {{CharsetDetector#detectAll()}} method overwrites the correct > match with a new one that will return the value of {{#getName()}} from the > {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case). > There might be legitimate reasons why the {{CharsetMatch}} instances in > {{detectAll()}} method are replaced with new ones, but changing this code in > that method appears to work for me: > // Remove this: > //CharsetMatch m = new CharsetMatch(this, csr, > confidence); > //matches.add(m); > // Add this instead: > matches.add(charsetMatch); -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1946) Add mime detection and parser for WordPerfect
[ https://issues.apache.org/jira/browse/TIKA-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765330#comment-15765330 ]

ASF GitHub Bot commented on TIKA-1946:

GitHub user essiembre opened a pull request:

    https://github.com/apache/tika/pull/141

    New WordPerfect and QuattroPro parsers for TIKA-1946 contributed by pascal.essiembre

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/essiembre/tika TIKA-1946

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/141.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #141

commit 87c2ef3191d0a86502dc249240022b3cc973aaa4
Author: Pascal Essiembre
Date: 2016-12-20T20:42:39Z

    New WordPerfect and QuattroPro parsers for TIKA-1946 contributed by pascal.essiembre

> Add mime detection and parser for WordPerfect
> ---
>
> Key: TIKA-1946
> URL: https://issues.apache.org/jira/browse/TIKA-1946
> Project: Tika
> Issue Type: Improvement
> Components: mime, parser
> Reporter: Nick C
>
> I noticed some code on github for parsing WordPerfect files
> (https://github.com/Norconex/importer) Also looks like the author
> [~pascal.essiembre] has contributed to Tika before
[GitHub] tika pull request #141: New WordPerfect and QuattroPro parsers for TIKA-1946...
GitHub user essiembre opened a pull request:

    https://github.com/apache/tika/pull/141

    New WordPerfect and QuattroPro parsers for TIKA-1946 contributed by pascal.essiembre

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/essiembre/tika TIKA-1946

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/141.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #141

commit 87c2ef3191d0a86502dc249240022b3cc973aaa4
Author: Pascal Essiembre
Date: 2016-12-20T20:42:39Z

    New WordPerfect and QuattroPro parsers for TIKA-1946 contributed by pascal.essiembre

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
[jira] [Resolved] (TIKA-2189) Default value mismatch for "enableImageProcessing" in TesseractOCRConfig.properties and TesseractOCRConfig.java
[ https://issues.apache.org/jira/browse/TIKA-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2189. --- Resolution: Fixed Fix Version/s: 1.15 Thank you! > Default value mismatch for "enableImageProcessing" in > TesseractOCRConfig.properties and TesseractOCRConfig.java > > > Key: TIKA-2189 > URL: https://issues.apache.org/jira/browse/TIKA-2189 > Project: Tika > Issue Type: Bug > Components: ocr, parser >Affects Versions: 1.14 >Reporter: Bipul Kumar >Priority: Minor > Labels: ocr > Fix For: 1.15 > > > Default value of "enableImageProcessing" should be "0" (image processing not > required by default) in TesseractOCRConfig.properties as same as > TesseractOCRConfig.java. > That value "1" in TesseractOCRConfig.properties is overriding the default at > runtime. As per Javadoc, it is optional. > /** >* Set the value to true if processing is to be enabled. >* Default value is false. >*/ > public void setEnableImageProcessing(int enableImageProcessing) { > this.enableImageProcessing = enableImageProcessing; > } -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-2190) Add "preserve_interword_spaces" option of tesseract
[ https://issues.apache.org/jira/browse/TIKA-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2190. --- Resolution: Fixed Fix Version/s: 1.15 2.0 Thank you! > Add "preserve_interword_spaces" option of tesseract > --- > > Key: TIKA-2190 > URL: https://issues.apache.org/jira/browse/TIKA-2190 > Project: Tika > Issue Type: Improvement > Components: ocr >Reporter: Bipul Kumar >Assignee: Tim Allison > Fix For: 2.0, 1.15 > > > This option will preserve the spaces for TXT output type so that the layout > or context can be inferred while further parsing. > to enable :: -c preserve_interword_spaces=1 > to disable :: -c preserve_interword_spaces=0 or simply don't mention -- This message was sent by Atlassian JIRA (v6.3.4#6332)
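As a rough sketch of how the option in TIKA-2190 might be passed through, the helper below builds a tesseract command line and appends "-c preserve_interword_spaces=1" only when requested. The class and method names are hypothetical and do not reflect how Tika's parser assembles its command; only the "-c VAR=VALUE" form comes from the issue text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Tika.
class TesseractArgsSketch {
    // Builds the argument list: tesseract <input> <outputBase> [-c preserve_interword_spaces=1]
    static List<String> buildCommand(String input, String outputBase,
                                     boolean preserveInterwordSpaces) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(input);
        cmd.add(outputBase);
        if (preserveInterwordSpaces) {
            // Enabled: pass the config variable explicitly.
            cmd.add("-c");
            cmd.add("preserve_interword_spaces=1");
        }
        // Disabled: simply don't mention the variable, as the issue suggests.
        return cmd;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", buildCommand("page.png", "page", true)));
    }
}
```

The resulting list could be handed to a ProcessBuilder; omitting the flag when disabled keeps the command identical to the pre-existing behaviour.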
[GitHub] tika pull request #139: [TIKA-2189] fix for Default value mismatch for "enab...
Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/139
[jira] [Commented] (TIKA-2189) Default value mismatch for "enableImageProcessing" in TesseractOCRConfig.properties and TesseractOCRConfig.java
[ https://issues.apache.org/jira/browse/TIKA-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765300#comment-15765300 ] ASF GitHub Bot commented on TIKA-2189: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/139 > Default value mismatch for "enableImageProcessing" in > TesseractOCRConfig.properties and TesseractOCRConfig.java > > > Key: TIKA-2189 > URL: https://issues.apache.org/jira/browse/TIKA-2189 > Project: Tika > Issue Type: Bug > Components: ocr, parser >Affects Versions: 1.14 >Reporter: Bipul Kumar >Priority: Minor > Labels: ocr > > Default value of "enableImageProcessing" should be "0" (image processing not > required by default) in TesseractOCRConfig.properties as same as > TesseractOCRConfig.java. > That value "1" in TesseractOCRConfig.properties is overriding the default at > runtime. As per Javadoc, it is optional. > /** >* Set the value to true if processing is to be enabled. >* Default value is false. >*/ > public void setEnableImageProcessing(int enableImageProcessing) { > this.enableImageProcessing = enableImageProcessing; > } -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2221) poi.EncryptedDocumentException not wrapped in tika.exception.EncryptedDocumentException
[ https://issues.apache.org/jira/browse/TIKA-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765208#comment-15765208 ] Hudson commented on TIKA-2221: -- UNSTABLE: Integrated in Jenkins build tika-2.x #184 (See [https://builds.apache.org/job/tika-2.x/184/]) TIKA-2221 -- correctly catch and rethrow encrypted document exception (tallison: rev ee761ac00c1dcc80f6c4030fe81a8780c5ac9d7e) * (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java * (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java > poi.EncryptedDocumentException not wrapped in > tika.exception.EncryptedDocumentException > --- > > Key: TIKA-2221 > URL: https://issues.apache.org/jira/browse/TIKA-2221 > Project: Tika > Issue Type: Bug >Affects Versions: 1.14 >Reporter: Matthew Caruana Galizia >Priority: Minor > Labels: encryption, office, poi > Fix For: 2.0, 1.15 > > > When parsing an encrypted Word document, a > org.apache.poi.EncryptedDocumentException is thrown at > WordExtractor.java#151. Tika catches this too far up the stack and > incorrectly wraps it in a plain TikaException instead of a > org.apache.tika.exception.EncryptedDocumentException. > The fix would be to catch and wrap the exception correctly, for example: > {noformat} > try { > document = new HWPFDocument(root); > } catch (org.apache.poi.EncryptedDocumentException e) { > throw new EncryptedDocumentException(e); > } catch (OldWordFileFormatException e) { > parseWord6(root, xhtml); > return; > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
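The catch-and-wrap pattern proposed in TIKA-2221 can be tried in isolation with the sketch below. All exception classes here are hypothetical stand-ins for the POI and Tika types named in the issue, and openDocument() merely simulates a library call that fails on an encrypted file.

```java
// Self-contained model of the wrapping pattern; the nested classes are
// hypothetical stand-ins, not the real POI or Tika exception types.
class WrapSketch {
    // Stand-in for the library's unchecked "encrypted" exception.
    static class LibraryEncryptedException extends RuntimeException {
        LibraryEncryptedException(String msg) { super(msg); }
    }

    // Stand-in for the application's general checked exception.
    static class AppException extends Exception {
        AppException(String msg, Throwable cause) { super(msg, cause); }
    }

    // Stand-in for the application's specific "encrypted" subtype.
    static class AppEncryptedException extends AppException {
        AppEncryptedException(Throwable cause) { super("Document is encrypted", cause); }
    }

    // Simulates the library call that opens a document.
    static void openDocument(boolean encrypted) {
        if (encrypted) {
            throw new LibraryEncryptedException("password protected");
        }
    }

    // Catches the specific exception close to the call site, before a
    // broader handler can wrap it in the general type.
    static void parse(boolean encrypted) throws AppException {
        try {
            openDocument(encrypted);
        } catch (LibraryEncryptedException e) {
            throw new AppEncryptedException(e);
        } catch (RuntimeException e) {
            throw new AppException("Unexpected failure", e);
        }
    }

    // Reports which exception type a caller would observe.
    static String classify(boolean encrypted) {
        try {
            parse(encrypted);
            return "ok";
        } catch (AppException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(true));
        System.out.println(classify(false));
    }
}
```

The ordering of the catch clauses matters: the specific type has to be handled before the general one, which mirrors the report's point that the exception was being caught too far up the stack.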
[jira] [Commented] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset
[ https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765207#comment-15765207 ] Hudson commented on TIKA-2219: -- UNSTABLE: Integrated in Jenkins build tika-2.x #184 (See [https://builds.apache.org/job/tika-2.x/184/]) TIKA-2219 make sure to transmit charset name in detectAll via Pascal (tallison: rev 68f3058643756d8e08f85903a585684f7d0f0b20) * (edit) tika-parser-modules/tika-parser-text-module/src/test/java/org/apache/tika/parser/txt/CharsetDetectorTest.java * (edit) tika-parser-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/txt/CharsetDetector.java * (add) tika-test-resources/src/test/resources/test-documents/testTXT_win-1252.txt > CharsetDetector no longer detects windows-1252 charset > -- > > Key: TIKA-2219 > URL: https://issues.apache.org/jira/browse/TIKA-2219 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.14 > Environment: Any. >Reporter: Pascal Essiembre >Priority: Minor > Fix For: 2.0, 1.15 > > > Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is > always detected instead. While not tested, this likely affects other > windows-125* encodings as well. > I tracked it down to a change in the > {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method. Now it always > returns "ISO-8859-1" whereas before it was: {{return haveC1Bytes ? > "windows-1252" : "ISO-8859-1";}} > Now that condition has been moved to the {{match(CharsetDetector det)}} > method so that the returned CharsetMatch has the proper name. The problem > with that is {{CharsetDetector#detectAll()}} method overwrites the correct > match with a new one that will return the value of {{#getName()}} from the > {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case). 
> There might be legitimate reasons why the {{CharsetMatch}} instances in > {{detectAll()}} method are replaced with new ones, but changing this code in > that method appears to work for me: > // Remove this: > //CharsetMatch m = new CharsetMatch(this, csr, > confidence); > //matches.add(m); > // Add this instead: > matches.add(charsetMatch); -- This message was sent by Atlassian JIRA (v6.3.4#6332)
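For readers unfamiliar with the haveC1Bytes condition quoted above, here is a minimal sketch of the idea. It assumes the flag means "the input contains at least one byte in the range 0x80 to 0x9F"; ISO-8859-1 assigns that range to C1 control codes, whereas windows-1252 uses most of it for printable characters such as curly quotes. The class and method names are hypothetical and are not the ICU4J or Tika code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the name choice; not the ICU4J/Tika implementation.
class C1BytesSketch {
    // True if any byte falls in 0x80..0x9F (the C1 range in ISO-8859-1).
    static boolean hasC1Bytes(byte[] input) {
        for (byte b : input) {
            int v = b & 0xFF; // treat the byte as unsigned
            if (v >= 0x80 && v <= 0x9F) {
                return true;
            }
        }
        return false;
    }

    // Mirrors the quoted expression: haveC1Bytes ? "windows-1252" : "ISO-8859-1".
    static String nameFor(byte[] input) {
        return hasC1Bytes(input) ? "windows-1252" : "ISO-8859-1";
    }

    public static void main(String[] args) {
        // 0x93 and 0x94 are curly double quotes in windows-1252.
        byte[] quoted = {(byte) 0x93, 0x68, 0x69, (byte) 0x94};
        System.out.println(nameFor(quoted));
        // 0xE9 (e with acute accent) is outside the C1 range.
        System.out.println(nameFor("caf\u00e9".getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```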
[jira] [Assigned] (TIKA-2190) Add "preserve_interword_spaces" option of tesseract
[ https://issues.apache.org/jira/browse/TIKA-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reassigned TIKA-2190: - Assignee: Tim Allison > Add "preserve_interword_spaces" option of tesseract > --- > > Key: TIKA-2190 > URL: https://issues.apache.org/jira/browse/TIKA-2190 > Project: Tika > Issue Type: Improvement > Components: ocr >Reporter: Bipul Kumar >Assignee: Tim Allison > > This option will preserve the spaces for TXT output type so that the layout > or context can be inferred while further parsing. > to enable :: -c preserve_interword_spaces=1 > to disable :: -c preserve_interword_spaces=0 or simply don't mention -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset
[ https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765144#comment-15765144 ] Hudson commented on TIKA-2219: -- FAILURE: Integrated in Jenkins build tika-2.x-windows #85 (See [https://builds.apache.org/job/tika-2.x-windows/85/]) TIKA-2219 make sure to transmit charset name in detectAll via Pascal (tallison: rev 68f3058643756d8e08f85903a585684f7d0f0b20) * (edit) tika-parser-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/txt/CharsetDetector.java * (edit) tika-parser-modules/tika-parser-text-module/src/test/java/org/apache/tika/parser/txt/CharsetDetectorTest.java * (add) tika-test-resources/src/test/resources/test-documents/testTXT_win-1252.txt > CharsetDetector no longer detects windows-1252 charset > -- > > Key: TIKA-2219 > URL: https://issues.apache.org/jira/browse/TIKA-2219 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.14 > Environment: Any. >Reporter: Pascal Essiembre >Priority: Minor > Fix For: 2.0, 1.15 > > > Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is > always detected instead. While not tested, this likely affects other > windows-125* encodings as well. > I tracked it down to a change in the > {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method. Now it always > returns "ISO-8859-1" whereas before it was: {{return haveC1Bytes ? > "windows-1252" : "ISO-8859-1";}} > Now that condition has been moved to the {{match(CharsetDetector det)}} > method so that the returned CharsetMatch has the proper name. The problem > with that is {{CharsetDetector#detectAll()}} method overwrites the correct > match with a new one that will return the value of {{#getName()}} from the > {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case). 
> There might be legitimate reasons why the {{CharsetMatch}} instances in > {{detectAll()}} method are replaced with new ones, but changing this code in > that method appears to work for me: > // Remove this: > //CharsetMatch m = new CharsetMatch(this, csr, > confidence); > //matches.add(m); > // Add this instead: > matches.add(charsetMatch); -- This message was sent by Atlassian JIRA (v6.3.4#6332)
tika-2.x-windows - Build # 85 - Still Failing
The Apache Jenkins build system has built tika-2.x-windows (build #85) Status: Still Failing Check console output at https://builds.apache.org/job/tika-2.x-windows/85/ to view the results.
[jira] [Commented] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset
[ https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765129#comment-15765129 ] Hudson commented on TIKA-2219: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1163 (See [https://builds.apache.org/job/Tika-trunk/1163/]) TIKA-2219 - make sure to transmit encoding name in detectAll() via (tallison: rev 2dbd65122a477c1b7a61d88e4fdf25a3d47effcd) * (edit) tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetDetector.java * (add) tika-parsers/src/test/resources/test-documents/testTXT_win-1252.txt * (edit) tika-parsers/src/test/java/org/apache/tika/parser/txt/CharsetDetectorTest.java > CharsetDetector no longer detects windows-1252 charset > -- > > Key: TIKA-2219 > URL: https://issues.apache.org/jira/browse/TIKA-2219 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.14 > Environment: Any. >Reporter: Pascal Essiembre >Priority: Minor > Fix For: 2.0, 1.15 > > > Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is > always detected instead. While not tested, this likely affects other > windows-125* encodings as well. > I tracked it down to a change in the > {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method. Now it always > returns "ISO-8859-1" whereas before it was: {{return haveC1Bytes ? > "windows-1252" : "ISO-8859-1";}} > Now that condition has been moved to the {{match(CharsetDetector det)}} > method so that the returned CharsetMatch has the proper name. The problem > with that is {{CharsetDetector#detectAll()}} method overwrites the correct > match with a new one that will return the value of {{#getName()}} from the > {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case). 
> There might be legitimate reasons why the {{CharsetMatch}} instances in > {{detectAll()}} method are replaced with new ones, but changing this code in > that method appears to work for me: > // Remove this: > //CharsetMatch m = new CharsetMatch(this, csr, > confidence); > //matches.add(m); > // Add this instead: > matches.add(charsetMatch); -- This message was sent by Atlassian JIRA (v6.3.4#6332)
tika-2.x - Build # 183 - Failure
The Apache Jenkins build system has built tika-2.x (build #183) Status: Failure Check console output at https://builds.apache.org/job/tika-2.x/183/ to view the results.
[jira] [Commented] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset
[ https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765059#comment-15765059 ] Pascal Essiembre commented on TIKA-2219: BTW, I tested and can confirm your fix works just fine. > CharsetDetector no longer detects windows-1252 charset > -- > > Key: TIKA-2219 > URL: https://issues.apache.org/jira/browse/TIKA-2219 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.14 > Environment: Any. >Reporter: Pascal Essiembre >Priority: Minor > Fix For: 2.0, 1.15 > > > Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is > always detected instead. While not tested, this likely affects other > windows-125* encodings as well. > I tracked it down to a change in the > {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method. Now it always > returns "ISO-8859-1" whereas before it was: {{return haveC1Bytes ? > "windows-1252" : "ISO-8859-1";}} > Now that condition has been moved to the {{match(CharsetDetector det)}} > method so that the returned CharsetMatch has the proper name. The problem > with that is {{CharsetDetector#detectAll()}} method overwrites the correct > match with a new one that will return the value of {{#getName()}} from the > {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case). > There might be legitimate reasons why the {{CharsetMatch}} instances in > {{detectAll()}} method are replaced with new ones, but changing this code in > that method appears to work for me: > // Remove this: > //CharsetMatch m = new CharsetMatch(this, csr, > confidence); > //matches.add(m); > // Add this instead: > matches.add(charsetMatch); -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset
[ https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765034#comment-15765034 ] Pascal Essiembre commented on TIKA-2219: I am relying on CharsetDetector. Thanks for the fix! > CharsetDetector no longer detects windows-1252 charset > -- > > Key: TIKA-2219 > URL: https://issues.apache.org/jira/browse/TIKA-2219 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.14 > Environment: Any. >Reporter: Pascal Essiembre >Priority: Minor > Fix For: 2.0, 1.15 > > > Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is > always detected instead. While not tested, this likely affects other > windows-125* encodings as well. > I tracked it down to a change in the > {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method. Now it always > returns "ISO-8859-1" whereas before it was: {{return haveC1Bytes ? > "windows-1252" : "ISO-8859-1";}} > Now that condition has been moved to the {{match(CharsetDetector det)}} > method so that the returned CharsetMatch has the proper name. The problem > with that is {{CharsetDetector#detectAll()}} method overwrites the correct > match with a new one that will return the value of {{#getName()}} from the > {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case). > There might be legitimate reasons why the {{CharsetMatch}} instances in > {{detectAll()}} method are replaced with new ones, but changing this code in > that method appears to work for me: > // Remove this: > //CharsetMatch m = new CharsetMatch(this, csr, > confidence); > //matches.add(m); > // Add this instead: > matches.add(charsetMatch); -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset
[ https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2219. --- Resolution: Fixed Fix Version/s: 1.15 2.0 > CharsetDetector no longer detects windows-1252 charset > -- > > Key: TIKA-2219 > URL: https://issues.apache.org/jira/browse/TIKA-2219 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.14 > Environment: Any. >Reporter: Pascal Essiembre >Priority: Minor > Fix For: 2.0, 1.15 > > > Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is > always detected instead. While not tested, this likely affects other > windows-125* encodings as well. > I tracked it down to a change in the > {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method. Now it always > returns "ISO-8859-1" whereas before it was: {{return haveC1Bytes ? > "windows-1252" : "ISO-8859-1";}} > Now that condition has been moved to the {{match(CharsetDetector det)}} > method so that the returned CharsetMatch has the proper name. The problem > with that is {{CharsetDetector#detectAll()}} method overwrites the correct > match with a new one that will return the value of {{#getName()}} from the > {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case). > There might be legitimate reasons why the {{CharsetMatch}} instances in > {{detectAll()}} method are replaced with new ones, but changing this code in > that method appears to work for me: > // Remove this: > //CharsetMatch m = new CharsetMatch(this, csr, > confidence); > //matches.add(m); > // Add this instead: > matches.add(charsetMatch); -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2221) poi.EncryptedDocumentException not wrapped in tika.exception.EncryptedDocumentException
[ https://issues.apache.org/jira/browse/TIKA-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765007#comment-15765007 ] Hudson commented on TIKA-2221: -- FAILURE: Integrated in Jenkins build tika-2.x-windows #84 (See [https://builds.apache.org/job/tika-2.x-windows/84/]) TIKA-2221 -- correctly catch and rethrow encrypted document exception (tallison: rev ee761ac00c1dcc80f6c4030fe81a8780c5ac9d7e) * (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java * (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java > poi.EncryptedDocumentException not wrapped in > tika.exception.EncryptedDocumentException > --- > > Key: TIKA-2221 > URL: https://issues.apache.org/jira/browse/TIKA-2221 > Project: Tika > Issue Type: Bug >Affects Versions: 1.14 >Reporter: Matthew Caruana Galizia >Priority: Minor > Labels: encryption, office, poi > Fix For: 2.0, 1.15 > > > When parsing an encrypted Word document, a > org.apache.poi.EncryptedDocumentException is thrown at > WordExtractor.java#151. Tika catches this too far up the stack and > incorrectly wraps it in a plain TikaException instead of a > org.apache.tika.exception.EncryptedDocumentException. > The fix would be to catch and wrap the exception correctly, for example: > {noformat} > try { > document = new HWPFDocument(root); > } catch (org.apache.poi.EncryptedDocumentException e) { > throw new EncryptedDocumentException(e); > } catch (OldWordFileFormatException e) { > parseWord6(root, xhtml); > return; > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset
[ https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765010#comment-15765010 ] Tim Allison commented on TIKA-2219: --- Y, your diagnosis is correct. Thank you. So that we capture the updated confidence, I used the fuller constructor: {noformat} CharsetMatch m = new CharsetMatch(this, csr, confidence, charsetMatch.getName(), charsetMatch.getLanguage()); {noformat} Out of curiosity are you using the standard detectors in order, or are you only using CharsetDetector? I found that the UniversalEncodingDetector was applying ISO-8859-1 to the file I created to trigger "windows-1252" in CharsetDetector. > CharsetDetector no longer detects windows-1252 charset > -- > > Key: TIKA-2219 > URL: https://issues.apache.org/jira/browse/TIKA-2219 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.14 > Environment: Any. >Reporter: Pascal Essiembre >Priority: Minor > > Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is > always detected instead. While not tested, this likely affects other > windows-125* encodings as well. > I tracked it down to a change in the > {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method. Now it always > returns "ISO-8859-1" whereas before it was: {{return haveC1Bytes ? > "windows-1252" : "ISO-8859-1";}} > Now that condition has been moved to the {{match(CharsetDetector det)}} > method so that the returned CharsetMatch has the proper name. The problem > with that is {{CharsetDetector#detectAll()}} method overwrites the correct > match with a new one that will return the value of {{#getName()}} from the > {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case). 
> There might be legitimate reasons why the {{CharsetMatch}} instances in > {{detectAll()}} method are replaced with new ones, but changing this code in > that method appears to work for me: > // Remove this: > //CharsetMatch m = new CharsetMatch(this, csr, > confidence); > //matches.add(m); > // Add this instead: > matches.add(charsetMatch); -- This message was sent by Atlassian JIRA (v6.3.4#6332)
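The difference between the original detectAll() behaviour and the fix described in the comment above can be modelled with the simplified sketch below. Match and Latin1Recognizer are hypothetical stand-ins for CharsetMatch and the recognizer, and the confidence adjustment is an invented placeholder for whatever detectAll() does to the score; only the shape of the bug (the name being re-read from the recognizer) and of the fix (the name being carried over from the recognizer's match) follows the thread.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the TIKA-2219 bug and fix; all types are hypothetical stand-ins.
class DetectAllSketch {
    static class Match {
        final String name;
        final int confidence;
        Match(String name, int confidence) {
            this.name = name;
            this.confidence = confidence;
        }
    }

    static class Latin1Recognizer {
        // After the upgrade, the recognizer-level name is fixed.
        String getName() {
            return "ISO-8859-1";
        }

        // The per-input match carries the correct name.
        Match match(boolean haveC1Bytes) {
            return new Match(haveC1Bytes ? "windows-1252" : "ISO-8859-1", 50);
        }
    }

    // Placeholder for the confidence update performed in detectAll().
    static int adjust(int confidence) {
        return Math.min(100, confidence + 10);
    }

    // Buggy shape: a new match is built from the recognizer, so the name is lost.
    static List<Match> detectAllBuggy(Latin1Recognizer csr, boolean haveC1Bytes) {
        List<Match> matches = new ArrayList<>();
        Match found = csr.match(haveC1Bytes);
        matches.add(new Match(csr.getName(), adjust(found.confidence)));
        return matches;
    }

    // Fixed shape: a new match keeps the updated confidence AND the found name.
    static List<Match> detectAllFixed(Latin1Recognizer csr, boolean haveC1Bytes) {
        List<Match> matches = new ArrayList<>();
        Match found = csr.match(haveC1Bytes);
        matches.add(new Match(found.name, adjust(found.confidence)));
        return matches;
    }

    public static void main(String[] args) {
        Latin1Recognizer csr = new Latin1Recognizer();
        System.out.println(detectAllBuggy(csr, true).get(0).name);
        System.out.println(detectAllFixed(csr, true).get(0).name);
    }
}
```

Compared with simply adding the recognizer's own match to the list, rebuilding the match with the name passed through keeps any score changes made in detectAll(), which is the reason given in the comment for using the fuller constructor.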
tika-2.x-windows - Build # 84 - Still Failing
The Apache Jenkins build system has built tika-2.x-windows (build #84) Status: Still Failing Check console output at https://builds.apache.org/job/tika-2.x-windows/84/ to view the results.
[jira] [Commented] (TIKA-2221) poi.EncryptedDocumentException not wrapped in tika.exception.EncryptedDocumentException
[ https://issues.apache.org/jira/browse/TIKA-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15764968#comment-15764968 ] Hudson commented on TIKA-2221: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1162 (See [https://builds.apache.org/job/Tika-trunk/1162/]) TIKA-2221 -- correctly catch and convert encrypted document exception to (tallison: rev c62410443ca88f8118f50e6ee521a13a22f64729) * (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java * (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java > poi.EncryptedDocumentException not wrapped in > tika.exception.EncryptedDocumentException > --- > > Key: TIKA-2221 > URL: https://issues.apache.org/jira/browse/TIKA-2221 > Project: Tika > Issue Type: Bug >Affects Versions: 1.14 >Reporter: Matthew Caruana Galizia >Priority: Minor > Labels: encryption, office, poi > Fix For: 2.0, 1.15 > > > When parsing an encrypted Word document, a > org.apache.poi.EncryptedDocumentException is thrown at > WordExtractor.java#151. Tika catches this too far up the stack and > incorrectly wraps it in a plain TikaException instead of a > org.apache.tika.exception.EncryptedDocumentException. > The fix would be to catch and wrap the exception correctly, for example: > {noformat} > try { > document = new HWPFDocument(root); > } catch (org.apache.poi.EncryptedDocumentException e) { > throw new EncryptedDocumentException(e); > } catch (OldWordFileFormatException e) { > parseWord6(root, xhtml); > return; > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2220) Refactor/merge new experimental docx/pptx components
[ https://issues.apache.org/jira/browse/TIKA-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15764967#comment-15764967 ] Hudson commented on TIKA-2220: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1162 (See [https://builds.apache.org/job/Tika-trunk/1162/]) TIKA-2220 - refactor new sax pptx and docx to reduce code duplication. (tallison: rev 376318fc1b34014ec31d5fbfdfa962183ea8c717) * (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/SXWPFExtractorTest.java * (delete) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xslf/XSLFTikaBodyPartHandler.java * (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFEventBasedWordExtractor.java * (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/Word2006MLDocHandler.java * (delete) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractDocumentXMLBodyHandler.java * (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xslf/XSLFEventBasedPowerPointExtractor.java * (delete) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFTikaBodyPartHandler.java * (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXWPFWordExtractorDecorator.java * (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLTikaBodyPartHandler.java * (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXSLFPowerPointExtractorDecorator.java * (delete) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xslf/XSLFDocumentXMLBodyHandler.java * (delete) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/BodyPartHandler.java * (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/WordAndPowerPointTextPartHandler.java * (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFListManager.java * (add) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLWordAndPowerPointTextHandler.java * (delete) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFDocumentXMLBodyHandler.java * (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFStylesShim.java > Refactor/merge new experimental docx/pptx components > > > Key: TIKA-2220 > URL: https://issues.apache.org/jira/browse/TIKA-2220 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Trivial > Fix For: 1.15 > > > We can get rid of a fair amount of duplicate code by merging the docx and > pptx SAX handlers. If we find significant differences in future desired > functionality, we can split them back out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-2221) poi.EncryptedDocumentException not wrapped in tika.exception.EncryptedDocumentException
[ https://issues.apache.org/jira/browse/TIKA-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2221. --- Resolution: Fixed Fix Version/s: 2.0 Thank you! > poi.EncryptedDocumentException not wrapped in > tika.exception.EncryptedDocumentException > --- > > Key: TIKA-2221 > URL: https://issues.apache.org/jira/browse/TIKA-2221 > Project: Tika > Issue Type: Bug >Affects Versions: 1.14 >Reporter: Matthew Caruana Galizia >Priority: Minor > Labels: encryption, office, poi > Fix For: 2.0, 1.15 > > > When parsing an encrypted Word document, a > org.apache.poi.EncryptedDocumentException is thrown at > WordExtractor.java#151. Tika catches this too far up the stack and > incorrectly wraps it in a plain TikaException instead of a > org.apache.tika.exception.EncryptedDocumentException. > The fix would be to catch and wrap the exception correctly, for example: > {noformat} > try { > document = new HWPFDocument(root); > } catch (org.apache.poi.EncryptedDocumentException e) { > throw new EncryptedDocumentException(e); > } catch (OldWordFileFormatException e) { > parseWord6(root, xhtml); > return; > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-2220) Refactor/merge new experimental docx/pptx components
[ https://issues.apache.org/jira/browse/TIKA-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2220. --- Resolution: Fixed Fix Version/s: 1.15 We may want to split these out again in the future... > Refactor/merge new experimental docx/pptx components > > > Key: TIKA-2220 > URL: https://issues.apache.org/jira/browse/TIKA-2220 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Trivial > Fix For: 1.15 > > > We can get rid of a fair amount of duplicate code by merging the docx and > pptx SAX handlers. If we find significant differences in future desired > functionality, we can split them back out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-2221) poi.EncryptedDocumentException not wrapped in tika.exception.EncryptedDocumentException
Matthew Caruana Galizia created TIKA-2221: - Summary: poi.EncryptedDocumentException not wrapped in tika.exception.EncryptedDocumentException Key: TIKA-2221 URL: https://issues.apache.org/jira/browse/TIKA-2221 Project: Tika Issue Type: Bug Affects Versions: 1.14 Reporter: Matthew Caruana Galizia Priority: Minor Fix For: 1.15 When parsing an encrypted Word document, a org.apache.poi.EncryptedDocumentException is thrown at WordExtractor.java#151. Tika catches this too far up the stack and incorrectly wraps it in a plain TikaException instead of a org.apache.tika.exception.EncryptedDocumentException. The fix would be to catch and wrap the exception correctly, for example: {noformat} try { document = new HWPFDocument(root); } catch (org.apache.poi.EncryptedDocumentException e) { throw new EncryptedDocumentException(e); } catch (OldWordFileFormatException e) { parseWord6(root, xhtml); return; } {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Apache Tika issue review (TIKA-2190 & TIKA-2189)
Moving dev-owner to BCC. I think you meant to send this to dev@tika.apache.org, so sending there :) From: Bipul Kumar Date: Tuesday, December 20, 2016 at 1:54 AM To: "dev-ow...@tika.apache.org", "talli...@mitre.org", "talli...@apache.org" Subject: Apache Tika issue review (TIKA-2190 & TIKA-2189) Hi, I have raised two issues, TIKA-2190 and TIKA-2189, regarding my observations while working on the Tesseract OCR parser. Please review and let me know. Regards Bipul
[jira] [Created] (TIKA-2220) Refactor/merge new experimental docx/pptx components
Tim Allison created TIKA-2220: - Summary: Refactor/merge new experimental docx/pptx components Key: TIKA-2220 URL: https://issues.apache.org/jira/browse/TIKA-2220 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Trivial We can get rid of a fair amount of duplicate code by merging the docx and pptx SAX handlers. If we find significant differences in future desired functionality, we can split them back out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset
[ https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15764363#comment-15764363 ] Tim Allison commented on TIKA-2219: --- Thank you for opening this. This was caused by our "upgrade" to our copy of ICU4J (TIKA-2041). I'll take a look. > CharsetDetector no longer detects windows-1252 charset > -- > > Key: TIKA-2219 > URL: https://issues.apache.org/jira/browse/TIKA-2219 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.14 > Environment: Any. >Reporter: Pascal Essiembre >Priority: Minor > > Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is > always detected instead. While not tested, this likely affects other > windows-125* encodings as well. > I tracked it down to a change in the > {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method. Now it always > returns "ISO-8859-1" whereas before it was: {{return haveC1Bytes ? > "windows-1252" : "ISO-8859-1";}} > Now that condition has been moved to the {{match(CharsetDetector det)}} > method so that the returned CharsetMatch has the proper name. The problem > with that is {{CharsetDetector#detectAll()}} method overwrites the correct > match with a new one that will return the value of {{#getName()}} from the > {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case). > There might be legitimate reasons why the {{CharsetMatch}} instances in > {{detectAll()}} method are replaced with new ones, but changing this code in > that method appears to work for me: > // Remove this: > //CharsetMatch m = new CharsetMatch(this, csr, > confidence); > //matches.add(m); > // Add this instead: > matches.add(charsetMatch); -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2201) OutOfMemoryError on a reasonably sized document
[ https://issues.apache.org/jira/browse/TIKA-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15764283#comment-15764283 ] Tim Allison commented on TIKA-2201: --- I saved a single slide from the test document, and I'm getting an OOM with the DOM parser if -Xmx is < 500m. Given that the DOM parser holds the entire deck in memory, that'd require ~60 GB of memory. Whoa. We could try lazy loading of slides in the DOM parser in POI, but I think SAX is the way to go for Tika. > OutOfMemoryError on a reasonably sized document > --- > > Key: TIKA-2201 > URL: https://issues.apache.org/jira/browse/TIKA-2201 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.14 > Environment: Windows 7 x64, JVM 1.8.0_101 >Reporter: Seva Alekseyev > > The following document, which is not particularly big, causes an OOM in Tika > parser: > https://dl.dropboxusercontent.com/u/92341073/Certificates-9-20-2013.pptx > Java memory limit is 4GB. -- This message was sent by Atlassian JIRA (v6.3.4#6332)