date:20161220

[jira] [Commented] (TIKA-1946) Add mime detection and parser for WordPerfect

2016-12-20 Thread Pascal Essiembre (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15766083#comment-15766083
 ] 

Pascal Essiembre commented on TIKA-1946:


It now throws a TikaException as you suggested.  For child mime-types, I am not 
sure what they would be. Given that different QuattroPro formats seem to share 
the same mime-type, would we come up with some? I am not sure what the general 
practice is in this case.

> Add mime detection and parser for WordPerfect
> -
>
> Key: TIKA-1946
> URL: https://issues.apache.org/jira/browse/TIKA-1946
> Project: Tika
>  Issue Type: Improvement
>  Components: mime, parser
>Reporter: Nick C
>
> I noticed some code on github for parsing WordPerfect files 
> (https://github.com/Norconex/importer) Also looks like the author 
> [~pascal.essiembre] has contributed to Tika before



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (TIKA-1788) message/rfc822 parser doesn't identify attachment filenames from Content-Disposition header

2016-12-20 Thread Derek Hardison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15766074#comment-15766074
 ] 

Derek Hardison edited comment on TIKA-1788 at 12/21/16 4:24 AM:


The * indicates the name is wrapped, and there can be any number of 
continuations, e.g. filename*0, filename*1, etc., which end up being 
concatenated into a single filename. There are also some rules on where the 
content-type etc. is located when that happens.  You can find examples fairly 
easily, e.g. in Exchange or Gmail e-mails.

You can find some information about it here:

- https://tools.ietf.org/html/rfc2231#section-4
- https://tools.ietf.org/html/rfc5987#section-3.2.2
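
As a rough illustration of the two forms described in RFC 2231 (continuations 
in section 3, charset/language values in section 4), here is a minimal Java 
sketch. The class Rfc2231Filename and its methods are hypothetical names made 
up for this example, not Tika or mime4j API, and the input is assumed to be an 
already-parsed map of Content-Disposition parameters. The combined form 
(filename*0*, encoded continuations) is deliberately left out.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Hypothetical helper (not Tika API): rebuilds a filename from RFC 2231 parameters. */
final class Rfc2231Filename {

    /** Decodes %XX escapes into raw bytes, then into text using the given charset. */
    static String percentDecode(String value, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == '%' && i + 2 < value.length()) {
                out.write(Integer.parseInt(value.substring(i + 1, i + 3), 16));
                i += 2; // skip the two hex digits just consumed
            } else {
                out.write(c); // parameter values are expected to be ASCII here
            }
        }
        return new String(out.toByteArray(), charset);
    }

    /** Decodes an extended value of the form charset'language'percent-encoded-text. */
    static String decodeExtended(String value) {
        int first = value.indexOf('\'');
        int second = value.indexOf('\'', first + 1);
        if (first < 0 || second < 0) {
            return value; // not in extended form; return unchanged
        }
        String charsetName = value.substring(0, first);
        Charset charset = charsetName.isEmpty()
                ? StandardCharsets.US_ASCII
                : Charset.forName(charsetName);
        return percentDecode(value.substring(second + 1), charset);
    }

    /** Resolves the filename from already-parsed Content-Disposition parameters. */
    static String filename(Map<String, String> params) {
        String extended = params.get("filename*");
        if (extended != null) {
            return decodeExtended(extended);
        }
        // Plain continuations: filename*0, filename*1, ... joined in numeric order.
        StringBuilder joined = new StringBuilder();
        for (int i = 0; params.containsKey("filename*" + i); i++) {
            joined.append(params.get("filename*" + i));
        }
        return joined.length() > 0 ? joined.toString() : params.get("filename");
    }
}
```

With the header from the issue description, the parameter map would hold 
filename* = utf-8''image001.jpg, and filename(...) would return image001.jpg.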


was (Author: derek.hardison):
The * indicates the name is wrapped and it can be any number of 
continuations... i.e. filename*0, filename*1, etc.

You can find some information about it here;

- https://tools.ietf.org/html/rfc2231#section-4
- https://tools.ietf.org/html/rfc5987#section-3.2.2

> message/rfc822 parser doesn't identify attachment filenames from 
> Content-Disposition header
> ---
>
> Key: TIKA-1788
> URL: https://issues.apache.org/jira/browse/TIKA-1788
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.11
>Reporter: Sergey Tsalkov
> Attachments: grep_content_disposition.zip
>
>
> rfc822 email files can contain attachments as subparts, and they'll
> generally specify the filename of the attachment in a manner like
> this:
> Content-Disposition: attachment;
> filename*=utf-8''image001.jpg
> Tika doesn't seem to be grabbing that information at all!




[jira] [Commented] (TIKA-1788) message/rfc822 parser doesn't identify attachment filenames from Content-Disposition header

2016-12-20 Thread Derek Hardison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15766074#comment-15766074
 ] 

Derek Hardison commented on TIKA-1788:
--

The * indicates the name is wrapped and it can be any number of 
continuations... i.e. filename*0, filename*1, etc.

You can find some information about it here;

- https://tools.ietf.org/html/rfc2231#section-4
- https://tools.ietf.org/html/rfc5987#section-3.2.2


[jira] [Closed] (TIKA-2094) Error parsing .doc file with visio embed

2016-12-20 Thread wangruochan (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangruochan closed TIKA-2094.
-

 verified as complete

> Error parsing .doc file with visio embed
> 
>
> Key: TIKA-2094
> URL: https://issues.apache.org/jira/browse/TIKA-2094
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.13
> Environment: JDK7
>Reporter: wangruochan
> Attachments: testtika.doc, testtika.doc
>
>
> When I try to parse a .doc file with an embedded Visio document, an 
> exception occurs. The stack trace is printed below:
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> com/microsoft/schemas/office/visio/x2012/main/ConnectsType
>   at 
> com.microsoft.schemas.office.visio.x2012.main.impl.PageContentsTypeImpl.getConnects(Unknown
>  Source)
>   at 
> org.apache.poi.xdgf.usermodel.XDGFBaseContents.onDocumentRead(XDGFBaseContents.java:89)
>   at 
> org.apache.poi.xdgf.usermodel.XDGFPageContents.onDocumentRead(XDGFPageContents.java:73)
>   at 
> org.apache.poi.xdgf.usermodel.XDGFPages.onDocumentRead(XDGFPages.java:94)
>   at 
> org.apache.poi.xdgf.usermodel.XmlVisioDocument.onDocumentRead(XmlVisioDocument.java:108)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
>   at 
> org.apache.poi.xdgf.usermodel.XmlVisioDocument.<init>(XmlVisioDocument.java:79)
>   at 
> org.apache.poi.xdgf.extractor.XDGFVisioExtractor.<init>(XDGFVisioExtractor.java:41)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:212)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at 
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>   at 
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
>   at 
> org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:140)
>   at 
> org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:164)
>   at 
> org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:208)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at test.apache.tika.Test.main(Test.java:29)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
> Caused by: java.lang.ClassNotFoundException: 
> com.microsoft.schemas.office.visio.x2012.main.ConnectsType
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   ... 30 more
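
The trace shows that the JVM could not find the ConnectsType class at runtime. 
One quick way to confirm whether a given class is visible on the classpath is a 
check like the following sketch. ClasspathCheck is a hypothetical helper 
written for this example; it says nothing about which jar should provide the 
class.

```java
/** Hypothetical diagnostic helper: reports whether a class can be loaded. */
final class ClasspathCheck {

    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "com.microsoft.schemas.office.visio.x2012.main.ConnectsType";
        System.out.println(name + " loadable: " + isLoadable(name));
    }
}
```

Running this with the same classpath as the failing program shows whether the 
class named in the trace is present.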




[jira] [Commented] (TIKA-2190) Add "preserve_interword_spaces" option of tesseract

2016-12-20 Thread Bipul Kumar (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765820#comment-15765820
 ] 

Bipul Kumar commented on TIKA-2190:
---

Hi Tim,

If that is okay with you, may I take this up? I want to start contributing
and I can take this on.

Regards
Bipul




> Add "preserve_interword_spaces" option of tesseract
> ---
>
> Key: TIKA-2190
> URL: https://issues.apache.org/jira/browse/TIKA-2190
> Project: Tika
>  Issue Type: Improvement
>  Components: ocr
>Reporter: Bipul Kumar
>Assignee: Tim Allison
> Fix For: 2.0, 1.15
>
>
> This option preserves inter-word spaces in the TXT output type so that the 
> layout or context can be inferred during further parsing. 
> To enable: -c preserve_interword_spaces=1
> To disable: -c preserve_interword_spaces=0, or simply omit the option
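
For illustration, a minimal sketch of how such a flag might be appended to a 
tesseract command line. TesseractCommand and its build method are hypothetical 
names for this example and are not how TesseractOCRParser is implemented; the 
only thing taken from the description is the "-c preserve_interword_spaces=1" 
argument pair.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: assembles a tesseract command line as an argument list. */
final class TesseractCommand {

    static List<String> build(String inputImage, String outputBase,
                              boolean preserveInterwordSpaces) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(inputImage);
        cmd.add(outputBase);
        if (preserveInterwordSpaces) {
            // -c sets a tesseract config variable for this run
            cmd.add("-c");
            cmd.add("preserve_interword_spaces=1");
        }
        return cmd;
    }
}
```

The resulting list could be handed to java.lang.ProcessBuilder; when the flag 
is false the option is simply omitted, which matches the "disable" case above.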




[jira] [Commented] (TIKA-1946) Add mime detection and parser for WordPerfect

2016-12-20 Thread Luis Filipe Nassif (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765696#comment-15765696
 ] 

Luis Filipe Nassif commented on TIKA-1946:
--

Thank you, Pascal!

I think it may be better to throw a TikaException when parsing unsupported 
files, so client code will know about it and can take other action, e.g. run a 
fallback parser like Latin1StringsParser.

If the files have different magic it would be better to break the mimetype into 
child ones and configure the parser only with the supported child mimetype.
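
A minimal sketch of the client-side pattern I have in mind. To keep it 
self-contained it uses stand-in types: UnsupportedFormatException plays the 
role of TikaException (the real one is org.apache.tika.exception.TikaException) 
and SimpleParser is a hypothetical one-method interface, not Tika's Parser.

```java
/** Hypothetical stand-ins (not Tika API) showing the try-then-fallback pattern. */
final class FallbackParsing {

    /** Stand-in for TikaException: signals that the parser cannot handle the input. */
    static class UnsupportedFormatException extends Exception {
        UnsupportedFormatException(String message) {
            super(message);
        }
    }

    /** Simplified parser contract: bytes in, extracted text out. */
    interface SimpleParser {
        String parse(byte[] data) throws UnsupportedFormatException;
    }

    /** Tries the primary parser; if it reports an unsupported format, runs the fallback. */
    static String parseWithFallback(byte[] data, SimpleParser primary, SimpleParser fallback)
            throws UnsupportedFormatException {
        try {
            return primary.parse(data);
        } catch (UnsupportedFormatException e) {
            // Only possible because the primary parser threw instead of just logging.
            return fallback.parse(data);
        }
    }
}
```

If the parser only logs a message and returns normally, the caller has no way 
to tell that nothing was extracted, so the fallback branch is never reached.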


[jira] [Commented] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset

2016-12-20 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765514#comment-15765514
 ] 

Hudson commented on TIKA-2219:
--

SUCCESS: Integrated in Jenkins build tika-2.x #185 (See 
[https://builds.apache.org/job/tika-2.x/185/])
TIKA-2219 make sure to transmit charset name in detectAll via Pascal (tallison: 
rev 54154e0045066dfb50a10d158090262acaabaaba)
* (edit) 
tika-parser-modules/tika-parser-text-module/src/test/java/org/apache/tika/parser/txt/CharsetDetectorTest.java


> CharsetDetector no longer detects windows-1252 charset
> --
>
> Key: TIKA-2219
> URL: https://issues.apache.org/jira/browse/TIKA-2219
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Any.
>Reporter: Pascal Essiembre
>Priority: Minor
> Fix For: 2.0, 1.15
>
>
> Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is 
> always detected instead.  While not tested, this likely affects other 
> windows-125* encodings as well.
> I tracked it down to a change in the 
> {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method.  Now it always 
> returns "ISO-8859-1" whereas before it was: {{return haveC1Bytes ? 
> "windows-1252" : "ISO-8859-1";}}
> Now that condition has been moved to the {{match(CharsetDetector det)}} 
> method so that the returned CharsetMatch has the proper name.  The problem 
> with that is {{CharsetDetector#detectAll()}} method overwrites the correct 
> match with a new one that will return the value of {{#getName()}}  from the 
> {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case).
> There might be legitimate reasons why the {{CharsetMatch}} instances in 
> {{detectAll()}} method are replaced with new ones, but changing this code in 
> that method appears to work for me:
> // Remove this:
> //CharsetMatch m = new CharsetMatch(this, csr, confidence);
> //matches.add(m);
> // Add this instead:
> matches.add(charsetMatch);
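
For readers unfamiliar with the haveC1Bytes condition: bytes 0x80-0x9F are C1 
control codes in ISO-8859-1 but printable characters (curly quotes, the euro 
sign, and so on) in windows-1252, so their presence is a strong hint for the 
latter. A minimal sketch of that rule follows; Latin1Name is a hypothetical 
class written for this example, not the ICU or Tika recognizer.

```java
/** Hypothetical sketch of the C1-byte rule; not the actual recognizer code. */
final class Latin1Name {

    /** Returns true if any byte falls in the C1 range 0x80-0x9F. */
    static boolean hasC1Bytes(byte[] input) {
        for (byte b : input) {
            int v = b & 0xFF; // treat the byte as unsigned
            if (v >= 0x80 && v <= 0x9F) {
                return true;
            }
        }
        return false;
    }

    /** Mirrors the expression quoted in the description above. */
    static String nameFor(byte[] input) {
        return hasC1Bytes(input) ? "windows-1252" : "ISO-8859-1";
    }
}
```

For example, input containing 0x93 (a left curly quote in windows-1252) yields 
"windows-1252", while input containing only 0xE9 yields "ISO-8859-1".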




[jira] [Commented] (TIKA-2189) Default value mismatch for "enableImageProcessing" in TesseractOCRConfig.properties and TesseractOCRConfig.java

2016-12-20 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765501#comment-15765501
 ] 

Hudson commented on TIKA-2189:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1164 (See 
[https://builds.apache.org/job/Tika-trunk/1164/])
[TIKA-2189] fix for Default value mismatch for "enableImageProcessing" 
(kumarbipuldas: rev 40401e51b8a135634f54c8c437dafc10378059be)
* (edit) 
tika-parsers/src/main/resources/org/apache/tika/parser/ocr/TesseractOCRConfig.properties
add comment on outputType and trigger close of TIKA-2189. This closes 
(tallison: rev aa2407a6ba9826684df81af60c58b476e193a830)
* (edit) 
tika-parsers/src/main/resources/org/apache/tika/parser/ocr/TesseractOCRConfig.properties


> Default value mismatch for "enableImageProcessing" in 
> TesseractOCRConfig.properties and TesseractOCRConfig.java 
> 
>
> Key: TIKA-2189
> URL: https://issues.apache.org/jira/browse/TIKA-2189
> Project: Tika
>  Issue Type: Bug
>  Components: ocr, parser
>Affects Versions: 1.14
>Reporter: Bipul Kumar
>Priority: Minor
>  Labels: ocr
> Fix For: 1.15
>
>
> The default value of "enableImageProcessing" in TesseractOCRConfig.properties 
> should be "0" (image processing not required by default), the same as in 
> TesseractOCRConfig.java.
> The value "1" in TesseractOCRConfig.properties overrides the in-code default 
> at runtime. As per the Javadoc, it is optional.
> /**
>  * Set the value to true if processing is to be enabled.
>  * Default value is false.
>  */
> public void setEnableImageProcessing(int enableImageProcessing) {
>     this.enableImageProcessing = enableImageProcessing;
> }
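
To illustrate the mismatch, here is a minimal sketch of how a bundled 
properties file can silently override an in-code default. OcrConfigSketch is a 
hypothetical class for this example; it only assumes that the config reads the 
"enableImageProcessing" key with java.util.Properties, and is not the actual 
TesseractOCRConfig code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical config class showing a properties value overriding a field default. */
final class OcrConfigSketch {

    private int enableImageProcessing = 0; // in-code default: processing disabled

    /** Applies values from properties text; keys that are absent keep their defaults. */
    void load(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String value = props.getProperty("enableImageProcessing");
        if (value != null) {
            enableImageProcessing = Integer.parseInt(value.trim());
        }
    }

    int getEnableImageProcessing() {
        return enableImageProcessing;
    }
}
```

If the shipped properties file contains enableImageProcessing=1, every instance 
that loads it ends up with 1, regardless of the default declared in the class.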




[jira] [Commented] (TIKA-2190) Add "preserve_interword_spaces" option of tesseract

2016-12-20 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765500#comment-15765500
 ] 

Hudson commented on TIKA-2190:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1164 (See 
[https://builds.apache.org/job/Tika-trunk/1164/])
TIKA-2190 -- add configurability for preserve interword spacing (tallison: rev 
ae44b9e507dbb11b9b9f5c57cf342b47966ffb66)
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
* (edit) 
tika-parsers/src/main/resources/org/apache/tika/parser/ocr/TesseractOCRConfig.properties
* (edit) CHANGES.txt
* (add) tika-parsers/src/test/resources/test-documents/testOCR_spacing.png
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java



[jira] [Commented] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset

2016-12-20 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765412#comment-15765412
 ] 

Hudson commented on TIKA-2219:
--

FAILURE: Integrated in Jenkins build tika-2.x-windows #86 (See 
[https://builds.apache.org/job/tika-2.x-windows/86/])
TIKA-2219 make sure to transmit charset name in detectAll via Pascal (tallison: 
rev 54154e0045066dfb50a10d158090262acaabaaba)
* (edit) 
tika-parser-modules/tika-parser-text-module/src/test/java/org/apache/tika/parser/txt/CharsetDetectorTest.java



tika-2.x-windows - Build # 86 - Still Failing

2016-12-20 Thread Apache Jenkins Server

The Apache Jenkins build system has built tika-2.x-windows (build #86)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-2.x-windows/86/ to 
view the results.

[jira] [Commented] (TIKA-1946) Add mime detection and parser for WordPerfect

2016-12-20 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765359#comment-15765359
 ] 

Tim Allison commented on TIKA-1946:
---

W00t!  Christmas came early.  I'll take a look tomorrow.  Thank you!


[jira] [Commented] (TIKA-1946) Add mime detection and parser for WordPerfect

2016-12-20 Thread Pascal Essiembre (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765348#comment-15765348
 ] 

Pascal Essiembre commented on TIKA-1946:


I finally had a bit of time to port the WordPerfect parser to the project.  I 
also added a Quattro Pro parser (also from the WordPerfect Office suite).  As 
this is the first time I have made a pull request for Tika, let me know if 
anything is not proper.

The QuattroPro parser only supports *.qpw files, but since other QuattroPro 
formats share the same mime-type, the parser will be invoked for other formats 
as well (*.wb?).  I added a check in the parser code that simply logs a 
message stating the format is unsupported when one is encountered.  If you have 
a better approach to suggest, let me know.


[jira] [Comment Edited] (TIKA-1946) Add mime detection and parser for WordPerfect

2016-12-20 Thread Pascal Essiembre (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765348#comment-15765348
 ] 

Pascal Essiembre edited comment on TIKA-1946 at 12/20/16 9:51 PM:
--

I finally had a bit of time to port the WordPerfect parser to the project.  I 
also added a Quattro Pro parser (also from the WordPerfect Office suite).  As 
this is the first time I have made a pull request for Tika, let me know if 
anything is not proper.

The QuattroPro parser only supports .qpw files, but since other QuattroPro 
formats share the same mime-type, the parser will be invoked for other formats 
as well (.wb?).  I added a check in the parser code that simply logs a 
message stating the format is unsupported when one is encountered.  If you have 
a better approach to suggest, let me know.


was (Author: pascal.essiembre):
I finally had a bit of time to port the WordPerfect parser to the project.  I 
also added a Quattro Pro parser (also from WordPerfect Office suite).  As it is 
the first time I make a pull-request for Tika, let me know if anything is not 
proper.   

The QuattroPro parser only supports *.qpw files, but since other QuatroPro 
formats share the same mime-type, the parser will be invoked for other formats 
as well (*.wb?).  I added a check in the parser code that will simply log a 
message stating the format is unsupported when encountered.  If you have a 
better approach to suggest let me know.


[jira] [Commented] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset

2016-12-20 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765347#comment-15765347
 ] 

Tim Allison commented on TIKA-2219:
---

Looks like they aren't twiddling with the confidence scores any more 
[repo|http://bugs.icu-project.org/trac/browser/icu4j/tags/release-58-1/main/classes/core/src/com/ibm/icu/text/CharsetDetector.java]

{noformat}
for (int i = 0; i < ALL_CS_RECOGNIZERS.size(); i++) {
    CSRecognizerInfo rcinfo = ALL_CS_RECOGNIZERS.get(i);
    boolean active = (fEnabledRecognizers != null) ? fEnabledRecognizers[i] : rcinfo.isDefaultEnabled;
    if (active) {
        CharsetMatch m = rcinfo.recognizer.match(this);
        if (m != null) {
            matches.add(m);
        }
    }
}
{noformat}
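
To make the difference concrete, here is a toy model of the two loop styles. 
All names (DetectAllSketch, Match, Recognizer, Latin1Recognizer) are 
hypothetical and heavily simplified; this is not the ICU or Tika code, just a 
sketch of why re-wrapping a match with the recognizer's name loses the more 
specific name chosen inside match().

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (hypothetical types) of keeping vs. re-wrapping recognizer matches. */
final class DetectAllSketch {

    static final class Match {
        final String name;
        final int confidence;

        Match(String name, int confidence) {
            this.name = name;
            this.confidence = confidence;
        }
    }

    interface Recognizer {
        String getName();          // fixed, recognizer-level name
        Match match(byte[] input); // may choose a more specific name, or return null
    }

    /** Recognizer whose match() refines the name when C1 bytes (0x80-0x9F) are present. */
    static final class Latin1Recognizer implements Recognizer {
        public String getName() {
            return "ISO-8859-1";
        }

        public Match match(byte[] input) {
            boolean haveC1Bytes = false;
            for (byte b : input) {
                int v = b & 0xFF;
                if (v >= 0x80 && v <= 0x9F) {
                    haveC1Bytes = true;
                }
            }
            return new Match(haveC1Bytes ? "windows-1252" : getName(), 50);
        }
    }

    /** Builds a new Match from the recognizer's name: the refined name is dropped. */
    static List<Match> detectAllRewrapping(List<Recognizer> recognizers, byte[] input) {
        List<Match> matches = new ArrayList<>();
        for (Recognizer r : recognizers) {
            Match m = r.match(input);
            if (m != null) {
                matches.add(new Match(r.getName(), m.confidence));
            }
        }
        return matches;
    }

    /** Adds the Match returned by match() directly, as in the loop quoted above. */
    static List<Match> detectAllKeeping(List<Recognizer> recognizers, byte[] input) {
        List<Match> matches = new ArrayList<>();
        for (Recognizer r : recognizers) {
            Match m = r.match(input);
            if (m != null) {
                matches.add(m);
            }
        }
        return matches;
    }
}
```

For input containing a C1 byte such as 0x93, the re-wrapping variant reports 
"ISO-8859-1" while the keeping variant reports "windows-1252".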


[jira] [Comment Edited] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset

2016-12-20 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765347#comment-15765347
 ] 

Tim Allison edited comment on TIKA-2219 at 12/20/16 9:51 PM:
-

Looks like they aren't twiddling with the confidence scores any more (see: 
[repo|http://bugs.icu-project.org/trac/browser/icu4j/tags/release-58-1/main/classes/core/src/com/ibm/icu/text/CharsetDetector.java]):

{noformat}
for (int i = 0; i < ALL_CS_RECOGNIZERS.size(); i++) {
    CSRecognizerInfo rcinfo = ALL_CS_RECOGNIZERS.get(i);
    boolean active = (fEnabledRecognizers != null) ? fEnabledRecognizers[i] : rcinfo.isDefaultEnabled;
    if (active) {
        CharsetMatch m = rcinfo.recognizer.match(this);
        if (m != null) {
            matches.add(m);
        }
    }
}
{noformat}


was (Author: talli...@mitre.org):
Looks like they aren't twiddling with the confidence scores any more 
[repo|http://bugs.icu-project.org/trac/browser/icu4j/tags/release-58-1/main/classes/core/src/com/ibm/icu/text/CharsetDetector.java]

{noformat}
for (int i = 0; i < ALL_CS_RECOGNIZERS.size(); i++) {
    CSRecognizerInfo rcinfo = ALL_CS_RECOGNIZERS.get(i);
    boolean active = (fEnabledRecognizers != null) ? fEnabledRecognizers[i] : rcinfo.isDefaultEnabled;
    if (active) {
        CharsetMatch m = rcinfo.recognizer.match(this);
        if (m != null) {
            matches.add(m);
        }
    }
}
{noformat}


[jira] [Commented] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset

2016-12-20 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765343#comment-15765343
 ] 

Tim Allison commented on TIKA-2219:
---

Great.  Thank you!


[jira] [Commented] (TIKA-1946) Add mime detection and parser for WordPerfect

2016-12-20 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765330#comment-15765330
 ] 

ASF GitHub Bot commented on TIKA-1946:
--

GitHub user essiembre opened a pull request:

https://github.com/apache/tika/pull/141

New WordPerfect and QuattroPro parsers for TIKA-1946 contributed by 
pascal.essiembre



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/essiembre/tika TIKA-1946

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/141.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #141


commit 87c2ef3191d0a86502dc249240022b3cc973aaa4
Author: Pascal Essiembre 
Date:   2016-12-20T20:42:39Z

New WordPerfect and QuattroPro parsers for TIKA-1946 contributed by
pascal.essiembre





[GitHub] tika pull request #141: New WordPerfect and QuattroPro parsers for TIKA-1946...

2016-12-20 Thread essiembre

GitHub user essiembre opened a pull request:

https://github.com/apache/tika/pull/141

New WordPerfect and QuattroPro parsers for TIKA-1946 contributed by 
pascal.essiembre



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/essiembre/tika TIKA-1946

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/141.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #141


commit 87c2ef3191d0a86502dc249240022b3cc973aaa4
Author: Pascal Essiembre 
Date:   2016-12-20T20:42:39Z

New WordPerfect and QuattroPro parsers for TIKA-1946 contributed by
pascal.essiembre




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[jira] [Resolved] (TIKA-2189) Default value mismatch for "enableImageProcessing" in TesseractOCRConfig.properties and TesseractOCRConfig.java

2016-12-20 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2189.
---
   Resolution: Fixed
Fix Version/s: 1.15

Thank you!

> Default value mismatch for "enableImageProcessing" in 
> TesseractOCRConfig.properties and TesseractOCRConfig.java 
> 
>
> Key: TIKA-2189
> URL: https://issues.apache.org/jira/browse/TIKA-2189
> Project: Tika
>  Issue Type: Bug
>  Components: ocr, parser
>Affects Versions: 1.14
>Reporter: Bipul Kumar
>Priority: Minor
>  Labels: ocr
> Fix For: 1.15
>
>
> The default value of "enableImageProcessing" in TesseractOCRConfig.properties 
> should be "0" (image processing not required by default), the same as in 
> TesseractOCRConfig.java.
> The value "1" in TesseractOCRConfig.properties overrides that default at 
> runtime. As per the Javadoc, it is optional.
> /** 
>* Set the value to true if processing is to be enabled.
>* Default value is false.
>*/
>   public void setEnableImageProcessing(int enableImageProcessing) {
>   this.enableImageProcessing = enableImageProcessing;
>   }
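
For illustration only, a minimal sketch of how a bundled properties file can silently replace a field default, which is the mismatch described above. `OcrConfigSketch`, `load` and `fromText` are hypothetical names made up for this example; this is not Tika's TesseractOCRConfig code, only the same general pattern under that assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical stand-in for a config class; NOT Tika's TesseractOCRConfig.
final class OcrConfigSketch {
    // Code-level default: image processing disabled.
    private int enableImageProcessing = 0;

    // Any key present in the properties replaces the field default.
    void load(Properties props) {
        String value = props.getProperty("enableImageProcessing");
        if (value != null) {
            enableImageProcessing = Integer.parseInt(value.trim());
        }
    }

    int getEnableImageProcessing() {
        return enableImageProcessing;
    }

    // Builds a config from the text of a properties file.
    static OcrConfigSketch fromText(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        OcrConfigSketch config = new OcrConfigSketch();
        config.load(props);
        return config;
    }

    public static void main(String[] args) {
        // A shipped value of 1 wins over the field default of 0.
        System.out.println(fromText("enableImageProcessing=1").getEnableImageProcessing());
        // With no key in the file, the field default is kept.
        System.out.println(fromText("").getEnableImageProcessing());
    }
}
```

In this sketch the two defaults only agree if the shipped file either omits the key or carries the same value as the field initializer.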




[jira] [Resolved] (TIKA-2190) Add "preserve_interword_spaces" option of tesseract

2016-12-20 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2190.
---
   Resolution: Fixed
Fix Version/s: 1.15
   2.0

Thank you!

> Add "preserve_interword_spaces" option of tesseract
> ---
>
> Key: TIKA-2190
> URL: https://issues.apache.org/jira/browse/TIKA-2190
> Project: Tika
>  Issue Type: Improvement
>  Components: ocr
>Reporter: Bipul Kumar
>Assignee: Tim Allison
> Fix For: 2.0, 1.15
>
>
> This option will preserve the spaces for TXT output type so that the layout 
> or context can be inferred while further parsing. 
> to enable :: -c preserve_interword_spaces=1
> to disable :: -c preserve_interword_spaces=0 or simply don't mention
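
As a rough sketch of the enable/disable behaviour described above, the following builds a tesseract argument list and appends `-c preserve_interword_spaces=1` only when the option is requested. `TesseractArgsSketch` and `buildCommand` are hypothetical names for this example; how Tika's own OCR parser assembles its command line is not shown in this thread and is not assumed here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; NOT the way Tika's OCR parser builds its command.
final class TesseractArgsSketch {
    // Returns the argument list for: tesseract <image> <outputBase> [-c preserve_interword_spaces=1]
    static List<String> buildCommand(String image, String outputBase,
                                     boolean preserveInterwordSpaces) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(image);
        cmd.add(outputBase);
        if (preserveInterwordSpaces) {
            // "-c VAR=VALUE" sets a tesseract config variable.
            cmd.add("-c");
            cmd.add("preserve_interword_spaces=1");
        }
        // When disabled, the option is simply not mentioned.
        return cmd;
    }

    public static void main(String[] args) {
        System.out.println(buildCommand("page.png", "out", true));
        System.out.println(buildCommand("page.png", "out", false));
    }
}
```

The resulting list could be handed to a `ProcessBuilder`; the file names here are placeholders.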




[GitHub] tika pull request #139: [TIKA-2189] fix for Default value mismatch for "enab...

2016-12-20 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/139



[jira] [Commented] (TIKA-2189) Default value mismatch for "enableImageProcessing" in TesseractOCRConfig.properties and TesseractOCRConfig.java

2016-12-20 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765300#comment-15765300
 ] 

ASF GitHub Bot commented on TIKA-2189:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/139


> Default value mismatch for "enableImageProcessing" in 
> TesseractOCRConfig.properties and TesseractOCRConfig.java 
> 
>
> Key: TIKA-2189
> URL: https://issues.apache.org/jira/browse/TIKA-2189
> Project: Tika
>  Issue Type: Bug
>  Components: ocr, parser
>Affects Versions: 1.14
>Reporter: Bipul Kumar
>Priority: Minor
>  Labels: ocr
>
> The default value of "enableImageProcessing" in TesseractOCRConfig.properties 
> should be "0" (image processing not required by default), the same as in 
> TesseractOCRConfig.java.
> The value "1" in TesseractOCRConfig.properties overrides that default at 
> runtime. As per the Javadoc, it is optional.
> /** 
>* Set the value to true if processing is to be enabled.
>* Default value is false.
>*/
>   public void setEnableImageProcessing(int enableImageProcessing) {
>   this.enableImageProcessing = enableImageProcessing;
>   }




[jira] [Commented] (TIKA-2221) poi.EncryptedDocumentException not wrapped in tika.exception.EncryptedDocumentException

2016-12-20 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765208#comment-15765208
 ] 

Hudson commented on TIKA-2221:
--

UNSTABLE: Integrated in Jenkins build tika-2.x #184 (See 
[https://builds.apache.org/job/tika-2.x/184/])
 TIKA-2221 -- correctly catch and rethrow encrypted document exception 
(tallison: rev ee761ac00c1dcc80f6c4030fe81a8780c5ac9d7e)
* (edit) 
tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
* (edit) 
tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java


> poi.EncryptedDocumentException not wrapped in 
> tika.exception.EncryptedDocumentException
> ---
>
> Key: TIKA-2221
> URL: https://issues.apache.org/jira/browse/TIKA-2221
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Matthew Caruana Galizia
>Priority: Minor
>  Labels: encryption, office, poi
> Fix For: 2.0, 1.15
>
>
> When parsing an encrypted Word document, a 
> org.apache.poi.EncryptedDocumentException is thrown at 
> WordExtractor.java#151. Tika catches this too far up the stack and 
> incorrectly wraps it in a plain TikaException instead of a 
> org.apache.tika.exception.EncryptedDocumentException.
> The fix would be to catch and wrap the exception correctly, for example:
> {noformat}
> try {
>     document = new HWPFDocument(root);
> } catch (org.apache.poi.EncryptedDocumentException e) {
>     throw new EncryptedDocumentException(e);
> } catch (OldWordFileFormatException e) {
>     parseWord6(root, xhtml);
>     return;
> }
> {noformat}
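
The catch-and-wrap idea in the quoted description can be tried out in isolation with the self-contained sketch below. All class names (`LibraryEncryptedException`, `ParseException`, `EncryptedParseException`) are hypothetical stand-ins, and the library exception is assumed to be unchecked; none of this is the actual POI or Tika code.

```java
// Hypothetical stand-ins for illustration; NOT the POI or Tika classes.
final class ExceptionWrappingSketch {
    // Plays the role of a library's (assumed unchecked) "encrypted" exception.
    static class LibraryEncryptedException extends RuntimeException {
        LibraryEncryptedException(String message) { super(message); }
    }

    // Plays the role of the parser framework's generic checked exception.
    static class ParseException extends Exception {
        ParseException(String message, Throwable cause) { super(message, cause); }
    }

    // Plays the role of the framework's specific "encrypted document" exception.
    static class EncryptedParseException extends ParseException {
        EncryptedParseException(Throwable cause) { super("Document is encrypted", cause); }
    }

    // Simulates opening a document with the underlying library.
    static String open(boolean encrypted) {
        if (encrypted) {
            throw new LibraryEncryptedException("password required");
        }
        return "document text";
    }

    // Catches the specific library exception close to where it is thrown,
    // before any generic handler can wrap it in the plain exception type.
    static String parse(boolean encrypted) throws ParseException {
        try {
            return open(encrypted);
        } catch (LibraryEncryptedException e) {
            throw new EncryptedParseException(e);
        } catch (RuntimeException e) {
            throw new ParseException("Unexpected failure", e);
        }
    }

    // Convenience for demos: returns the text, or the simple name of the exception raised.
    static String outcome(boolean encrypted) {
        try {
            return parse(encrypted);
        } catch (ParseException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(outcome(false));
        System.out.println(outcome(true));
    }
}
```

Callers that catch the specific type can then prompt for a password, while everything else still falls through to the generic exception.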




[jira] [Commented] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset

2016-12-20 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765207#comment-15765207
 ] 

Hudson commented on TIKA-2219:
--

UNSTABLE: Integrated in Jenkins build tika-2.x #184 (See 
[https://builds.apache.org/job/tika-2.x/184/])
TIKA-2219 make sure to transmit charset name in detectAll via Pascal (tallison: 
rev 68f3058643756d8e08f85903a585684f7d0f0b20)
* (edit) 
tika-parser-modules/tika-parser-text-module/src/test/java/org/apache/tika/parser/txt/CharsetDetectorTest.java
* (edit) 
tika-parser-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/txt/CharsetDetector.java
* (add) 
tika-test-resources/src/test/resources/test-documents/testTXT_win-1252.txt


> CharsetDetector no longer detects windows-1252 charset
> --
>
> Key: TIKA-2219
> URL: https://issues.apache.org/jira/browse/TIKA-2219
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Any.
>Reporter: Pascal Essiembre
>Priority: Minor
> Fix For: 2.0, 1.15
>
>
> Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is 
> always detected instead.  While not tested, this likely affects other 
> windows-125* encodings as well.
> I tracked it down to a change in the 
> {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method.  Now it always 
> returns "ISO-8859-1" whereas before it was: {{return haveC1Bytes ? 
> "windows-1252" : "ISO-8859-1";}}
> Now that condition has been moved to the {{match(CharsetDetector det)}} 
> method so that the returned CharsetMatch has the proper name.  The problem 
> with that is {{CharsetDetector#detectAll()}} method overwrites the correct 
> match with a new one that will return the value of {{#getName()}}  from the 
> {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case).
> There might be legitimate reasons why the {{CharsetMatch}} instances in 
> {{detectAll()}} method are replaced with new ones, but changing this code in 
> that method appears to work for me:
> // Remove this:
> //CharsetMatch m = new CharsetMatch(this, csr, confidence);
> //matches.add(m);
> // Add this instead:
> matches.add(charsetMatch);
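
To make the `haveC1Bytes` condition quoted above concrete: bytes 0x80 to 0x9F are C1 control codes in ISO-8859-1 but printable characters (curly quotes, dashes and so on) in windows-1252, so their presence is a reasonable hint for the latter. The sketch below assumes that this is all the flag records; `C1ByteSketch`, `hasC1Bytes` and `nameFor` are hypothetical names, not the recognizer's real code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the C1-byte heuristic; NOT the Tika/ICU recognizer.
final class C1ByteSketch {
    // True if any byte falls in 0x80..0x9F, the C1 control range of ISO-8859-1.
    static boolean hasC1Bytes(byte[] input) {
        for (byte b : input) {
            int v = b & 0xFF; // treat the byte as unsigned
            if (v >= 0x80 && v <= 0x9F) {
                return true;
            }
        }
        return false;
    }

    // Mirrors the quoted expression: haveC1Bytes ? "windows-1252" : "ISO-8859-1".
    static String nameFor(byte[] input) {
        return hasC1Bytes(input) ? "windows-1252" : "ISO-8859-1";
    }

    public static void main(String[] args) {
        // 0xE9 (e with acute accent) is outside the C1 range.
        byte[] plainLatin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        // 0x93 and 0x94 are curly double quotes in windows-1252.
        byte[] curlyQuotes = {(byte) 0x93, 'h', 'i', (byte) 0x94};
        System.out.println(nameFor(plainLatin1));
        System.out.println(nameFor(curlyQuotes));
    }
}
```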




[jira] [Assigned] (TIKA-2190) Add "preserve_interword_spaces" option of tesseract

2016-12-20 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reassigned TIKA-2190:
-

Assignee: Tim Allison

> Add "preserve_interword_spaces" option of tesseract
> ---
>
> Key: TIKA-2190
> URL: https://issues.apache.org/jira/browse/TIKA-2190
> Project: Tika
>  Issue Type: Improvement
>  Components: ocr
>Reporter: Bipul Kumar
>Assignee: Tim Allison
>
> This option will preserve the spaces for TXT output type so that the layout 
> or context can be inferred while further parsing. 
> to enable :: -c preserve_interword_spaces=1
> to disable :: -c preserve_interword_spaces=0 or simply don't mention




[jira] [Commented] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset

2016-12-20 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765144#comment-15765144
 ] 

Hudson commented on TIKA-2219:
--

FAILURE: Integrated in Jenkins build tika-2.x-windows #85 (See 
[https://builds.apache.org/job/tika-2.x-windows/85/])
TIKA-2219 make sure to transmit charset name in detectAll via Pascal (tallison: 
rev 68f3058643756d8e08f85903a585684f7d0f0b20)
* (edit) 
tika-parser-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/txt/CharsetDetector.java
* (edit) 
tika-parser-modules/tika-parser-text-module/src/test/java/org/apache/tika/parser/txt/CharsetDetectorTest.java
* (add) 
tika-test-resources/src/test/resources/test-documents/testTXT_win-1252.txt


> CharsetDetector no longer detects windows-1252 charset
> --
>
> Key: TIKA-2219
> URL: https://issues.apache.org/jira/browse/TIKA-2219
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Any.
>Reporter: Pascal Essiembre
>Priority: Minor
> Fix For: 2.0, 1.15
>
>
> Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is 
> always detected instead.  While not tested, this likely affects other 
> windows-125* encodings as well.
> I tracked it down to a change in the 
> {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method.  Now it always 
> returns "ISO-8859-1" whereas before it was: {{return haveC1Bytes ? 
> "windows-1252" : "ISO-8859-1";}}
> Now that condition has been moved to the {{match(CharsetDetector det)}} 
> method so that the returned CharsetMatch has the proper name.  The problem 
> with that is {{CharsetDetector#detectAll()}} method overwrites the correct 
> match with a new one that will return the value of {{#getName()}}  from the 
> {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case).
> There might be legitimate reasons why the {{CharsetMatch}} instances in 
> {{detectAll()}} method are replaced with new ones, but changing this code in 
> that method appears to work for me:
> // Remove this:
> //CharsetMatch m = new CharsetMatch(this, csr, confidence);
> //matches.add(m);
> // Add this instead:
> matches.add(charsetMatch);




tika-2.x-windows - Build # 85 - Still Failing

2016-12-20 Thread Apache Jenkins Server

The Apache Jenkins build system has built tika-2.x-windows (build #85)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-2.x-windows/85/ to 
view the results.

[jira] [Commented] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset

2016-12-20 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765129#comment-15765129
 ] 

Hudson commented on TIKA-2219:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1163 (See 
[https://builds.apache.org/job/Tika-trunk/1163/])
TIKA-2219 - make sure to transmit encoding name in detectAll() via (tallison: 
rev 2dbd65122a477c1b7a61d88e4fdf25a3d47effcd)
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetDetector.java
* (add) tika-parsers/src/test/resources/test-documents/testTXT_win-1252.txt
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/txt/CharsetDetectorTest.java


> CharsetDetector no longer detects windows-1252 charset
> --
>
> Key: TIKA-2219
> URL: https://issues.apache.org/jira/browse/TIKA-2219
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Any.
>Reporter: Pascal Essiembre
>Priority: Minor
> Fix For: 2.0, 1.15
>
>
> Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is 
> always detected instead.  While not tested, this likely affects other 
> windows-125* encodings as well.
> I tracked it down to a change in the 
> {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method.  Now it always 
> returns "ISO-8859-1" whereas before it was: {{return haveC1Bytes ? 
> "windows-1252" : "ISO-8859-1";}}
> Now that condition has been moved to the {{match(CharsetDetector det)}} 
> method so that the returned CharsetMatch has the proper name.  The problem 
> with that is {{CharsetDetector#detectAll()}} method overwrites the correct 
> match with a new one that will return the value of {{#getName()}}  from the 
> {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case).
> There might be legitimate reasons why the {{CharsetMatch}} instances in 
> {{detectAll()}} method are replaced with new ones, but changing this code in 
> that method appears to work for me:
> // Remove this:
> //CharsetMatch m = new CharsetMatch(this, csr, confidence);
> //matches.add(m);
> // Add this instead:
> matches.add(charsetMatch);




tika-2.x - Build # 183 - Failure

2016-12-20 Thread Apache Jenkins Server

The Apache Jenkins build system has built tika-2.x (build #183)

Status: Failure

Check console output at https://builds.apache.org/job/tika-2.x/183/ to view the 
results.

[jira] [Commented] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset

2016-12-20 Thread Pascal Essiembre (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765059#comment-15765059
 ] 

Pascal Essiembre commented on TIKA-2219:


BTW, I tested and can confirm your fix works just fine.

> CharsetDetector no longer detects windows-1252 charset
> --
>
> Key: TIKA-2219
> URL: https://issues.apache.org/jira/browse/TIKA-2219
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Any.
>Reporter: Pascal Essiembre
>Priority: Minor
> Fix For: 2.0, 1.15
>
>
> Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is 
> always detected instead.  While not tested, this likely affects other 
> windows-125* encodings as well.
> I tracked it down to a change in the 
> {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method.  Now it always 
> returns "ISO-8859-1" whereas before it was: {{return haveC1Bytes ? 
> "windows-1252" : "ISO-8859-1";}}
> Now that condition has been moved to the {{match(CharsetDetector det)}} 
> method so that the returned CharsetMatch has the proper name.  The problem 
> with that is {{CharsetDetector#detectAll()}} method overwrites the correct 
> match with a new one that will return the value of {{#getName()}}  from the 
> {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case).
> There might be legitimate reasons why the {{CharsetMatch}} instances in 
> {{detectAll()}} method are replaced with new ones, but changing this code in 
> that method appears to work for me:
> // Remove this:
> //CharsetMatch m = new CharsetMatch(this, csr, confidence);
> //matches.add(m);
> // Add this instead:
> matches.add(charsetMatch);




[jira] [Commented] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset

2016-12-20 Thread Pascal Essiembre (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765034#comment-15765034
 ] 

Pascal Essiembre commented on TIKA-2219:


I am relying on CharsetDetector.  Thanks for the fix!

> CharsetDetector no longer detects windows-1252 charset
> --
>
> Key: TIKA-2219
> URL: https://issues.apache.org/jira/browse/TIKA-2219
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Any.
>Reporter: Pascal Essiembre
>Priority: Minor
> Fix For: 2.0, 1.15
>
>
> Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is 
> always detected instead.  While not tested, this likely affects other 
> windows-125* encodings as well.
> I tracked it down to a change in the 
> {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method.  Now it always 
> returns "ISO-8859-1" whereas before it was: {{return haveC1Bytes ? 
> "windows-1252" : "ISO-8859-1";}}
> Now that condition has been moved to the {{match(CharsetDetector det)}} 
> method so that the returned CharsetMatch has the proper name.  The problem 
> with that is {{CharsetDetector#detectAll()}} method overwrites the correct 
> match with a new one that will return the value of {{#getName()}}  from the 
> {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case).
> There might be legitimate reasons why the {{CharsetMatch}} instances in 
> {{detectAll()}} method are replaced with new ones, but changing this code in 
> that method appears to work for me:
> // Remove this:
> //CharsetMatch m = new CharsetMatch(this, csr, confidence);
> //matches.add(m);
> // Add this instead:
> matches.add(charsetMatch);




[jira] [Resolved] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset

2016-12-20 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2219.
---
   Resolution: Fixed
Fix Version/s: 1.15
   2.0

> CharsetDetector no longer detects windows-1252 charset
> --
>
> Key: TIKA-2219
> URL: https://issues.apache.org/jira/browse/TIKA-2219
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Any.
>Reporter: Pascal Essiembre
>Priority: Minor
> Fix For: 2.0, 1.15
>
>
> Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is 
> always detected instead.  While not tested, this likely affects other 
> windows-125* encodings as well.
> I tracked it down to a change in the 
> {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method.  Now it always 
> returns "ISO-8859-1" whereas before it was: {{return haveC1Bytes ? 
> "windows-1252" : "ISO-8859-1";}}
> Now that condition has been moved to the {{match(CharsetDetector det)}} 
> method so that the returned CharsetMatch has the proper name.  The problem 
> with that is {{CharsetDetector#detectAll()}} method overwrites the correct 
> match with a new one that will return the value of {{#getName()}}  from the 
> {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case).
> There might be legitimate reasons why the {{CharsetMatch}} instances in 
> {{detectAll()}} method are replaced with new ones, but changing this code in 
> that method appears to work for me:
> // Remove this:
> //CharsetMatch m = new CharsetMatch(this, csr, confidence);
> //matches.add(m);
> // Add this instead:
> matches.add(charsetMatch);




[jira] [Commented] (TIKA-2221) poi.EncryptedDocumentException not wrapped in tika.exception.EncryptedDocumentException

2016-12-20 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765007#comment-15765007
 ] 

Hudson commented on TIKA-2221:
--

FAILURE: Integrated in Jenkins build tika-2.x-windows #84 (See 
[https://builds.apache.org/job/tika-2.x-windows/84/])
 TIKA-2221 -- correctly catch and rethrow encrypted document exception 
(tallison: rev ee761ac00c1dcc80f6c4030fe81a8780c5ac9d7e)
* (edit) 
tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
* (edit) 
tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java


> poi.EncryptedDocumentException not wrapped in 
> tika.exception.EncryptedDocumentException
> ---
>
> Key: TIKA-2221
> URL: https://issues.apache.org/jira/browse/TIKA-2221
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Matthew Caruana Galizia
>Priority: Minor
>  Labels: encryption, office, poi
> Fix For: 2.0, 1.15
>
>
> When parsing an encrypted Word document, a 
> org.apache.poi.EncryptedDocumentException is thrown at 
> WordExtractor.java#151. Tika catches this too far up the stack and 
> incorrectly wraps it in a plain TikaException instead of a 
> org.apache.tika.exception.EncryptedDocumentException.
> The fix would be to catch and wrap the exception correctly, for example:
> {noformat}
> try {
>     document = new HWPFDocument(root);
> } catch (org.apache.poi.EncryptedDocumentException e) {
>     throw new EncryptedDocumentException(e);
> } catch (OldWordFileFormatException e) {
>     parseWord6(root, xhtml);
>     return;
> }
> {noformat}




[jira] [Commented] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset

2016-12-20 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765010#comment-15765010
 ] 

Tim Allison commented on TIKA-2219:
---

Yes, your diagnosis is correct.  Thank you.

So that we capture the updated confidence, I used the fuller constructor:

{noformat}
CharsetMatch m = new CharsetMatch(this, csr, confidence, 
charsetMatch.getName(), charsetMatch.getLanguage());
{noformat}

Out of curiosity are you using the standard detectors in order, or are you only 
using CharsetDetector?

I found that the UniversalEncodingDetector was applying ISO-8859-1 to the file 
I created to trigger "windows-1252" in CharsetDetector.
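
A simplified sketch of the difference between the two constructors discussed above: rebuilding a match from the recognizer alone drops the per-input name, while passing the detected name through keeps it and still allows an updated confidence. `Match`, `Latin1Recognizer` and both `detectAll...` methods are hypothetical and far smaller than the real classes; the confidence numbers are arbitrary placeholders.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types for illustration; NOT the Tika/ICU CharsetDetector classes.
final class DetectAllSketch {
    static final class Match {
        final int confidence;
        final String name;

        Match(int confidence, String name) {
            this.confidence = confidence;
            this.name = name;
        }
    }

    // Its static name is always ISO-8859-1, but match() may report
    // windows-1252 for a particular input.
    static final class Latin1Recognizer {
        String getName() {
            return "ISO-8859-1";
        }

        Match match(byte[] input) {
            boolean haveC1Bytes = false;
            for (byte b : input) {
                int v = b & 0xFF;
                if (v >= 0x80 && v <= 0x9F) {
                    haveC1Bytes = true;
                }
            }
            return new Match(50, haveC1Bytes ? "windows-1252" : getName());
        }
    }

    // Rebuilds the match from the recognizer's static name: the detected name is lost.
    static List<Match> detectAllLosingName(Latin1Recognizer r, byte[] input) {
        List<Match> matches = new ArrayList<>();
        Match found = r.match(input);
        matches.add(new Match(found.confidence, r.getName()));
        return matches;
    }

    // Rebuilds the match with an adjusted confidence but carries the detected name along.
    static List<Match> detectAllKeepingName(Latin1Recognizer r, byte[] input) {
        List<Match> matches = new ArrayList<>();
        Match found = r.match(input);
        int adjusted = Math.min(found.confidence + 10, 100); // placeholder adjustment
        matches.add(new Match(adjusted, found.name));
        return matches;
    }

    public static void main(String[] args) {
        byte[] input = {(byte) 0x93, 'h', 'i', (byte) 0x94};
        Latin1Recognizer r = new Latin1Recognizer();
        System.out.println(detectAllLosingName(r, input).get(0).name);
        System.out.println(detectAllKeepingName(r, input).get(0).name);
    }
}
```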

> CharsetDetector no longer detects windows-1252 charset
> --
>
> Key: TIKA-2219
> URL: https://issues.apache.org/jira/browse/TIKA-2219
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Any.
>Reporter: Pascal Essiembre
>Priority: Minor
>
> Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is 
> always detected instead.  While not tested, this likely affects other 
> windows-125* encodings as well.
> I tracked it down to a change in the 
> {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method.  Now it always 
> returns "ISO-8859-1" whereas before it was: {{return haveC1Bytes ? 
> "windows-1252" : "ISO-8859-1";}}
> Now that condition has been moved to the {{match(CharsetDetector det)}} 
> method so that the returned CharsetMatch has the proper name.  The problem 
> with that is {{CharsetDetector#detectAll()}} method overwrites the correct 
> match with a new one that will return the value of {{#getName()}}  from the 
> {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case).
> There might be legitimate reasons why the {{CharsetMatch}} instances in 
> {{detectAll()}} method are replaced with new ones, but changing this code in 
> that method appears to work for me:
> // Remove this:
> //CharsetMatch m = new CharsetMatch(this, csr, confidence);
> //matches.add(m);
> // Add this instead:
> matches.add(charsetMatch);




tika-2.x-windows - Build # 84 - Still Failing

2016-12-20 Thread Apache Jenkins Server

The Apache Jenkins build system has built tika-2.x-windows (build #84)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-2.x-windows/84/ to 
view the results.

[jira] [Commented] (TIKA-2221) poi.EncryptedDocumentException not wrapped in tika.exception.EncryptedDocumentException

2016-12-20 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15764968#comment-15764968
 ] 

Hudson commented on TIKA-2221:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1162 (See 
[https://builds.apache.org/job/Tika-trunk/1162/])
TIKA-2221 -- correctly catch and convert encrypted document exception to 
(tallison: rev c62410443ca88f8118f50e6ee521a13a22f64729)
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java


> poi.EncryptedDocumentException not wrapped in 
> tika.exception.EncryptedDocumentException
> ---
>
> Key: TIKA-2221
> URL: https://issues.apache.org/jira/browse/TIKA-2221
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Matthew Caruana Galizia
>Priority: Minor
>  Labels: encryption, office, poi
> Fix For: 2.0, 1.15
>
>
> When parsing an encrypted Word document, a 
> org.apache.poi.EncryptedDocumentException is thrown at 
> WordExtractor.java#151. Tika catches this too far up the stack and 
> incorrectly wraps it in a plain TikaException instead of a 
> org.apache.tika.exception.EncryptedDocumentException.
> The fix would be to catch and wrap the exception correctly, for example:
> {noformat}
> try {
>     document = new HWPFDocument(root);
> } catch (org.apache.poi.EncryptedDocumentException e) {
>     throw new EncryptedDocumentException(e);
> } catch (OldWordFileFormatException e) {
>     parseWord6(root, xhtml);
>     return;
> }
> {noformat}




[jira] [Commented] (TIKA-2220) Refactor/merge new experimental docx/pptx components

2016-12-20 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15764967#comment-15764967
 ] 

Hudson commented on TIKA-2220:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1162 (See 
[https://builds.apache.org/job/Tika-trunk/1162/])
TIKA-2220 - refactor new sax pptx and docx to reduce code duplication. 
(tallison: rev 376318fc1b34014ec31d5fbfdfa962183ea8c717)
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/SXWPFExtractorTest.java
* (delete) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xslf/XSLFTikaBodyPartHandler.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFEventBasedWordExtractor.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/Word2006MLDocHandler.java
* (delete) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractDocumentXMLBodyHandler.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xslf/XSLFEventBasedPowerPointExtractor.java
* (delete) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFTikaBodyPartHandler.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXWPFWordExtractorDecorator.java
* (add) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLTikaBodyPartHandler.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXSLFPowerPointExtractorDecorator.java
* (delete) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xslf/XSLFDocumentXMLBodyHandler.java
* (delete) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/BodyPartHandler.java
* (add) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/WordAndPowerPointTextPartHandler.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFListManager.java
* (add) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLWordAndPowerPointTextHandler.java
* (delete) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFDocumentXMLBodyHandler.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFStylesShim.java


> Refactor/merge new experimental docx/pptx components
> 
>
> Key: TIKA-2220
> URL: https://issues.apache.org/jira/browse/TIKA-2220
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 1.15
>
>
> We can get rid of a fair amount of duplicate code by merging the docx and 
> pptx SAX handlers.  If we find significant differences in future desired 
> functionality, we can split them back out.




[jira] [Resolved] (TIKA-2221) poi.EncryptedDocumentException not wrapped in tika.exception.EncryptedDocumentException

2016-12-20 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2221.
---
   Resolution: Fixed
Fix Version/s: 2.0

Thank you!

> poi.EncryptedDocumentException not wrapped in 
> tika.exception.EncryptedDocumentException
> ---
>
> Key: TIKA-2221
> URL: https://issues.apache.org/jira/browse/TIKA-2221
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Matthew Caruana Galizia
>Priority: Minor
>  Labels: encryption, office, poi
> Fix For: 2.0, 1.15
>
>
> When parsing an encrypted Word document, a 
> org.apache.poi.EncryptedDocumentException is thrown at 
> WordExtractor.java#151. Tika catches this too far up the stack and 
> incorrectly wraps it in a plain TikaException instead of a 
> org.apache.tika.exception.EncryptedDocumentException.
> The fix would be to catch and wrap the exception correctly, for example:
> {noformat}
> try {
>     document = new HWPFDocument(root);
> } catch (org.apache.poi.EncryptedDocumentException e) {
>     throw new EncryptedDocumentException(e);
> } catch (OldWordFileFormatException e) {
>     parseWord6(root, xhtml);
>     return;
> }
> {noformat}




[jira] [Resolved] (TIKA-2220) Refactor/merge new experimental docx/pptx components

2016-12-20 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2220.
---
   Resolution: Fixed
Fix Version/s: 1.15

We may want to split these out again in the future...

> Refactor/merge new experimental docx/pptx components
> 
>
> Key: TIKA-2220
> URL: https://issues.apache.org/jira/browse/TIKA-2220
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 1.15
>
>
> We can get rid of a fair amount of duplicate code by merging the docx and 
> pptx SAX handlers.  If we find significant differences in future desired 
> functionality, we can split them back out.




[jira] [Created] (TIKA-2221) poi.EncryptedDocumentException not wrapped in tika.exception.EncryptedDocumentException

2016-12-20 Thread Matthew Caruana Galizia (JIRA)

Matthew Caruana Galizia created TIKA-2221:
-

 Summary: poi.EncryptedDocumentException not wrapped in 
tika.exception.EncryptedDocumentException
 Key: TIKA-2221
 URL: https://issues.apache.org/jira/browse/TIKA-2221
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.14
Reporter: Matthew Caruana Galizia
Priority: Minor
 Fix For: 1.15


When parsing an encrypted Word document, an 
org.apache.poi.EncryptedDocumentException is thrown at WordExtractor.java#151. 
Tika catches this too far up the stack and incorrectly wraps it in a plain 
TikaException instead of an org.apache.tika.exception.EncryptedDocumentException.

The fix would be to catch and wrap the exception correctly, for example:

{noformat}
try {
    document = new HWPFDocument(root);
} catch (org.apache.poi.EncryptedDocumentException e) {
    throw new EncryptedDocumentException(e);
} catch (OldWordFileFormatException e) {
    parseWord6(root, xhtml);
    return;
}
{noformat}
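
For anyone who wants to see why the catch site matters, here is a minimal, self-contained sketch of the pattern described above. All classes in it are hypothetical stand-ins, not the real POI or Tika types: PoiEncryptedException plays the role of org.apache.poi.EncryptedDocumentException, and the nested TikaException / EncryptedDocumentException only mimic the shape of the Tika hierarchy. The point is only that catching the specific exception close to where it is thrown lets callers tell "encrypted" apart from any other parse failure.

```java
/** Illustrative sketch only; every class here is a stand-in, not Tika or POI code. */
class EncryptedWrapSketch {

    /** Stand-in for org.apache.poi.EncryptedDocumentException (an unchecked exception). */
    static class PoiEncryptedException extends IllegalStateException {
        PoiEncryptedException(String message) {
            super(message);
        }
    }

    /** Stand-in for the generic Tika parse failure. */
    static class TikaException extends Exception {
        TikaException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Stand-in for the more specific "document is encrypted" failure. */
    static class EncryptedDocumentException extends TikaException {
        EncryptedDocumentException(Throwable cause) {
            super("Document is encrypted", cause);
        }
    }

    /** Hypothetical low-level open call that fails on encrypted input. */
    static void openDocument(boolean encrypted) {
        if (encrypted) {
            throw new PoiEncryptedException("password required");
        }
    }

    /** Catches the specific exception first, so it is not flattened into the generic one. */
    static void parse(boolean encrypted) throws TikaException {
        try {
            openDocument(encrypted);
        } catch (PoiEncryptedException e) {
            throw new EncryptedDocumentException(e);
        } catch (RuntimeException e) {
            throw new TikaException("Unexpected parse failure", e);
        }
    }

    public static void main(String[] args) {
        try {
            parse(true);
        } catch (EncryptedDocumentException e) {
            System.out.println("encrypted: " + e.getCause().getMessage());
        } catch (TikaException e) {
            System.out.println("other failure: " + e.getMessage());
        }
    }
}
```

If the first catch clause were missing, the generic RuntimeException handler would swallow the encrypted case, which is roughly the behaviour reported here.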




Re: Apache Tika issue review (TIKA-2190 & TIKA-2189)

2016-12-20 Thread Chris Mattmann

Moving dev-owner to BCC.

 

I think you meant to send this to dev@tika.apache.org, so sending there :)

From: Bipul Kumar 
Date: Tuesday, December 20, 2016 at 1:54 AM
To: "dev-ow...@tika.apache.org" , 
"talli...@mitre.org" , "talli...@apache.org" 

Subject: Apache Tika issue review (TIKA-2190 & TIKA-2189)

 

Hi,

I have raised two issues, TIKA-2190 and TIKA-2189, regarding my observations 
while working on the Tesseract OCR parser.

Please review and let me know.

Regards,
Bipul

[jira] [Created] (TIKA-2220) Refactor/merge new experimental docx/pptx components

2016-12-20 Thread Tim Allison (JIRA)

Tim Allison created TIKA-2220:
-

 Summary: Refactor/merge new experimental docx/pptx components
 Key: TIKA-2220
 URL: https://issues.apache.org/jira/browse/TIKA-2220
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Trivial


We can get rid of a fair amount of duplicate code by merging the docx and pptx 
SAX handlers.  If we find significant differences in future desired 
functionality, we can split them back out.




[jira] [Commented] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset

2016-12-20 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15764363#comment-15764363
 ] 

Tim Allison commented on TIKA-2219:
---

Thank you for opening this. This was caused by our "upgrade" to our copy of 
ICU4J (TIKA-2041).  I'll take a look.

> CharsetDetector no longer detects windows-1252 charset
> --
>
> Key: TIKA-2219
> URL: https://issues.apache.org/jira/browse/TIKA-2219
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Any.
>Reporter: Pascal Essiembre
>Priority: Minor
>
> Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is 
> always detected instead.  While not tested, this likely affects other 
> windows-125* encodings as well.
> I tracked it down to a change in the 
> {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method.  Now it always 
> returns "ISO-8859-1" whereas before it was: {{return haveC1Bytes ? 
> "windows-1252" : "ISO-8859-1";}}
> Now that condition has been moved to the {{match(CharsetDetector det)}} 
> method so that the returned CharsetMatch has the proper name.  The problem 
> with that is {{CharsetDetector#detectAll()}} method overwrites the correct 
> match with a new one that will return the value of {{#getName()}}  from the 
> {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case).
> There might be legitimate reasons why the {{CharsetMatch}} instances in 
> {{detectAll()}} method are replaced with new ones, but changing this code in 
> that method appears to work for me:
> // Remove this:
> //CharsetMatch m = new CharsetMatch(this, csr, confidence);
> //matches.add(m);
> // Add this instead:
> matches.add(charsetMatch);
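
To make the haveC1Bytes condition quoted above concrete, here is a rough standalone sketch. It is not the Tika/ICU4J code: the class and method names are made up for illustration, and it assumes that "C1 bytes" means any byte in the range 0x80 to 0x9F, which ISO-8859-1 reserves for control characters and windows-1252 uses for printable ones such as curly quotes.

```java
/** Illustrative sketch only; a hypothetical helper, not part of Tika or ICU4J. */
class C1ByteSketch {

    /** Returns true if any byte falls in the assumed C1 range 0x80..0x9F. */
    static boolean hasC1Bytes(byte[] input) {
        for (byte b : input) {
            int unsigned = b & 0xFF; // Java bytes are signed, so mask first
            if (unsigned >= 0x80 && unsigned <= 0x9F) {
                return true;
            }
        }
        return false;
    }

    /** Mirrors the quoted ternary: C1 bytes suggest windows-1252, otherwise ISO-8859-1. */
    static String pickName(byte[] input) {
        return hasC1Bytes(input) ? "windows-1252" : "ISO-8859-1";
    }

    public static void main(String[] args) {
        byte[] curlyQuotes = {(byte) 0x93, 'h', 'i', (byte) 0x94}; // 0x93/0x94 are C1 bytes
        byte[] latin1Only = {'c', 'a', 'f', (byte) 0xE9};          // 0xE9 is outside the C1 range
        System.out.println(pickName(curlyQuotes)); // prints "windows-1252"
        System.out.println(pickName(latin1Only));  // prints "ISO-8859-1"
    }
}
```

The reported bug is then about where this name ends up: if a later step rebuilds the match from the recognizer's fixed getName() value, the per-input choice made here is lost.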




[jira] [Commented] (TIKA-2201) OutOfMemoryError on a reasonably sized document

2016-12-20 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15764283#comment-15764283
 ] 

Tim Allison commented on TIKA-2201:
---

I saved a single slide from the test document, and I'm getting an OOM with the 
DOM parser if -Xmx is < 500m.  Given that the DOM parser holds the entire deck 
in memory, parsing the full document would require ~60 GB of memory. Whoa.

We could try lazy loading of slides in the DOM parser in POI, but I think SAX 
is the way to go for Tika.
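
As a rough illustration of the SAX approach (not the actual Tika or POI handler), the sketch below streams through one slide's XML with the JDK's built-in SAX parser and keeps only the text runs, so memory use does not grow with the size of the element tree. The class name is hypothetical, and it assumes that slide text lives in DrawingML a:t elements and that the parser is left in its default non-namespace-aware mode, so the qualified name arrives as "a:t".

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Illustrative sketch only; a hypothetical class, not the Tika pptx handler. */
class SlideTextSaxSketch {

    /** Streams the XML once and collects the text of every a:t element, one per line. */
    static String extractText(String slideXml) throws Exception {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inText; // true while inside an a:t element

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("a:t".equals(qName)) {
                    inText = true;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("a:t".equals(qName)) {
                    inText = false;
                    out.append('\n');
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) {
                    out.append(ch, start, length);
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(slideXml.getBytes(StandardCharsets.UTF_8)), handler);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String slide = "<p:sld xmlns:p=\"urn:p\" xmlns:a=\"urn:a\">"
                + "<a:p><a:r><a:t>Hello</a:t></a:r><a:r><a:t>World</a:t></a:r></a:p>"
                + "</p:sld>";
        System.out.print(extractText(slide));
    }
}
```

Only the current text run is ever buffered by the handler, which is the property that makes SAX attractive for very large decks.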

> OutOfMemoryError on a reasonably sized document
> ---
>
> Key: TIKA-2201
> URL: https://issues.apache.org/jira/browse/TIKA-2201
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
>
> The following document, which is not particularly big, causes an OOM in Tika 
> parser:
> https://dl.dropboxusercontent.com/u/92341073/Certificates-9-20-2013.pptx
> Java memory limit is 4GB.



