[jira] [Commented] (TIKA-2543) No content extraction for application/x-webarchive format

2018-10-17 Thread Nick Burch (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653804#comment-16653804
 ] 

Nick Burch commented on TIKA-2543:
--

Great find, Tim! Looks like an excellent resource on this.

Assuming access to a Mac so you have the {{plutil}} tool to be able to generate 
(and check!) a bunch of representative test files, and helped by the various 
Tika IO helpers we have, my hunch is it'd be about a day's work to add support 
for the binary plist format + test it properly + wire into Tika. Maybe allow 2 
days if new to Tika and/or new to decoding binary file formats.
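
To give a feel for the first step of that work, here is a minimal sketch (not Tika code) that checks the {{bplist}} magic and reads the fixed 32-byte trailer at the end of a binary plist. The class name {{BPlistTrailer}} and its field names are hypothetical, and the layout assumed here is the commonly documented one: 5 unused bytes, a sort-version byte, two size bytes, then three big-endian 64-bit values.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of Tika: reads the fixed-size trailer of a binary plist. */
final class BPlistTrailer {
    static final int HEADER_LENGTH = 8;   // "bplist" plus two version characters, e.g. "bplist00"
    static final int TRAILER_LENGTH = 32; // assumed to be the last 32 bytes of the file

    final int offsetIntSize;      // width in bytes of each offset-table entry
    final int objectRefSize;      // width in bytes of each object reference
    final long numObjects;        // number of entries in the offset table
    final long topObject;         // index of the root object
    final long offsetTableOffset; // file position where the offset table starts

    private BPlistTrailer(int offsetIntSize, int objectRefSize,
                          long numObjects, long topObject, long offsetTableOffset) {
        this.offsetIntSize = offsetIntSize;
        this.objectRefSize = objectRefSize;
        this.numObjects = numObjects;
        this.topObject = topObject;
        this.offsetTableOffset = offsetTableOffset;
    }

    static BPlistTrailer parse(byte[] file) {
        if (file.length < HEADER_LENGTH + TRAILER_LENGTH) {
            throw new IllegalArgumentException("Too short to be a binary plist");
        }
        String magic = new String(file, 0, 6, StandardCharsets.US_ASCII);
        if (!"bplist".equals(magic)) {
            throw new IllegalArgumentException("Missing bplist magic");
        }
        // ByteBuffer is big-endian by default, which matches the format.
        ByteBuffer trailer = ByteBuffer.wrap(file, file.length - TRAILER_LENGTH, TRAILER_LENGTH);
        trailer.position(trailer.position() + 6); // skip 5 unused bytes and the sort-version byte
        int offsetIntSize = trailer.get() & 0xFF;
        int objectRefSize = trailer.get() & 0xFF;
        long numObjects = trailer.getLong();
        long topObject = trailer.getLong();
        long offsetTableOffset = trailer.getLong();
        return new BPlistTrailer(offsetIntSize, objectRefSize, numObjects, topObject, offsetTableOffset);
    }
}
```

Something like {{BPlistTrailer.parse(Files.readAllBytes(path))}} would then tell a decoder where the offset table lives and how wide its entries are. A real parser would also need to validate these values against the file length before trusting them.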

> No content extraction for application/x-webarchive format
> -
>
> Key: TIKA-2543
> URL: https://issues.apache.org/jira/browse/TIKA-2543
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.17
> Environment: MacOS 10.13.2 JDK8
> Reporter: Rafael Ferreira
> Priority: Minor
> Attachments: Apache Tika – Configuring Tika.webarchive, tika.plist
>
>
> Steps to reproduce:
> # Using Safari, save any web page as a "webarchive".
> # Use Tika to extract the archive content, as in the example below.
> Expected result:
> I would expect Tika to extract the HTML contents from the webarchive.
> Actual results:
> Nothing is extracted, although the right MIME type is identified.
> {code:java}
> try (BufferedWriter writer = Files.newBufferedWriter(extractedContentPath, Charsets.UTF_8)) {
>   TesseractOCRConfig tesseractOCRConfig = new TesseractOCRConfig();
>   // this looks for content anywhere in the page independently of orientation
>   tesseractOCRConfig.setPageSegMode("11");
>   ParseContext context = new ParseContext();
>   context.set(Parser.class, tika.getParser());
>   context.set(TesseractOCRConfig.class, tesseractOCRConfig);
>   try (InputStream fd = Files.newInputStream(path)) {
>     tika.getParser().parse(fd, new WriteOutContentHandler(writer), new Metadata(), context);
>   } catch (SAXException e) {
>     throw new EngineError(e);
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2543) No content extraction for application/x-webarchive format

2018-10-17 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653764#comment-16653764
 ] 

Tim Allison commented on TIKA-2543:
---

https://medium.com/@karaiskc/understanding-apples-binary-property-list-format-281e6da00dbd



[jira] [Commented] (TIKA-2543) No content extraction for application/x-webarchive format

2018-10-17 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653725#comment-16653725
 ] 

Tim Allison commented on TIKA-2543:
---

Still on the lookout for a Java parser with an Apache-friendly license that 
parses binary plists.  commons-configuration handles flat text and XML plists, 
but not the more modern binary one.
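
For anyone triaging sample files, a rough sketch of how the three plist flavours could be told apart from their leading bytes. This is a heuristic under stated assumptions, not a Tika or Commons API: the class {{PlistSniffer}} and its {{Flavor}} enum are hypothetical, it only checks for the {{bplist}} magic and a few typical XML openings, and it ignores byte-order marks.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not a Tika or Commons class): guesses a plist's flavour from its first bytes. */
final class PlistSniffer {
    enum Flavor { BINARY, XML, TEXT_OR_UNKNOWN }

    static Flavor sniff(byte[] head) {
        // ISO-8859-1 maps every byte to one char, so arbitrary binary input cannot fail to decode.
        String raw = new String(head, 0, Math.min(head.length, 64), StandardCharsets.ISO_8859_1);
        if (raw.startsWith("bplist")) {
            return Flavor.BINARY; // binary plists start with this magic at offset 0
        }
        String start = raw.trim(); // XML files may have leading whitespace
        if (start.startsWith("<?xml") || start.startsWith("<!DOCTYPE plist") || start.startsWith("<plist")) {
            return Flavor.XML;
        }
        // Anything else may be an old-style text plist, or not a plist at all.
        return Flavor.TEXT_OR_UNKNOWN;
    }
}
```

Files reported as {{BINARY}} are the ones that need a new decoder; the other two flavours are the ones an existing text or XML plist reader could be tried on.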



[jira] [Commented] (TIKA-2543) No content extraction for application/x-webarchive format

2018-10-17 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653718#comment-16653718
 ] 

Tim Allison commented on TIKA-2543:
---

TIKA-1358 might be relevant.  We don't currently parse modern Apple files. :(



[jira] [Commented] (TIKA-2543) No content extraction for application/x-webarchive format

2018-10-17 Thread Rafael Ferreira (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653715#comment-16653715
 ] 

Rafael Ferreira commented on TIKA-2543:
---

If someone can point me to the general area of the problem, I'm happy to try 
to get a PR out myself.

Could it be a MIME identification issue that causes the correct parser not to 
be called? 



[jira] [Commented] (TIKA-2543) No content extraction for application/x-webarchive format

2018-10-17 Thread Rafael Ferreira (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653711#comment-16653711
 ] 

Rafael Ferreira commented on TIKA-2543:
---

This seems like a more widespread issue than I imagined: extracting content 
from any plist does not seem to work at the moment. Trying to parse a Pages 
file (Pages version 7.2) triggers the EmptyParser, and no text is extracted. 



[jira] [Commented] (TIKA-2543) No content extraction for application/x-webarchive format

2018-01-21 Thread Rafael Ferreira (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333797#comment-16333797
 ] 

Rafael Ferreira commented on TIKA-2543:
---

[~gagravarr] is this what you had in mind? Attached: 
[^Apache Tika – Configuring Tika.webarchive]



[jira] [Commented] (TIKA-2543) No content extraction for application/x-webarchive format

2018-01-08 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16316029#comment-16316029
 ] 

Nick Burch commented on TIKA-2543:
--

Based on https://en.wikipedia.org/wiki/Webarchive the underlying format for 
these files is the Apple binary plist format. It doesn't look like Commons 
Compress can handle this for us, unless I've missed that?
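
For context on what decoding that format involves, a small sketch of how an object's marker byte is usually interpreted: the high nibble gives the type and the low nibble carries a size or count. The class name {{BPlistMarker}} is hypothetical and the type table is my understanding of the commonly documented layout, so treat it as an assumption to verify against real files.

```java
/** Hypothetical helper, not part of Tika: names the object type encoded in a bplist marker byte. */
final class BPlistMarker {
    static String describe(int marker) {
        int type = (marker >> 4) & 0x0F; // high nibble: object type
        int info = marker & 0x0F;        // low nibble: size or count information
        switch (type) {
            case 0x0:
                if (info == 0x0) return "null";
                if (info == 0x8) return "false";
                if (info == 0x9) return "true";
                return "unknown";
            case 0x1: return "int, " + (1 << info) + " byte(s)";   // size is 2^info bytes
            case 0x2: return "real, " + (1 << info) + " byte(s)";  // size is 2^info bytes
            case 0x3: return "date";
            case 0x4: return "data, " + count(info);
            case 0x5: return "ascii-string, " + count(info);
            case 0x6: return "utf16-string, " + count(info);
            case 0x8: return "uid, " + (info + 1) + " byte(s)";
            case 0xA: return "array, " + count(info);
            case 0xC: return "set, " + count(info);
            case 0xD: return "dict, " + count(info);
            default:  return "unknown";
        }
    }

    // A low nibble of 0xF means the real count is stored in an int object that follows the marker.
    private static String count(int info) {
        return info == 0x0F ? "count follows as an int object" : "count=" + info;
    }
}
```

A webarchive's root object would be expected to be a dict, so a marker in the {{0xD_}} range at the top object's offset is a quick sanity check for test files.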

Tika Devs - anyone know of a suitably licensed plist library for Java?

[~cleverfoo] Are you able to create a small webarchive file for a simple-ish 
page we could use for testing? Maybe something like 
http://tika.apache.org/1.17/configuring.html ?



