[jira] [Commented] (TIKA-2543) No content extraction for application/x-webarchive format
[ https://issues.apache.org/jira/browse/TIKA-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653804#comment-16653804 ]

Nick Burch commented on TIKA-2543:
----------------------------------

Great find, Tim! Looks like an excellent resource on this.

Assuming access to a Mac, so that the {{plutil}} tool is available to generate (and check!) a bunch of representative test files, and with help from the various Tika IO helpers we have, my hunch is it'd be about a day's work to add support for the binary plist format, test it properly, and wire it into Tika. Maybe allow 2 days if new to Tika and/or new to decoding binary file formats.

> No content extraction for application/x-webarchive format
> ---------------------------------------------------------
>
>                 Key: TIKA-2543
>                 URL: https://issues.apache.org/jira/browse/TIKA-2543
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.17
>         Environment: MacOS 10.13.2 JDK8
>            Reporter: Rafael Ferreira
>            Priority: Minor
>         Attachments: Apache Tika – Configuring Tika.webarchive, tika.plist
>
>
> Steps to reproduce:
> # Using Safari, save any web page as "webarchive"
> # Use Tika to extract the archive content, as in the example below
> Expected result:
> I would expect Tika to extract the HTML contents from the webarchive.
> Actual results:
> Nothing is extracted, although the right MIME type is identified.
> {code:java}
> try (BufferedWriter writer = Files.newBufferedWriter(extractedContentPath, Charsets.UTF_8)) {
>     TesseractOCRConfig tesseractOCRConfig = new TesseractOCRConfig();
>     // this looks for content anywhere in the page independently of orientation
>     tesseractOCRConfig.setPageSegMode("11");
>     ParseContext context = new ParseContext();
>     context.set(Parser.class, tika.getParser());
>     context.set(TesseractOCRConfig.class, tesseractOCRConfig);
>     try (InputStream fd = Files.newInputStream(path)) {
>         tika.getParser().parse(fd, new WriteOutContentHandler(writer), new Metadata(), context);
>     } catch (SAXException e) {
>         throw new EngineError(e);
>     }
> }
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
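To give a feel for the decoding work estimated in the comment above, here is a minimal sketch of the first step a binary plist reader would likely need: checking the "bplist" magic and reading the fixed 32-byte trailer plus one offset-table entry. The layout follows the commonly documented binary plist structure and is not taken from this thread or from Tika. The class name, the Trailer holder, and the hand-built sample file are all hypothetical, and real files should be checked against {{plutil}} output.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Tika code: reads the trailer of a binary plist.
class BinaryPlistTrailerSketch {
    private static final byte[] MAGIC = "bplist".getBytes(StandardCharsets.US_ASCII);
    private static final int HEADER_LENGTH = 8;   // "bplist" plus two version characters
    private static final int TRAILER_LENGTH = 32; // fixed-size block at the end of the file

    /** Fields of the trailer, in file order (all big-endian). */
    static final class Trailer {
        final int offsetIntSize;      // bytes per entry in the offset table
        final int objectRefSize;      // bytes per object reference inside containers
        final long numObjects;        // number of entries in the offset table
        final long topObject;         // index of the root object
        final long offsetTableOffset; // file offset where the offset table starts

        Trailer(int offsetIntSize, int objectRefSize, long numObjects,
                long topObject, long offsetTableOffset) {
            this.offsetIntSize = offsetIntSize;
            this.objectRefSize = objectRefSize;
            this.numObjects = numObjects;
            this.topObject = topObject;
            this.offsetTableOffset = offsetTableOffset;
        }
    }

    static Trailer readTrailer(byte[] file) {
        if (file.length < HEADER_LENGTH + TRAILER_LENGTH) {
            throw new IllegalArgumentException("too short to be a binary plist");
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (file[i] != MAGIC[i]) {
                throw new IllegalArgumentException("missing bplist magic");
            }
        }
        // ByteBuffer is big-endian by default, which matches the format.
        ByteBuffer buf = ByteBuffer.wrap(file, file.length - TRAILER_LENGTH, TRAILER_LENGTH);
        buf.position(buf.position() + 6); // skip 5 unused bytes and the sort-version byte
        int offsetIntSize = buf.get() & 0xFF;
        int objectRefSize = buf.get() & 0xFF;
        long numObjects = buf.getLong();
        long topObject = buf.getLong();
        long offsetTableOffset = buf.getLong();
        return new Trailer(offsetIntSize, objectRefSize, numObjects, topObject, offsetTableOffset);
    }

    /** Looks up the file offset of object number {@code index} in the offset table. */
    static long objectOffset(byte[] file, Trailer t, long index) {
        int start = (int) (t.offsetTableOffset + index * t.offsetIntSize);
        long value = 0;
        for (int i = 0; i < t.offsetIntSize; i++) {
            value = (value << 8) | (file[start + i] & 0xFF);
        }
        return value;
    }

    /** Hand-built 44-byte file holding the single ASCII string "hi" (illustration only). */
    static byte[] sampleFile() {
        byte[] f = new byte[44];
        System.arraycopy("bplist00".getBytes(StandardCharsets.US_ASCII), 0, f, 0, 8);
        f[8] = 0x52;  // object 0: ASCII string marker (0x5) with length 2
        f[9] = 'h';
        f[10] = 'i';
        f[11] = 8;    // offset table: object 0 lives at file offset 8
        f[18] = 1;    // trailer: offsetIntSize
        f[19] = 1;    // trailer: objectRefSize
        f[27] = 1;    // trailer: numObjects = 1
        f[43] = 11;   // trailer: offsetTableOffset = 11 (topObject stays 0)
        return f;
    }

    public static void main(String[] args) {
        byte[] file = sampleFile();
        Trailer t = readTrailer(file);
        System.out.println(objectOffset(file, t, t.topObject)); // prints 8
    }
}
```

From the root object's offset, a full decoder would go on to read the marker byte and walk arrays and dictionaries through the offset table. That part is left out here.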
[jira] [Commented] (TIKA-2543) No content extraction for application/x-webarchive format
[ https://issues.apache.org/jira/browse/TIKA-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653764#comment-16653764 ]

Tim Allison commented on TIKA-2543:
-----------------------------------

https://medium.com/@karaiskc/understanding-apples-binary-property-list-format-281e6da00dbd
[jira] [Commented] (TIKA-2543) No content extraction for application/x-webarchive format
[ https://issues.apache.org/jira/browse/TIKA-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653725#comment-16653725 ]

Tim Allison commented on TIKA-2543:
-----------------------------------

Still on the lookout for a Java parser with an Apache-friendly license that parses binary plists. commons-configuration handles the flat-text and XML formats, but not the more modern binary one.
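Since the comment above distinguishes three plist flavors (flat text, XML, and binary), a parser would probably need to tell them apart before choosing a decoder. The following is a minimal sketch of such a check based on the leading bytes. The class and enum names are hypothetical, it is not how Tika or commons-configuration do it, and it ignores byte-order marks and non-ASCII encodings.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: guesses the plist flavor from the first bytes of a file.
class PlistFlavorSketch {
    enum Flavor { BINARY, XML, TEXT_OR_UNKNOWN }

    static Flavor sniff(byte[] head) {
        // ISO-8859-1 maps every byte to one char, so arbitrary binary input cannot fail to decode.
        String start = new String(head, 0, Math.min(head.length, 64), StandardCharsets.ISO_8859_1);
        if (start.startsWith("bplist")) {
            return Flavor.BINARY;
        }
        String trimmed = start.trim();
        if (trimmed.startsWith("<?xml")
                || trimmed.startsWith("<!DOCTYPE plist")
                || trimmed.startsWith("<plist")) {
            return Flavor.XML;
        }
        // Anything else might be an old-style text plist, or not a plist at all.
        return Flavor.TEXT_OR_UNKNOWN;
    }

    public static void main(String[] args) {
        byte[] head = "bplist00".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(head)); // prints BINARY
    }
}
```

Only the BINARY branch would need a new decoder. The other two flavors are the ones the comment says are already handled elsewhere.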
[jira] [Commented] (TIKA-2543) No content extraction for application/x-webarchive format
[ https://issues.apache.org/jira/browse/TIKA-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653718#comment-16653718 ]

Tim Allison commented on TIKA-2543:
-----------------------------------

TIKA-1358 might be relevant. We don't currently parse modern Apple files. :(
[jira] [Commented] (TIKA-2543) No content extraction for application/x-webarchive format
[ https://issues.apache.org/jira/browse/TIKA-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653715#comment-16653715 ]

Rafael Ferreira commented on TIKA-2543:
---------------------------------------

If someone can point me to the general area of the problem, I'm happy to try to get a PR out myself. Could it be a MIME identification issue that keeps the correct parser from being called?
[jira] [Commented] (TIKA-2543) No content extraction for application/x-webarchive format
[ https://issues.apache.org/jira/browse/TIKA-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653711#comment-16653711 ]

Rafael Ferreira commented on TIKA-2543:
---------------------------------------

This seems like a more widespread issue than I imagined: extracting content from any plist does not seem to work at the moment. Trying to parse a Pages file (Pages version 7.2) triggers the EmptyParser, and no text is extracted.
[jira] [Commented] (TIKA-2543) No content extraction for application/x-webarchive format
[ https://issues.apache.org/jira/browse/TIKA-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333797#comment-16333797 ]

Rafael Ferreira commented on TIKA-2543:
---------------------------------------

[~gagravarr] is this what you had in mind? Attached. [^Apache Tika – Configuring Tika.webarchive]
[jira] [Commented] (TIKA-2543) No content extraction for application/x-webarchive format
[ https://issues.apache.org/jira/browse/TIKA-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16316029#comment-16316029 ]

Nick Burch commented on TIKA-2543:
----------------------------------

Based on https://en.wikipedia.org/wiki/Webarchive, the underlying format for these is the Apple binary plist format. It doesn't look like Commons Compress can handle this for us, unless I've missed something?

Tika devs: does anyone know of a suitably licensed plist library for Java?

[~cleverfoo] Are you able to create a small webarchive file for a simple-ish page that we could use for testing? Maybe something like http://tika.apache.org/1.17/configuring.html ?
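Once a plist decoder exists, getting at the HTML that the reporter expects is probably a matter of walking the decoded dictionary. The sketch below assumes the plist has already been decoded into plain Java maps by some hypothetical decoder, and that the top-level dictionary uses the key names commonly seen in Safari webarchives ("WebMainResource", "WebResourceData", "WebResourceTextEncodingName"). Those key names are not confirmed anywhere in this thread and should be verified against a real test file, for example by dumping it with {{plutil}}.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: pulls the main page text out of an already-decoded webarchive plist.
class WebArchiveMainResourceSketch {

    /** Returns the main resource decoded as text, or null if the expected keys are absent. */
    static String mainResourceText(Map<String, Object> root) {
        Object main = root.get("WebMainResource"); // assumed key name
        if (!(main instanceof Map)) {
            return null;
        }
        Map<?, ?> resource = (Map<?, ?>) main;
        Object data = resource.get("WebResourceData"); // assumed key name
        if (!(data instanceof byte[])) {
            return null;
        }
        return new String((byte[]) data, charsetOf(resource.get("WebResourceTextEncodingName")));
    }

    /** Falls back to UTF-8 when the encoding name is missing, malformed, or unsupported. */
    private static Charset charsetOf(Object name) {
        if (name instanceof String) {
            try {
                if (Charset.isSupported((String) name)) {
                    return Charset.forName((String) name);
                }
            } catch (IllegalArgumentException e) {
                // illegal charset name: fall through to the default
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        Map<String, Object> resource = new HashMap<>();
        resource.put("WebResourceData", "<html><body>Hello</body></html>".getBytes(StandardCharsets.UTF_8));
        resource.put("WebResourceTextEncodingName", "UTF-8");
        Map<String, Object> root = new HashMap<>();
        root.put("WebMainResource", resource);
        System.out.println(mainResourceText(root)); // prints <html><body>Hello</body></html>
    }
}
```

In a Tika parser, the recovered bytes would more likely be handed to the existing HTML parser as an embedded document than converted to a String directly. The String conversion here only keeps the sketch self-contained.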