[jira] [Commented] (TIKA-1511) Create a parser for SQLite3
[ https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14291829#comment-14291829 ] Tim Allison commented on TIKA-1511: --- The RecursiveParserWrapper should allow, that, no? Create a parser for SQLite3 --- Key: TIKA-1511 URL: https://issues.apache.org/jira/browse/TIKA-1511 Project: Tika Issue Type: New Feature Components: parser Affects Versions: 1.6 Reporter: Luis Filipe Nassif Fix For: 1.8 Attachments: TIKA-1511v1.patch, TIKA-1511v2.patch, TIKA-1511v3.patch, testSQLLite3b.db, testSQLLite3b.db I think it would be very useful, as sqlite is used as data storage by a wide range of applications. Opening the ticket to track it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1511) Create a parser for SQLite3
[ https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14291829#comment-14291829 ] Tim Allison edited comment on TIKA-1511 at 1/26/15 1:52 PM: The RecursiveParserWrapper should allow that, no? With the caveat that it caches all output in memory... was (Author: talli...@mitre.org): The RecursiveParserWrapper should allow, that, no?
[jira] [Comment Edited] (TIKA-1511) Create a parser for SQLite3
[ https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14291829#comment-14291829 ] Tim Allison edited comment on TIKA-1511 at 1/26/15 2:36 PM: The RecursiveParserWrapper should allow that, no? With the caveat that it caches all output in memory... You should be able to parse the output from the standard recursive XHTML output as well. Right? If you have a chance (and if you haven't done so already), fork branch 1511 from my github site and take a look at the output of the test cases...throw in some print statements and see if that'll work. For testRecursiveParserWrapper(), change BasicContentHandlerFactory.HANDLER_TYPE.BODY to BasicContentHandlerFactory.HANDLER_TYPE.XML. was (Author: talli...@mitre.org): The RecursiveParserWrapper should allow that, no? With the caveat that it caches all output in memory...
[jira] [Commented] (TIKA-1518) Docker with Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14291999#comment-14291999 ] Konstantin Gribov commented on TIKA-1518: - Thank you, [~davemeikle]. It works perfectly, so can be easily used to evaluate Tika. I'll add info to wiki if it isn't there already. Docker with Tika Server --- Key: TIKA-1518 URL: https://issues.apache.org/jira/browse/TIKA-1518 Project: Tika Issue Type: New Feature Reporter: Paul Ramirez Fix For: 1.8 This version should be able to demonstrate as many of Apache Tika's capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to show parsers which require installation of other dependencies. In addition, this should help move TIKA-1301 forward and should leverage the suggestion made by [~lewismc] of a script which can pull down the latest version of Apache Tika.
[jira] [Created] (TIKA-1530) MP4Parser parses duration but does not set it
Oskar Wickström created TIKA-1530: - Summary: MP4Parser parses duration but does not set it Key: TIKA-1530 URL: https://issues.apache.org/jira/browse/TIKA-1530 Project: Tika Issue Type: Improvement Components: parser Reporter: Oskar Wickström Priority: Minor See the TODO comment at https://github.com/apache/tika/blob/trunk/tika-parsers/src/main/java/org/apache/tika/parser/mp4/MP4Parser.java#L167
[jira] [Comment Edited] (TIKA-1489) PDF Text extraction without permission
[ https://issues.apache.org/jira/browse/TIKA-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292203#comment-14292203 ]

Tim Allison edited comment on TIKA-1489 at 1/26/15 7:17 PM:

I haven't been able to find standards in XMP or elsewhere. DC's [[accessRights]] and [[rights]] are as close as I could find, but they aren't a good fit. Has anyone had any luck finding a standard?

I did just open up MSWord to see what is available there with the current document format. I don't have Information Rights Management (IRM) set up so I can't see exactly what that offers, but it looks like, MSWord has these options:
* Read only
* Restricted editing
** tracked changes
** comments
** filling in forms
** read only (yes, again)
* Restricted Access (this is what I can't experiment with)
** Edit permission
** Copy permission
** Print permission

LibreOffice's Writer appears to have:
* Read Only
* Record Changes

There are clearly some overlaps with the permissions allowed in PDF, but there are also some differences. For most of Tika's use cases (I think), we'd want to set a general Tika Metadata key/value for do not extract text if both pdf fields were false or if the MSOffice CopyPermission were false???

||Application||Permission Name||
|PDF|CanExtractContent|
|PDF|CanExtractForAccessibility|
|MSOffice|Copy permission|

Should we start with PDFBox's AccessPermission as a model and add where necessary from there?

was (Author: talli...@mitre.org):

I haven't been able to find standards in XMP or elsewhere. Has anyone had any luck?

I did just open up MSWord to see what is available there with the current document format. I don't have Information Rights Management (IRM) set up so I can't see exactly what that offers, but it looks like, MSWord has these options:
* Read only
* Restricted editing
** tracked changes
** comments
** filling in forms
** read only (yes, again)
* Restricted Access (this is what I can't experiment with)
** Edit permission
** Copy permission
** Print permission

LibreOffice's Writer appears to have:
* Read Only
* Record Changes

There are clearly some overlaps with the permissions allowed in PDF, but there are also some differences. For most of Tika's use cases (I think), we'd want to set a general Tika Metadata key/value for do not extract text if both pdf fields were false or if the MSOffice CopyPermission were false???

||Application||Permission Name||
|PDF|CanExtractContent|
|PDF|CanExtractForAccessibility|
|MSOffice|Copy permission|

PDF Text extraction without permission
--
Key: TIKA-1489
URL: https://issues.apache.org/jira/browse/TIKA-1489
Project: Tika
Issue Type: Bug
Affects Versions: 1.7
Reporter: Tilman Hausherr

In TIKA-1442 text extraction from files like 717226.pdf that don't have text extraction permission works. The permissions in PDF files are only enforced by the application (i.e. PDFBox), i.e. the text information isn't stored separately in encrypted form. PDFBox ExtractText command line does throw an exception. So I wonder why TIKA is able to extract text. Either TIKA or the PDFBox call used bypasses the permission checking.
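The rule floated in the comment above (flag "do not extract text" when both PDF fields are false, or when the MSOffice copy permission is false) could be sketched roughly as follows. This is only an illustration in plain Java: the class, the method names and the metadata key are hypothetical and are not part of Tika or PDFBox.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of the proposed rule; none of these names exist in Tika or PDFBox. */
final class ExtractionPermissionSketch {

    /** Hypothetical key for a general "may text be extracted?" metadata entry. */
    static final String CAN_EXTRACT_TEXT = "X-Sketch-Can-Extract-Text";

    /** PDF: extraction is blocked only if BOTH flags from the table are false. */
    static boolean pdfAllowsExtraction(boolean canExtractContent,
                                       boolean canExtractForAccessibility) {
        return canExtractContent || canExtractForAccessibility;
    }

    /** MSOffice: extraction is blocked if the copy permission is false. */
    static boolean msOfficeAllowsExtraction(boolean copyPermission) {
        return copyPermission;
    }

    /** Records the decision as a single key/value pair, as the comment suggests. */
    static Map<String, String> toMetadata(boolean allowed) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put(CAN_EXTRACT_TEXT, Boolean.toString(allowed));
        return metadata;
    }

    public static void main(String[] args) {
        // A PDF that forbids general extraction but allows it for accessibility.
        boolean allowed = pdfAllowsExtraction(false, true);
        System.out.println(toMetadata(allowed));
    }
}
```

Whether a parser should then refuse to emit text, or only record the flag and leave the decision to the caller, is exactly the open question in this thread.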
[jira] [Commented] (TIKA-1530) MP4Parser parses duration but does not set it
[ https://issues.apache.org/jira/browse/TIKA-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292081#comment-14292081 ] Oskar Wickström commented on TIKA-1530: --- Sure, I'll look into it tomorrow. :)
[jira] [Commented] (TIKA-1489) PDF Text extraction without permission
[ https://issues.apache.org/jira/browse/TIKA-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292204#comment-14292204 ] Tim Allison commented on TIKA-1489: --- We need to respect document permissions before publishing extracted content.
[jira] [Commented] (TIKA-1518) Docker with Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292435#comment-14292435 ] Paul Ramirez commented on TIKA-1518: I missed this over the weekend while playing with Docker, but yes, [~chrismattmann], it looks to be exactly what I was thinking. +1 to leaving this open until it's in the Apache Tika codebase. Dave, I will definitely use this for a project and commit updates to it.
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292929#comment-14292929 ] Ivan Ryndin commented on TIKA-1513: --- There is no reliable way to detect the codepage of DBF files. I haven't come across a DBF spec in which the codepage is specified by a dedicated byte. The only way to determine the codepage is trial and error. --- One possibly interesting approach would be to detect the codepage the way language detection is done, i.e. with a statistics-based method such as n-gram models. I haven't come across a ready-to-use framework that detects codepages this way. However, I'm not sure it is worth implementing. Add mime detection and parsing for dbf files Key: TIKA-1513 URL: https://issues.apache.org/jira/browse/TIKA-1513 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.8 I just came across an Apache licensed dbf parser that is available on [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. Let's add dbf parsing to Tika. Any other recommendations for alternate parsers?
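To make the n-gram idea from the comment above concrete, here is a minimal sketch using only the JDK. It trial-decodes the bytes with each candidate codepage and keeps the one whose character bigrams overlap most with a reference profile. The class and method names are made up for illustration, and the tiny profile in `main` is a stand-in: a usable detector would presumably need per-language profiles built from a real corpus, which this sketch does not attempt.

```java
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of statistics-based codepage guessing; not an existing library. */
final class CodepageGuessSketch {

    /** Collects the set of character bigrams that occur in a reference text. */
    static Set<String> bigrams(String text) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            grams.add(text.substring(i, i + 2));
        }
        return grams;
    }

    /** Fraction of the decoded text's bigrams that also occur in the profile. */
    static double score(String decoded, Set<String> profile) {
        int total = decoded.length() - 1;
        if (total < 1) {
            return 0.0;
        }
        int hits = 0;
        for (int i = 0; i < total; i++) {
            if (profile.contains(decoded.substring(i, i + 2))) {
                hits++;
            }
        }
        return (double) hits / total;
    }

    /** Trial-decodes with every supported candidate and returns the best-scoring name. */
    static String guess(byte[] data, List<String> candidates, Set<String> profile) {
        String best = null;
        double bestScore = -1.0;
        for (String name : candidates) {
            if (!Charset.isSupported(name)) {
                continue; // skip codepages this JVM does not provide
            }
            double s = score(new String(data, Charset.forName(name)), profile);
            if (s > bestScore) {
                bestScore = s;
                best = name;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy profile: bigrams of a two-word Russian phrase, written with Unicode escapes.
        Set<String> profile = bigrams("\u043f\u0440\u0438\u0432\u0435\u0442 \u043c\u0438\u0440");
        byte[] field = "\u043c\u0438\u0440".getBytes(Charset.forName("windows-1251"));
        System.out.println(guess(field, Arrays.asList("IBM866", "windows-1251"), profile));
    }
}
```

Short fields give very few bigrams to score, so in practice one would likely pool all text fields of a DBF file before guessing; that is a design assumption, not something established in this thread.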
[jira] [Commented] (TIKA-1521) Handle password protected 7zip files
[ https://issues.apache.org/jira/browse/TIKA-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14293079#comment-14293079 ] Oskar Wickström commented on TIKA-1521: --- Me too, using OS X 10.10.
{code}
java version "1.8.0_25"
Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)
{code}
Handle password protected 7zip files Key: TIKA-1521 URL: https://issues.apache.org/jira/browse/TIKA-1521 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.7 Reporter: Nick Burch Fix For: 1.8 While working on TIKA-1028, I noticed that while Commons Compress doesn't currently handle decrypting password protected zip files, it does handle password protected 7zip files. We should therefore add logic to the package parser to spot password protected 7zip files, and fetch the password for them from a PasswordProvider if one is given.
[jira] [Commented] (TIKA-1530) MP4Parser parses duration but does not set it
[ https://issues.apache.org/jira/browse/TIKA-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14293120#comment-14293120 ] ASF GitHub Bot commented on TIKA-1530: --
GitHub user owickstrom opened a pull request: https://github.com/apache/tika/pull/25
TIKA-1530: Include parsed mp4 duration in metadata
Note that I couldn't get all tests working in the project (https://issues.apache.org/jira/browse/TIKA-1521?focusedCommentId=14288719&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14288719) so I have only run `org.apache.tika.parser.mp4.MP4ParserTest`. If someone else with a working build could try this perhaps?
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/owickstrom/tika trunk
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/25.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #25
commit 5300d22e6c7f71c353ad84a7cf4534a8efff85da
Author: Oskar Wickström oskar.wickst...@live.com
Date: 2015-01-27T07:28:00Z
TIKA-1530: Include parsed mp4 duration in metadata
[GitHub] tika pull request: TIKA-1530: Include parsed mp4 duration in metad...
GitHub user owickstrom opened a pull request: https://github.com/apache/tika/pull/25
TIKA-1530: Include parsed mp4 duration in metadata
Note that I couldn't get all tests working in the project (https://issues.apache.org/jira/browse/TIKA-1521?focusedCommentId=14288719&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14288719) so I have only run `org.apache.tika.parser.mp4.MP4ParserTest`. If someone else with a working build could try this perhaps?
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/owickstrom/tika trunk
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/25.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #25
commit 5300d22e6c7f71c353ad84a7cf4534a8efff85da
Author: Oskar Wickström oskar.wickst...@live.com
Date: 2015-01-27T07:28:00Z
TIKA-1530: Include parsed mp4 duration in metadata
---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
[jira] [Commented] (TIKA-1530) MP4Parser parses duration but does not set it
[ https://issues.apache.org/jira/browse/TIKA-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293124#comment-14293124 ] Oskar Wickström commented on TIKA-1530: --- https://github.com/apache/tika/pull/25
[jira] [Issue Comment Deleted] (TIKA-1530) MP4Parser parses duration but does not set it
[ https://issues.apache.org/jira/browse/TIKA-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oskar Wickström updated TIKA-1530: -- Comment: was deleted (was: https://github.com/apache/tika/pull/25)
[jira] [Commented] (TIKA-1489) PDF Text extraction without permission
[ https://issues.apache.org/jira/browse/TIKA-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292581#comment-14292581 ] Luis Filipe Nassif commented on TIKA-1489: -- If Tika's default behavior is going to change, please provide a way to restore the previous behavior. My app is a forensic one, and protected content is very relevant to my clients. Thanks