[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-26 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291829#comment-14291829
 ] 

Tim Allison commented on TIKA-1511:
---

The RecursiveParserWrapper should allow that, no?

 Create a parser for SQLite3
 ---

 Key: TIKA-1511
 URL: https://issues.apache.org/jira/browse/TIKA-1511
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.6
Reporter: Luis Filipe Nassif
 Fix For: 1.8

 Attachments: TIKA-1511v1.patch, TIKA-1511v2.patch, TIKA-1511v3.patch, 
 testSQLLite3b.db, testSQLLite3b.db


 I think it would be very useful, as sqlite is used as data storage by a wide 
 range of applications. Opening the ticket to track it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TIKA-1511) Create a parser for SQLite3

2015-01-26 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291829#comment-14291829
 ] 

Tim Allison edited comment on TIKA-1511 at 1/26/15 1:52 PM:


The RecursiveParserWrapper should allow that, no?  With the caveat that it 
caches all output in memory...


was (Author: talli...@mitre.org):
The RecursiveParserWrapper should allow that, no?



[jira] [Comment Edited] (TIKA-1511) Create a parser for SQLite3

2015-01-26 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291829#comment-14291829
 ] 

Tim Allison edited comment on TIKA-1511 at 1/26/15 2:36 PM:


The RecursiveParserWrapper should allow that, no?  With the caveat that it 
caches all output in memory...  You should also be able to parse the standard 
recursive XHTML output, right?

If you have a chance (and if you haven't done so already), fork branch 1511 
from my GitHub site and take a look at the output of the test cases. Throw in 
some print statements and see if that'll work.  For 
testRecursiveParserWrapper(), change 
BasicContentHandlerFactory.HANDLER_TYPE.BODY to 
BasicContentHandlerFactory.HANDLER_TYPE.XML.


was (Author: talli...@mitre.org):
The RecursiveParserWrapper should allow that, no?  With the caveat that it 
caches all output in memory...



[jira] [Commented] (TIKA-1518) Docker with Tika Server

2015-01-26 Thread Konstantin Gribov (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291999#comment-14291999
 ] 

Konstantin Gribov commented on TIKA-1518:
-

Thank you, [~davemeikle]. It works perfectly, so it can easily be used to 
evaluate Tika. 

I'll add info to the wiki if it isn't there already.

 Docker with Tika Server
 ---

 Key: TIKA-1518
 URL: https://issues.apache.org/jira/browse/TIKA-1518
 Project: Tika
  Issue Type: New Feature
Reporter: Paul Ramirez
 Fix For: 1.8


 This version should be able to demonstrate as many of Apache Tika's 
 capabilities as possible. For instance, it could include GDAL, Tesseract, and 
 FFmpeg to show parsers that require installation of other dependencies. In 
 addition, this should help move TIKA-1301 forward and should leverage the 
 suggestion made by [~lewismc] of a script that can pull down the latest 
 version of Apache Tika.





[jira] [Created] (TIKA-1530) MP4Parser parses duration but does not set it

2015-01-26 Thread Oskar Wickström (JIRA)
Oskar Wickström created TIKA-1530:
-

 Summary: MP4Parser parses duration but does not set it 
 Key: TIKA-1530
 URL: https://issues.apache.org/jira/browse/TIKA-1530
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Oskar Wickström
Priority: Minor


See the TODO comment at 
https://github.com/apache/tika/blob/trunk/tika-parsers/src/main/java/org/apache/tika/parser/mp4/MP4Parser.java#L167





[jira] [Comment Edited] (TIKA-1489) PDF Text extraction without permission

2015-01-26 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292203#comment-14292203
 ] 

Tim Allison edited comment on TIKA-1489 at 1/26/15 7:17 PM:


I haven't been able to find standards in XMP or elsewhere.  DC's 
[[accessRights]] and [[rights]] are as close as I could find, but they aren't a 
good fit. 

Has anyone had any luck finding a standard?

I did just open up MSWord to see what is available there with the current 
document format.  I don't have Information Rights Management (IRM) set up so I 
can't see exactly what that offers, but it looks like MSWord has these options:
* Read only
* Restricted editing
** tracked changes
** comments
** filling in forms
** read only (yes, again)
* Restricted Access (this is what I can't experiment with)
** Edit permission
** Copy permission
** Print permission

LibreOffice's Writer appears to have:
* Read Only
* Record Changes

There are clearly some overlaps with the permissions allowed in PDF, but there 
are also some differences.  For most of Tika's use cases (I think), we'd want 
to set a general Tika Metadata key/value for "do not extract text" if both PDF 
fields were false or if the MSOffice Copy permission were false?

||Application||Permission Name||
|PDF|CanExtractContent|
|PDF|CanExtractForAccessibility|
|MSOffice|Copy permission|

Should we start with PDFBox's AccessPermission as a model and add where 
necessary from there?
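
The rule floated above (flag "do not extract text" when both PDF fields are 
false, or when the MSOffice Copy permission is false) could be sketched in 
plain Java roughly as follows. This is only an illustration under stated 
assumptions: the class and method names are hypothetical and belong to neither 
Tika nor PDFBox, and a null argument is assumed to mean "this permission does 
not apply to the document".

```java
/** Hypothetical helper for illustration; not part of Tika or PDFBox. */
final class ExtractionPermissionSketch {

    /**
     * Returns true when a general "do not extract text" flag would be set.
     * Each argument mirrors one row of the table above; null is assumed to
     * mean the permission is not applicable or not known for this document.
     */
    static boolean doNotExtractText(Boolean pdfCanExtractContent,
                                    Boolean pdfCanExtractForAccessibility,
                                    Boolean msOfficeCopyPermission) {
        // PDF: only deny when BOTH extraction fields are explicitly false.
        boolean pdfDenies = Boolean.FALSE.equals(pdfCanExtractContent)
                && Boolean.FALSE.equals(pdfCanExtractForAccessibility);
        // MSOffice: deny when the Copy permission is explicitly false.
        boolean officeDenies = Boolean.FALSE.equals(msOfficeCopyPermission);
        return pdfDenies || officeDenies;
    }

    public static void main(String[] args) {
        System.out.println(doNotExtractText(false, false, null)); // true
        System.out.println(doNotExtractText(false, true, null));  // false
        System.out.println(doNotExtractText(null, null, false));  // true
    }
}
```

Using nullable Booleans keeps "permission denied" distinct from "permission 
not present in this format", which matters because the formats only partly 
overlap.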


was (Author: talli...@mitre.org):
I haven't been able to find standards in XMP or elsewhere.  Has anyone had any 
luck?

I did just open up MSWord to see what is available there with the current 
document format.  I don't have Information Rights Management (IRM) set up so I 
can't see exactly what that offers, but it looks like MSWord has these options:
* Read only
* Restricted editing
** tracked changes
** comments
** filling in forms
** read only (yes, again)
* Restricted Access (this is what I can't experiment with)
** Edit permission
** Copy permission
** Print permission

LibreOffice's Writer appears to have:
* Read Only
* Record Changes

There are clearly some overlaps with the permissions allowed in PDF, but there 
are also some differences.  For most of Tika's use cases (I think), we'd want 
to set a general Tika Metadata key/value for "do not extract text" if both PDF 
fields were false or if the MSOffice Copy permission were false?

||Application||Permission Name||
|PDF|CanExtractContent|
|PDF|CanExtractForAccessibility|
|MSOffice|Copy permission|

 PDF Text extraction without permission
 --

 Key: TIKA-1489
 URL: https://issues.apache.org/jira/browse/TIKA-1489
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Tilman Hausherr

 In TIKA-1442 text extraction from files like 717226.pdf that don't have text 
 extraction permission works. The permissions in PDF files are only enforced 
 by the application (i.e. PDFBox), i.e. the text information isn't stored 
 separately in encrypted form. 
 PDFBox ExtractText command line does throw an exception.
 So I wonder why TIKA is able to extract text. Either TIKA or the PDFBox call 
 used bypasses the permission checking.





[jira] [Commented] (TIKA-1530) MP4Parser parses duration but does not set it

2015-01-26 Thread Oskar Wickström (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292081#comment-14292081
 ] 

Oskar Wickström commented on TIKA-1530:
---

Sure, I'll look into it tomorrow. :)

 MP4Parser parses duration but does not set it 
 --

 Key: TIKA-1530
 URL: https://issues.apache.org/jira/browse/TIKA-1530
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Oskar Wickström
Priority: Minor

 See the TODO comment at 
 https://github.com/apache/tika/blob/trunk/tika-parsers/src/main/java/org/apache/tika/parser/mp4/MP4Parser.java#L167





[jira] [Commented] (TIKA-1489) PDF Text extraction without permission

2015-01-26 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292204#comment-14292204
 ] 

Tim Allison commented on TIKA-1489:
---

We need to respect document permissions before publishing extracted content.



[jira] [Commented] (TIKA-1518) Docker with Tika Server

2015-01-26 Thread Paul Ramirez (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292435#comment-14292435
 ] 

Paul Ramirez commented on TIKA-1518:


Missed this over the weekend while playing with Docker, but yes, 
[~chrismattmann], it looks to be exactly what I was thinking. +1 to leaving 
this open until it's in the Apache Tika codebase. Dave, I will definitely use 
this for a project and commit updates to it.



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-01-26 Thread Ivan Ryndin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292929#comment-14292929
 ] 

Ivan Ryndin commented on TIKA-1513:
---

There is no reliable way to detect the codepage of a DBF file. I haven't seen 
a DBF spec in which the codepage is indicated by some special byte.
The only way to determine the codepage is trial and error.
---
One possibly interesting approach would be to detect the codepage the way 
language detection works, that is, statistically, with n-gram based methods. 
I haven't come across any ready-to-use framework that detects codepages this 
way, and I'm not sure it is worth implementing.
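
To make the statistical idea concrete, here is a minimal sketch that combines 
trial and error with character bigrams: decode the bytes with each candidate 
codepage and keep the one whose letter bigrams best match a sample of text in 
the expected language. Everything here is an assumption for illustration (the 
class name, the scoring rule, the candidate list); it uses only the JDK and is 
not an existing Tika or dbf-reader facility.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of bigram-based codepage guessing; not a Tika API. */
final class CodepageGuessSketch {

    /** Collects the distinct lower-cased letter bigrams that occur in the text. */
    static Set<String> letterBigrams(String text) {
        Set<String> bigrams = new HashSet<>();
        String lower = text.toLowerCase(Locale.ROOT);
        for (int i = 0; i + 1 < lower.length(); i++) {
            char a = lower.charAt(i);
            char b = lower.charAt(i + 1);
            if (Character.isLetter(a) && Character.isLetter(b)) {
                bigrams.add("" + a + b);
            }
        }
        return bigrams;
    }

    /** Fraction of letter-bigram occurrences in the decoded text found in the profile. */
    static double score(String decoded, Set<String> profile) {
        String lower = decoded.toLowerCase(Locale.ROOT);
        int total = 0;
        int known = 0;
        for (int i = 0; i + 1 < lower.length(); i++) {
            char a = lower.charAt(i);
            char b = lower.charAt(i + 1);
            if (Character.isLetter(a) && Character.isLetter(b)) {
                total++;
                if (profile.contains("" + a + b)) {
                    known++;
                }
            }
        }
        return total == 0 ? 0.0 : (double) known / total;
    }

    /**
     * Tries every supported candidate and returns the best-scoring codepage
     * name, or null when no candidate shares a single bigram with the profile.
     */
    static String guessCodepage(byte[] data, List<String> candidates, String referenceText) {
        Set<String> profile = letterBigrams(referenceText);
        String best = null;
        double bestScore = 0.0;
        for (String name : candidates) {
            if (!Charset.isSupported(name)) {
                continue; // skip codepages this JVM does not provide
            }
            double s = score(new String(data, Charset.forName(name)), profile);
            if (s > bestScore) {
                bestScore = s;
                best = name;
            }
        }
        return best;
    }

    /** Usage: java CodepageGuessSketch <field-bytes-file> <utf8-reference-text-file> */
    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: CodepageGuessSketch <data> <reference.txt>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        String reference = new String(Files.readAllBytes(Paths.get(args[1])), StandardCharsets.UTF_8);
        List<String> candidates = Arrays.asList("windows-1251", "IBM866", "KOI8-R");
        System.out.println(guessCodepage(data, candidates, reference));
    }
}
```

The obvious limitation is that the caller has to know the expected language 
and supply reference text for it, and short or mostly numeric fields give the 
scorer very little to work with, so this supports rather than contradicts the 
doubt about whether it is worth implementing.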

 Add mime detection and parsing for dbf files
 

 Key: TIKA-1513
 URL: https://issues.apache.org/jira/browse/TIKA-1513
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
 Fix For: 1.8


 I just came across an Apache licensed dbf parser that is available on 
 [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom].
 Let's add dbf parsing to Tika.
 Any other recommendations for alternate parsers?





[jira] [Commented] (TIKA-1521) Handle password protected 7zip files

2015-01-26 Thread Oskar Wickström (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14293079#comment-14293079
 ] 

Oskar Wickström commented on TIKA-1521:
---

Me too, using OS X 10.10.
{code}
java version "1.8.0_25"
Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)
{code}

 Handle password protected 7zip files
 

 Key: TIKA-1521
 URL: https://issues.apache.org/jira/browse/TIKA-1521
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.7
Reporter: Nick Burch
 Fix For: 1.8


 While working on TIKA-1028, I noticed that while Commons Compress doesn't 
 currently handle decrypting password protected zip files, it does handle 
 password protected 7zip files.
 We should therefore add logic into the package parser to spot password 
 protected 7zip files, and fetch the password for them from a PasswordProvider 
 if given.





[jira] [Commented] (TIKA-1530) MP4Parser parses duration but does not set it

2015-01-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14293120#comment-14293120
 ] 

ASF GitHub Bot commented on TIKA-1530:
--

GitHub user owickstrom opened a pull request:

https://github.com/apache/tika/pull/25

TIKA-1530: Include parsed mp4 duration in metadata

Note that I couldn't get all tests working in the project 
(https://issues.apache.org/jira/browse/TIKA-1521?focusedCommentId=14288719&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14288719)
 so I have only run `org.apache.tika.parser.mp4.MP4ParserTest`. If someone else 
with a working build could try this perhaps?

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/owickstrom/tika trunk

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/25.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #25


commit 5300d22e6c7f71c353ad84a7cf4534a8efff85da
Author: Oskar Wickström oskar.wickst...@live.com
Date:   2015-01-27T07:28:00Z

TIKA-1530: Include parsed mp4 duration in metadata






[GitHub] tika pull request: TIKA-1530: Include parsed mp4 duration in metad...

2015-01-26 Thread owickstrom
GitHub user owickstrom opened a pull request:

https://github.com/apache/tika/pull/25

TIKA-1530: Include parsed mp4 duration in metadata

Note that I couldn't get all tests working in the project 
(https://issues.apache.org/jira/browse/TIKA-1521?focusedCommentId=14288719&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14288719)
 so I have only run `org.apache.tika.parser.mp4.MP4ParserTest`. If someone else 
with a working build could try this perhaps?

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/owickstrom/tika trunk

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/25.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #25


commit 5300d22e6c7f71c353ad84a7cf4534a8efff85da
Author: Oskar Wickström oskar.wickst...@live.com
Date:   2015-01-27T07:28:00Z

TIKA-1530: Include parsed mp4 duration in metadata




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (TIKA-1530) MP4Parser parses duration but does not set it

2015-01-26 Thread Oskar Wickström (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14293124#comment-14293124
 ] 

Oskar Wickström commented on TIKA-1530:
---

https://github.com/apache/tika/pull/25



[jira] [Issue Comment Deleted] (TIKA-1530) MP4Parser parses duration but does not set it

2015-01-26 Thread Oskar Wickström (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oskar Wickström updated TIKA-1530:
--
Comment: was deleted

(was: https://github.com/apache/tika/pull/25)



[jira] [Commented] (TIKA-1489) PDF Text extraction without permission

2015-01-26 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292581#comment-14292581
 ] 

Luis Filipe Nassif commented on TIKA-1489:
--

If Tika's default behavior is changed, please provide a way to restore the 
previous behavior. My app is a forensic one, and protected content is very 
relevant to my clients.

Thanks
