[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-04-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502684#comment-14502684
 ] 

Hudson commented on TIKA-1511:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #634 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/634/])
TIKA-1511, move xerial dependency to 'provided' (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1674800)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-app/src/main/appended-resources/META-INF/LICENSE
* /tika/trunk/tika-parsers/pom.xml
* /tika/trunk/tika-parsers/src/main/appended-resources/META-INF/LICENSE


 Create a parser for SQLite3
 ---

 Key: TIKA-1511
 URL: https://issues.apache.org/jira/browse/TIKA-1511
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.6
Reporter: Luis Filipe Nassif
 Fix For: 1.8

 Attachments: TIKA-1511v1.patch, TIKA-1511v2.patch, TIKA-1511v3.patch, 
 TIKA-1511v3bis.patch, testSQLLite3b.db, testSQLLite3b.db


 I think it would be very useful, as sqlite is used as data storage by a wide 
 range of applications. Opening the ticket to track it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-03-30 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386562#comment-14386562
 ] 

Tim Allison commented on TIKA-1511:
---

Thank you, [~thetaphi].  I was aware of about half of that, but I'm very 
grateful to have the full story from an expert and to know that I won't break 
Solr.  

I agree about the benefits of segregating parsers.  As Konstantin pointed out, 
we're trying to head in that direction.

Thank you, again!



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-03-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386648#comment-14386648
 ] 

Hudson commented on TIKA-1511:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #583 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/583/])
TIKA-1511 include xerial and native libs; some cleanup of README in preparation 
for 1.8 release (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670069)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-parsers/pom.xml




[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-03-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387865#comment-14387865
 ] 

Hudson commented on TIKA-1511:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #589 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/589/])
TIKA-1511: add public domain license notice for Sqlite to main License.txt 
(tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670239)
* /tika/trunk/LICENSE.txt




[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-03-29 Thread Konstantin Gribov (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385802#comment-14385802
 ] 

Konstantin Gribov commented on TIKA-1511:
-

+1 for including xerial in tika-app and tika-server.

If you want to include it in tika-parsers as a non-provided/optional dep, we 
should have an explicit note about the presence of native libs in tika-parsers.
Then it'll be OK.

As far as I know, Solr 5.0+ is not a classic webapp (as it was before) but a 
standalone app, and it shouldn't have such classloading issues, since its parts 
aren't redeployed while Solr is running.



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-03-29 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385803#comment-14385803
 ] 

Uwe Schindler commented on TIKA-1511:
-

Solr uses ANT + IVY to build. We don't use transitive dependencies at all! So 
whenever TIKA is updated, the person who does this prints the dependency tree and 
then fills all required information into the ivy.xml file and our 
ivy-versions.properties file :-) In general, we carefully decide which 
dependencies are really needed. Because TIKA automatically disables parsers 
that do not load, we have already removed various files (like the netcdf parser - 
LGPL) or the ASM parser (we don't support indexing Java class files by 
default).

For the current one: We don't want to have native libraries anywhere (we don't 
even ship our own native libs for WindowsDirectory). Users need to do this 
themselves (start msvcc/gcc). So we would not ship with SQLite support by default.

In general it would be good to have some easier plugin mechanism to allow Solr 
to pick only some parsers to ship by default and those the user can download 
(e.g. by a script). So it would be good to have multiple parser JARs. So maybe 
put all crazy parsers that fork processes or call native libs into a separate 
TIKA parser bundle. The default one should only have pure-Java stuff with as 
few dependencies as possible...



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-03-29 Thread Konstantin Gribov (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385836#comment-14385836
 ] 

Konstantin Gribov commented on TIKA-1511:
-

The idea of better tika-parsers module separation was discussed some time ago; 
it's also mentioned in the Tika 2.0 roadmap 
(https://wiki.apache.org/tika/Tika2_0RoadMap).

In such a case, the user would get the appropriate {{tika-parsers-*}} modules with 
their deps (e.g., via {{mvn dependency:copy}} or something similar) and Solr could 
depend only on {{tika-core}} and a minimal {{tika-parsers-*}}. Or it could depend 
only on {{tika-core}}, but that would lead to the standard questions about why it 
doesn't work, as with {{slf4j}} in Solr 4.



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-03-29 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385773#comment-14385773
 ] 

Tim Allison commented on TIKA-1511:
---

Any objections to including xerial with app and server rather than as provided? 
We can include instructions for excluding it on unsupported OSes or in webapps 
with security/native lib restrictions.



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-03-29 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385782#comment-14385782
 ] 

Tim Allison commented on TIKA-1511:
---

[~thetaphi], will there be any problems for Solr if we remove provided for 
xerial in parsers' pom?



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-02-13 Thread Konstantin Gribov (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320059#comment-14320059
 ] 

Konstantin Gribov commented on TIKA-1511:
-

With the v3 patch, forbiddenapis found that {{SQLite3RowReader}} uses 
{{SimpleDateFormat}} without an explicit {{locale}} set. I hope it's enough to 
use {{Locale.getDefault()}}.
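
A minimal sketch of the kind of change forbiddenapis is asking for, in case it helps. The class name, method name and date pattern are hypothetical and are not taken from {{SQLite3RowReader}}. It uses {{Locale.ROOT}} and a fixed UTC time zone so the output does not depend on the machine; passing {{Locale.getDefault()}} as suggested above would also make the locale explicit while keeping the previous behavior.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper for illustration; not the actual SQLite3RowReader code.
class ExplicitLocaleDateFormat {

    // Formats a date with an explicitly chosen locale and time zone,
    // instead of relying on the single-argument SimpleDateFormat constructor.
    static String formatUtc(Date date) {
        SimpleDateFormat format =
                new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return format.format(date);
    }

    public static void main(String[] args) {
        // Prints the start of the Unix epoch in UTC.
        System.out.println(formatUtc(new Date(0L)));
    }
}
```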

Also, I fixed {{TestSQLLiteParser}}: it tried to load an absent test resource; it 
seems that it was renamed to {{testSqlite3b.db}}. Tests pass successfully with it.

Do we also need {{testSQLITE3.db}} in {{tika-parsers}}? I can't find any test 
that uses this file.




[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-02-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320088#comment-14320088
 ] 

Tim Allison commented on TIKA-1511:
---

[~gagravarr], I can't find a mime test that uses testSQLITE3.db in various revs 
of {{TestMimeTypes.java}} for TIKA-1502.  Did you add one at some point?  If 
not, should we remove that file and add a mime test for the sqlite test file 
that I added for the parser?



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-02-13 Thread Konstantin Gribov (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320103#comment-14320103
 ] 

Konstantin Gribov commented on TIKA-1511:
-

[~talli...@mitre.org], r1659547 works fine. Tests for sqlite3 pass. Thanks)



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-02-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320107#comment-14320107
 ] 

Tim Allison commented on TIKA-1511:
---

Great.  Thank you.  Let me know if we should make any changes in the format of 
the output or if there are any surprises.



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-02-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320073#comment-14320073
 ] 

Tim Allison commented on TIKA-1511:
---

Oh, wait, those were errors in v3 of my patch attached here.  

I made several changes from v3 before committing.  You shouldn't see the 
misspelled testSQLLite3b.db in trunk, and I fixed the date format before 
committing.  Let me know if you see these in trunk.  I don't.

On testSQLITE3.db, that was added for a mime test.  I'm looking into r1647473 
and its history now to see where that test was/is.  On the theory of do no 
harm, I chose not to remove that or replace it.





[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-02-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320194#comment-14320194
 ] 

Tim Allison commented on TIKA-1511:
---

Very cool.  Thank you for checking on that.  Looks like the issue is only a 
Windows issue: I get e.g. 
{{sqlite-3.8.7-2ee1c7aa-2ec8-47ad-bf74-073acc79a850-sqlitejdbc.dll}} each time 
I run Tika and it hits a sqlite3 file, and they are not deleted.
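
For anyone who wants to check their own machine, here is a rough sketch that lists extracted native libraries left behind in the temp directory. The class and method names are hypothetical, and the glob {{sqlite-*.{dll,so}}} is only an assumption based on the file names mentioned in this thread; other platforms or versions may use different names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper; not part of Tika or sqlite-jdbc.
class LeftoverNativeLibs {

    // Returns files directly inside dir whose names match the assumed
    // pattern for extracted sqlite native libraries.
    static List<Path> find(Path dir) {
        List<Path> found = new ArrayList<>();
        try (DirectoryStream<Path> stream =
                     Files.newDirectoryStream(dir, "sqlite-*.{dll,so}")) {
            for (Path p : stream) {
                found.add(p);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    // Builds a scratch directory with one matching and one unrelated file
    // and returns how many files find() reports for it.
    static int selfCheck() {
        try {
            Path dir = Files.createTempDirectory("leftover-demo");
            Files.createFile(dir.resolve("sqlite-3.8.7-demo-sqlitejdbc.dll"));
            Files.createFile(dir.resolve("unrelated.txt"));
            return find(dir).size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path tmp = Paths.get(System.getProperty("java.io.tmpdir"));
        for (Path p : find(tmp)) {
            System.out.println(p);
        }
    }
}
```

Running {{main}} after a few Tika runs should show whether the files accumulate on a given platform.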

If we'd prefer to include xerial's jar with our bundle to make integration 
easier (for those not in webapp environments :) ), I'm happy to make the change.



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-02-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320042#comment-14320042
 ] 

Tim Allison commented on TIKA-1511:
---

Mea culpa.  Give r1659547 a try.

What would be the benefit of optional vs supplied?



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-02-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320063#comment-14320063
 ] 

Tim Allison commented on TIKA-1511:
---

Will fix now.  No idea how my tests passed with those errors...Thank you.



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-02-13 Thread Konstantin Gribov (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320183#comment-14320183
 ] 

Konstantin Gribov commented on TIKA-1511:
-

I don't see a lot of {{/tmp/sqlite-*.so}} files, only one while db is open. 
After closing connections/db it is removed automagically.



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-02-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320354#comment-14320354
 ] 

Tim Allison commented on TIKA-1511:
---

Would anyone be able to offer help on this one?

Are permissions issues preventing xerial's wrapper from writing the .so files 
to Jenkins' temp folder?

{noformat}
Error Message

org.sqlite.core.NativeDB._open(Ljava/lang/String;I)V
Stacktrace

java.lang.UnsatisfiedLinkError: 
org.sqlite.core.NativeDB._open(Ljava/lang/String;I)V
at org.sqlite.core.NativeDB._open(Native Method)
at org.sqlite.core.DB.open(DB.java:161)
at org.sqlite.core.CoreConnection.open(CoreConnection.java:145)
at org.sqlite.core.CoreConnection.<init>(CoreConnection.java:66)
at org.sqlite.jdbc3.JDBC3Connection.<init>(JDBC3Connection.java:21)
at org.sqlite.jdbc4.JDBC4Connection.<init>(JDBC4Connection.java:23)
at org.sqlite.SQLiteConnection.<init>(SQLiteConnection.java:45)
at org.sqlite.JDBC.createConnection(JDBC.java:114)
at org.sqlite.SQLiteConfig.createConnection(SQLiteConfig.java:101)
{noformat}
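
One way to test the permissions theory on the build machine would be a small probe like the sketch below. It only checks whether the JVM can create and delete a file in {{java.io.tmpdir}}; it is an assumption that the native library is extracted there, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical probe; it does not call into sqlite-jdbc at all.
class TempDirProbe {

    // Returns true if a temporary file can be created and then deleted in dir.
    static boolean canWrite(Path dir) {
        try {
            Path probe = Files.createTempFile(dir, "sqlite-probe-", ".tmp");
            Files.delete(probe);
            return true;
        } catch (IOException | SecurityException e) {
            // Missing directory, missing permissions or a security manager veto.
            return false;
        }
    }

    public static void main(String[] args) {
        Path tmp = Paths.get(System.getProperty("java.io.tmpdir"));
        System.out.println(tmp + " writable: " + canWrite(tmp));
    }
}
```

If this reports the temp folder as writable, permissions are probably not the cause and the native library itself would be the next suspect.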




[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-02-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320360#comment-14320360
 ] 

Tim Allison commented on TIKA-1511:
---

Perhaps revert to 3.8.6 according to 
[this|https://bitbucket.org/xerial/sqlite-jdbc/issue/152/387-version-linux-issue]?




[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-02-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320371#comment-14320371
 ] 

Tim Allison commented on TIKA-1511:
---

reverted to 3.8.6 in r1659598.

If anyone has an ubuntu machine and wants to try reverting until we have 
success, that would be better than me trying through Hudson.  :)

Let's see if 3.8.6 is the charm.



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-02-13 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320307#comment-14320307
 ] 

Hudson commented on TIKA-1511:
--

UNSTABLE: Integrated in tika-trunk-jdk1.7 #489 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/489/])
TIKA-1511, third time is the charm...many apologies (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1659547)
* /tika/trunk/tika-core/src/main/java/org/apache/tika/metadata/Database.java
TIKA-1511, with new files added...doh (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1659545)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/jdbc
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/jdbc/AbstractDBParser.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/jdbc/JDBCTableReader.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/jdbc/SQLite3DBParser.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/jdbc/SQLite3Parser.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/jdbc/SQLite3TableReader.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/jdbc
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/jdbc/SQLite3ParserTest.java




[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-02-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320678#comment-14320678
 ] 

Tim Allison commented on TIKA-1511:
---

Reverting to an earlier version of sqlite-jdbc worked, but I find it 
unsettling.  Do we want to include this parser as part of the standard distro 
or should we offer it as a third party parser?  The licenses are good, but 
dependencies on native libs give me some concern...especially after that 
experience.



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-02-13 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320617#comment-14320617
 ] 

Hudson commented on TIKA-1511:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #490 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/490/])
TIKA-1511 try to revert to earlier version of sqlite-jdbc to avoid 
unsatisfiedlikeerror on ubuntu (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1659598)
* /tika/trunk/tika-parsers/pom.xml




[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-02-13 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320797#comment-14320797
 ] 

Luis Filipe Nassif commented on TIKA-1511:
--

As there are native libs only for Windows, Linux and Mac OS X, maybe adding a 
check for them into getSupportedTypes could make the parser more robust?
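
A rough sketch of what such a check could look like. To keep it runnable without Tika on the classpath it uses plain strings; in the real parser the same decision would sit in {{getSupportedTypes}} and return media types rather than strings. The class name, the OS-name matching and the {{application/x-sqlite3}} type string are assumptions made for illustration.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the check; not the actual SQLite3Parser code.
class NativeLibCheck {

    // Assumed MIME type string for SQLite3 files.
    static final String SQLITE_TYPE = "application/x-sqlite3";

    // True if the OS name looks like one of the three platforms
    // mentioned above as having native libs.
    static boolean hasNativeLib(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        return os.contains("windows") || os.contains("linux") || os.contains("mac");
    }

    // Advertises the SQLite type only where a native lib is expected,
    // so the parser is never selected on other platforms.
    static Set<String> supportedTypes(String osName) {
        if (hasNativeLib(osName)) {
            return Collections.singleton(SQLITE_TYPE);
        }
        return Collections.emptySet();
    }

    public static void main(String[] args) {
        System.out.println(supportedTypes(System.getProperty("os.name")));
    }
}
```

Returning an empty set means sqlite files would fall through to whatever fallback handling applies on unsupported platforms, instead of failing with a link error at parse time.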



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-02-13 Thread Konstantin Gribov (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320022#comment-14320022
 ] 

Konstantin Gribov commented on TIKA-1511:
-

[~talli...@mitre.org], you can also make it {{<optional>true</optional>}} 
instead of {{provided}}.

Also, I can't find the parser itself 
({{org.apache.tika.parser.jdbc.SQLite3Parser}}) in trunk rev 1659449.



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-02-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14319457#comment-14319457
 ] 

Hudson commented on TIKA-1511:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #487 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/487/])
TIKA-1511 add parser for sqlite3 (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1659449)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-bundle/pom.xml
* /tika/trunk/tika-parsers/pom.xml
* /tika/trunk/tika-parsers/src/main/appended-resources/META-INF/LICENSE
* /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testSqlite3b.db




[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-02-12 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318232#comment-14318232
 ] 

Tim Allison commented on TIKA-1511:
---

Bottom line: it will be simpler to treat the full db with all tables as one big 
file.  We can still treat clobs and blobs as embedded documents.

Details:
When I tried to cut out the {{JDBCInputStream}} and just send in a zero byte 
{{InputStream}}, regular parsing worked properly.

However, if a user tries to use a {{ParserContainerExtractor}}, that fails to 
reach the BLOBs because of this:
{code}
MediaType type = detector.detect(tis, metadata);

if (extractor == null) {
    // Let the handler process the embedded resource
    handler.handle(filename, type, tis);
} else {
    // Use a temporary file to process the stream twice
    File file = tis.getFile();

    // Let the handler process the embedded resource
    InputStream input = TikaInputStream.get(file);
    try {
        handler.handle(filename, type, input);
    } finally {
        input.close();
    }

    // Recurse
    extractor.extract(tis, extractor, handler);
}
{code}

When the extractor is called below the {{//Recurse}} comment, it only sees the 
zero-byte {{TikaInputStream}}. It does not see the {{type}} or the 
{{metadata}}.  So, in the case of {{AutoDetectParser}}, it only sees a zero 
byte {{InputStream}} and therefore detects it as {{application/octet-stream}}.  
In short, there is no current way to pass the detected type through to the 
extractor.  We could, of course, add a parameter for {{type}} or {{metadata}} 
to the ParserContainerExtractor's {{extract}} signature...
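
To make the problem and the suggested signature change concrete, here is a minimal sketch. The detector and both extract variants are toy stand-ins written for this illustration; none of these names or signatures are Tika API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Toy stand-ins; none of these names are Tika API. */
final class TypeHintSketch {
    /** Content-only detection: an empty stream can only be reported as octet-stream. */
    static String detect(InputStream in) {
        try {
            in.mark(4);
            int first = in.read();
            in.reset();
            return first == -1 ? "application/octet-stream" : "text/plain";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Current shape: the recursive call re-detects and loses the caller's type. */
    static String extract(InputStream in) {
        return detect(in);
    }

    /** Proposed shape: an already detected type is passed through and wins. */
    static String extract(InputStream in, String detectedType) {
        return detectedType != null ? detectedType : detect(in);
    }
}
```

With a zero-byte stream, the one-argument variant can only report application/octet-stream, while the two-argument variant keeps whatever type the caller already knew.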




[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-30 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14298791#comment-14298791
 ] 

Luis Filipe Nassif commented on TIKA-1511:
--

Hi [~talli...@apache.org], I am ok with removing the virtual csv/html inputStream 
(there is no embedded table stream, as you pointed out before), but I think an 
inputStream that cannot be read is strange. Maybe back off to the big doc 
approach... What are the advantages of handling each table like an embedded doc?



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-30 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14299000#comment-14299000
 ] 

Tim Allison commented on TIKA-1511:
---

From a search perspective, the search experience is typically better with 
smaller documents than with enormous docs.

As for the oddity, y, I agree, but we do it in AbstractPOIFSExtractor.



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-30 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14299171#comment-14299171
 ] 

Luis Filipe Nassif commented on TIKA-1511:
--

For search could we split the big xhtml output with a contentHandlerDecorator?



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-30 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14299175#comment-14299175
 ] 

Tim Allison commented on TIKA-1511:
---

We could...I'm more inclined to go with the RecursiveParserWrapper, but parsing 
should work.



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-30 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14298670#comment-14298670
 ] 

Tim Allison commented on TIKA-1511:
---

Thank you, Nick, for reviewing this!  I'll fix the wildcards...not sure how 
those crept in and the assertContains...

I'm not happy with the added complexity of the JDBCInputStream.

Bottom line: should we get rid of that option and back off to a zero-byte 
InputStream and grabbing the table object from the OpenContainer?  That would 
simplify quite a bit, including detection... And, it would make this parser 
behave like the PST parser...I think.  If we really want to add it later, we 
can, but simpler is better...

[~lfcnassif], would you be ok with that proposal?

As for another jdbc-based format, I completely agree.  Can you recommend 
another single-file db format?  Access comes to mind, but I can't find a pure 
Java parser that has jdbc: Jackcess (LGPL) has its own api and doesn't support 
jdbc.  I looked briefly at derby, hsqldb, mysql, and they all seem to rely on a 
directory of files...I very well could have missed a single file option for 
those, though...







[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-30 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14298702#comment-14298702
 ] 

Tim Allison commented on TIKA-1511:
---

h2 appears to be MPL _or_ EPL.  According to [apache legal 
faq|http://www.apache.org/legal/resolved.html], MPL 2.0 is good as long as we 
include the license info and the disclaimer.

So, h2 should work, no?  Other candidates?



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-30 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14298416#comment-14298416
 ] 

Nick Burch commented on TIKA-1511:
--

Few minor things on Tim's github branch for this - I'm seeing some wildcard 
imports being added, and some assertContains being replaced with 
assertTrue(str.contains) - the latter doesn't give as helpful an exception for 
the assert failing. Does the branch need updating, or are there spurious 
changes that've come in?

I've had a quick look at the diff to the branch, but not a full one. My initial 
impression is that there was more logic than I'd expected in 
JDBCResultSetInputStream and JDBCRowReader, but not necessarily a problematic 
amount. I'm still not entirely sure of the idea that depending on how you 
access the embedded stream, you get different behaviour. If you have a Word 
document embedded in a PDF, the embedded stream doesn't say "I'll give you Word 
if you ask one way, Plain Text if you ask another", it just says "here's the 
content type, you'll need to find a suitable parser or fail trying".

For the specific use case of something that iterates through a file, dumping 
out all embedded resources without parsing them, if we do support it for these 
JDBC tables (I'm tempted to say for that use case we don't return anything for 
the table), we could just have a special case wrapper which parses to HTML as 
normal and returns that, rather than messing around with maybe html via jdbc, 
maybe magically csv.

Also, it'd be good if we could have implementations for 2 different jdbc-based 
formats if we can. That should help us verify we've got the split between 
abstract jdbc and sqlite parts correct!
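
On the assertContains point raised at the top of this comment, the difference in failure output can be sketched as follows. The helper below is a minimal stand-in written for illustration and is not the project's actual test utility.

```java
/** Minimal stand-in for a test helper; not the project's actual implementation. */
final class AssertSketch {
    /** Fails with a message that shows both the expected fragment and the text searched. */
    static void assertContains(String needle, String haystack) {
        if (!haystack.contains(needle)) {
            throw new AssertionError("'" + needle + "' not found in:\n" + haystack);
        }
    }

    /** Equivalent of assertTrue(str.contains(...)): the failure carries no detail. */
    static void assertTrue(boolean condition) {
        if (!condition) {
            throw new AssertionError();
        }
    }
}
```

When the first form fails, the error names the missing fragment and the text that was searched; the second form only reports that some condition was false.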



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-27 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14294402#comment-14294402
 ] 

Luis Filipe Nassif commented on TIKA-1511:
--

Hum... RecursiveParserWrapper is very cool! I did not have a chance to look at 
it before, thank you. Currently I am doing something similar with a custom 
EmbeddedDocumentExtractor. For sure RecursiveParserWrapper can help with that 
use case!



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-26 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14291829#comment-14291829
 ] 

Tim Allison commented on TIKA-1511:
---

The RecursiveParserWrapper should allow that, no?



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-25 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14291148#comment-14291148
 ] 

Luis Filipe Nassif commented on TIKA-1511:
--

My specific use case is to produce a single XHTML file for each table that can 
be displayed to the user.



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-23 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290303#comment-14290303
 ] 

Tim Allison commented on TIKA-1511:
---

I'm not sure I understand the need for that.  Won't you be able to send in 
whatever handler you want via the regular call to parse and by attaching a 
ParsingEmbeddedDocumentExtractor?  What, exactly, do you want to have when Tika 
has finished processing the Sqlite file?



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-21 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285621#comment-14285621
 ] 

Luis Filipe Nassif commented on TIKA-1511:
--

No problems, the design looks good!



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285568#comment-14285568
 ] 

Tim Allison commented on TIKA-1511:
---

{quote}
A) I think it will work, as the patch works now. But I think an inputStream 
that can not be read is a bit strange.
{quote}
Agreed.  The new proposal is to make the InputStream readable, but the regular 
use case of an AutoDetectParser sent in via ParseContext won't bother to read 
the InputStream, rather, it will read the table object and use the 
user-supplied ContentHandler.

{quote}
B) Could it be better to send a xHTML inputStream with markup to client instead 
of simple UTF-8 encoded CSV?
{quote}
We could, but there are other ways of getting that...RecursiveParserWrapper or 
custom recursive embedded parser handler or even just sending in the plain 
AutoDetectParser as the EmbeddedDocumentExtractor/Parser in ParseContext.  The 
idea behind this is to support a ParserContainerExtractor that would normally 
pull just the bytes from embedded documents...because there are no bytes for a 
table object (i.e. it never exists as an actual standalone file), I propose a 
csv proxy.

{quote}
C) I agree, but it will work only if he adds the correct parser (eg TableParser 
or CompositeParser) to ParseContext, right?
{quote}
The user will have to add an AutoDetectParser to the ParseContext, and we will 
need to add {{org.apache.tika.parser.jdbc.SQLite3Parser}} and 
{{org.apache.tika.parser.jdbc.JDBCTableParser}} to the parser services file. 

I have a draft of this proposal working.  The current downside is that if the 
client resets and rereads the InputStream, the blobs/clobs are processed twice 
via the EmbeddedDocumentExtractor.  

Any problems with the above?  Recommendations for an alternate design?



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-20 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283883#comment-14283883
 ] 

Tim Allison commented on TIKA-1511:
---

Hi [~lfcnassif], based on your point about the tika-app's -z option and its 
FileEmbeddedDocumentExtractor that just copies bytes from the InputStream to a 
file, I propose the following.  I have a strong preference to treat each table 
as an embedded file, but if it isn't possible, it isn't possible.

So, the proposal for making use of classes that implement 
EmbeddedDocumentExtractor for each table:

A) If the EmbeddedDocumentExtractor is a parsing EmbeddedDocumentExtractor, the 
correct parser will be called, and it will grab a JDBC object from a 
wrapper/modification of TikaInputStream...it will not actually read the 
InputStream at all.  The output will go into whatever handler is passed in.

B) If a client reads the bytes from the input stream, they'll get a UTF-8 
encoded CSV InputStream, without BLOBs and CLOBs...the 
EmbeddedDocumentExtractor will be called for each individual BLOB and CLOB.

C) If a client uses the basic pattern of adding a Parser to the ParseContext, 
they'll get one big file with {{<div>}} markup for the different tables.  

D) If a client uses the RecursiveParserWrapper (not recommended for large 
dbs!), there will be one metadata object for each table, and one metadata 
object for each BLOB and CLOB...in short, potentially a large number of 
embedded documents.

I'll mock up this plan and attach a patch if this sounds reasonable.

If this does work out, we might consider refactoring the PSTParser to treat 
individual emails in a similar way.
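
Option B above, the UTF-8 encoded CSV view of a table, could be sketched like this. It is only an illustration under stated assumptions: rows are plain lists of strings rather than a JDBC ResultSet, the class name is hypothetical, and BLOB/CLOB columns are assumed to have been left out by the caller.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Sketch of the "CSV proxy" idea; rows are plain lists here instead of a JDBC ResultSet. */
final class CsvProxySketch {
    /** Quotes a cell when it contains a comma, quote or line break (RFC 4180 style). */
    static String escape(String cell) {
        if (cell == null) {
            return "";
        }
        if (cell.contains(",") || cell.contains("\"") || cell.contains("\n") || cell.contains("\r")) {
            return "\"" + cell.replace("\"", "\"\"") + "\"";
        }
        return cell;
    }

    /** Renders rows as CSV text, one line per row. */
    static String toCsv(List<List<String>> rows) {
        StringBuilder sb = new StringBuilder();
        for (List<String> row : rows) {
            for (int i = 0; i < row.size(); i++) {
                if (i > 0) {
                    sb.append(',');
                }
                sb.append(escape(row.get(i)));
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    /** What a byte-copying client would see: a UTF-8 stream of the CSV text. */
    static InputStream asStream(List<List<String>> rows) {
        return new ByteArrayInputStream(toCsv(rows).getBytes(StandardCharsets.UTF_8));
    }
}
```

This version buffers the whole table in memory, which is fine for a sketch but would not suit large tables; a streaming implementation would render rows on demand as the client reads.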



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-20 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283988#comment-14283988
 ] 

Luis Filipe Nassif commented on TIKA-1511:
--

Hi [~talli...@apache.org]. First, you're doing a great job, thank you. I only 
want to help with some ideas, because I will not have time in near future to 
help with the patch.

A) I think it will work, as the patch works now. But I think an inputStream 
that can not be read is a bit strange.
B) Could it be better to send a xHTML inputStream with markup to client instead 
of simple UTF-8 encoded CSV?
C) I agree, but it will work only if he adds the correct parser (eg TableParser 
or CompositeParser) to ParseContext, right?
D) I agree, that would be great.



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14280433#comment-14280433
 ] 

Tim Allison commented on TIKA-1511:
---

H... This will fail if someone sends in a custom EmbeddedDocumentExtractor 
because there is no way to pass the StatementTablePair to that interface via 
ParseContext. 

Some options:
1) We could go back to treating the db as one big doc, as we do with xls, but I 
think I'd prefer to treat each table as a separate doc.

2) We could get rid of the StatementTablePair hack, extract the text from each 
table into a String and then pass that into EmbeddedDocumentExtractor as the 
InputStream.  The drawback to this is that we'd ignore the handler and lose 
potential <tr>/<td> markup.

 Any ideas on this?



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-16 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14280553#comment-14280553
 ] 

Luis Filipe Nassif commented on TIKA-1511:
--

I think it will fail if someone sends in a custom EmbeddedDocExtractor (EDE) 
because it will probably try to read from the empty ByteArrayInputStream to get 
the table. The StatementTablePair will be there, but a custom EDE would not 
know to look for it in the parseContext.

1) I prefer to handle each table as an embedded doc too, if it is possible. If 
not, let's go back.

2) Is it possible to generate a HTML representation of the tables and pass it 
into EDE? By default could it be handled by HtmlParser? Does HtmlParser 
currently extract embedded docs, like images? Can we insert the BLOBs into that 
HTML so that the HtmlParser will extract those BLOBs?

If this approach is possible, we can use pipedWriter and pipedReader to not 
hold the entire HTML/Tables in memory, possibly huge ones.
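
The piped reader/writer idea can be sketched with JDK classes alone. In this minimal illustration a producer thread writes generated row markup into a PipedWriter while the consumer reads it line by line, so only the pipe's small buffer is held in memory. The class name and the row markup are assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.UncheckedIOException;

/** Sketch: stream generated table markup through a pipe instead of buffering it all. */
final class PipedTableSketch {
    /** Writes rowCount rows from a producer thread and counts them on the reading side. */
    static int streamRows(final int rowCount) {
        try {
            final PipedWriter writer = new PipedWriter();
            PipedReader reader = new PipedReader(writer);
            Thread producer = new Thread(() -> {
                // Closing the writer signals end-of-stream to the reader.
                try (PipedWriter w = writer) {
                    for (int i = 0; i < rowCount; i++) {
                        w.write("<tr><td>row " + i + "</td></tr>\n");
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            producer.start();
            int seen = 0;
            try (BufferedReader in = new BufferedReader(reader)) {
                while (in.readLine() != null) {
                    seen++;
                }
            }
            producer.join();
            return seen;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

The producer and consumer must run on different threads, because a write blocks once the pipe buffer is full and only a concurrent reader can drain it.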



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14280829#comment-14280829
 ] 

Tim Allison commented on TIKA-1511:
---

Y, the HTML representation is generated by wrapping the handler in an 
XHTMLHandler as other parsers do, and in v2 of the patch, this actually works.  
No need to get HtmlParser involved.

If you want plain text, use a BodyContentHandler.

I may be missing your point on HtmlParser and PipedReader/Writer.

I added two tests that just print out the output from standard AutoDetectParser 
and from a RecursiveParserWrapper that wraps AutoDetect...let me know what you 
think.



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-16 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14281086#comment-14281086
 ] 

Luis Filipe Nassif commented on TIKA-1511:
--

If the inputStream (pseudoInputStream) received by EmbeddedDocExtractor cannot 
be read, I think using EDE is not useful. How will this approach work with the 
TikaCli --extract option? My original idea was to support a use case like 
TikaCli --extract...

Now I think this extraction of tables to files can be done by handling the db as 
one big doc and using a ContentHandlerDecorator that will split the xhtml 
output at table boundaries. Each xhtml segment can be converted to a byte[] (if 
small) and then to a ByteArrayInputStream that can be passed to an 
EmbeddedDocDecorator, if set on parseContext. If not set, the 
ContentHandlerDecorator does not need to split tables and can fall back to default 
behavior. A custom EDE can then extract tables to files if desired.

So now I think we could go with the big doc approach. What do you think?
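
Splitting the output at table boundaries could be sketched as below. This is an assumption-laden illustration: it uses the JDK's DefaultHandler as a stand-in for a ContentHandlerDecorator, feeds it from a small XHTML string rather than a parser's output, and just stores each finished segment where a real handler would pass it on.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Sketch: collect the text of each <table> separately while SAX events stream by. */
final class TableSplitSketch extends DefaultHandler {
    final List<String> tables = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a <table>

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("table".equals(qName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("td".equals(qName) && current != null) {
            current.append(' '); // separate cells with a space
        } else if ("table".equals(qName) && current != null) {
            // Table boundary: hand the finished segment on (here, just store it).
            tables.add(current.toString().trim());
            current = null;
        }
    }

    /** Parses a small XHTML fragment and returns one text segment per table. */
    static List<String> split(String xhtml) {
        try {
            TableSplitSketch handler = new TableSplitSketch();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.tables;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The sketch does not handle nested tables, and it relies on the default (non-namespace-aware) SAX parser so that qName is the bare element name.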



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-15 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279748#comment-14279748
 ] 

Tim Allison commented on TIKA-1511:
---

Sounds good, y, I think the user will have to handcraft depth handling for now.

Question for the community...

To call the EmbeddedDocumentExtractor for each table, I can't just pass it an 
InputStream -- there is no InputStream, just a Connection and a table name 
against which to run the select * from tablename.

One solution would be to create a special mime-type, 
tika-internal/jdbc-table, and then a JDBCTableParser that supports that 
mime-type, but pulls a ConnectionTableNamePair (or something?) from the 
ParseContext.

Other ideas?
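
The idea of an internal mime-type plus a pair object pulled from the context could be sketched as follows. Everything here is a stand-in written for illustration: ContextSketch only mimics a class-keyed context, TableRef and TableParserSketch are hypothetical names, and a JDBC URL string takes the place of an open Connection so the example needs no database driver.

```java
import java.util.HashMap;
import java.util.Map;

/** Stand-in for a class-keyed context; it only mimics the get/set-by-class idea. */
final class ContextSketch {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    <T> T get(Class<T> key) {
        return key.cast(entries.get(key));
    }
}

/** Hypothetical carrier for what a table parser needs in place of stream bytes. */
final class TableRef {
    final String jdbcUrl;   // stands in for an open java.sql.Connection
    final String tableName;

    TableRef(String jdbcUrl, String tableName) {
        this.jdbcUrl = jdbcUrl;
        this.tableName = tableName;
    }
}

/** Hypothetical parser for the internal table type; it never reads a stream. */
final class TableParserSketch {
    static final String TYPE = "tika-internal/jdbc-table";

    /** Builds the query the parser would run, or fails if no table reference was supplied. */
    static String query(String mediaType, ContextSketch context) {
        TableRef ref = context.get(TableRef.class);
        if (!TYPE.equals(mediaType) || ref == null) {
            throw new IllegalArgumentException("no table reference for " + mediaType);
        }
        return "select * from " + ref.tableName;
    }
}
```

The weak point of this design is visible in the failure branch: any extractor that does not know to look up the pair in the context has nothing to work with.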



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-14 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14277044#comment-14277044
 ] 

Luis Filipe Nassif commented on TIKA-1511:
--

1) I vote to handle each table as a separate/embedded item with 
EmbeddedDocumentExtractor. If the user does not set an EmbeddedDocumentExtractor 
in the ParseContext, the parser should fall back to 
ParsingEmbeddedDocumentExtractor, which will simply append all tables using divs. 
That makes the parser more flexible.
2) I think the same can be applied here.
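
The fallback in 1) could look roughly like the sketch below. It uses stand-in types rather than Tika's classes: TableHandler, AppendingTableHandler and TableHandlers are hypothetical names, and the plain Map plays the role of the ParseContext. The default handler just wraps each rendered table in a div, as an assumption about what "append all tables" would mean.

```java
import java.util.Map;

/** Hypothetical callback invoked once per table. */
interface TableHandler {
    void handleTable(String tableName, String renderedTable);
}

/** Default behavior: append every table to one document, each in its own div. */
final class AppendingTableHandler implements TableHandler {
    final StringBuilder out = new StringBuilder();

    @Override
    public void handleTable(String tableName, String renderedTable) {
        out.append("<div>").append(renderedTable).append("</div>");
    }
}

final class TableHandlers {
    private TableHandlers() {}

    /** Uses the handler the caller put in the context, else the fallback. */
    static TableHandler resolve(Map<Class<?>, Object> context, TableHandler fallback) {
        Object configured = context.get(TableHandler.class);
        return configured instanceof TableHandler ? (TableHandler) configured : fallback;
    }
}
```

A user who wants one file per table would register their own handler in the context; everyone else gets the single big document.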



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-14 Thread Konstantin Gribov (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14277080#comment-14277080
 ] 

Konstantin Gribov commented on TIKA-1511:
-

[~talli...@mitre.org], working with tables as separate files looks good. Maybe 
also migrate Excel parsing to the same behavior. Consistent behavior is good 
from the principle-of-least-surprise point of view.

Treating BLOBs as embedded documents gives the library user the ability to 
configure their detection, parsing and extraction via {{ParseContext}}, AFAIK. 
E.g. a Tika user can just detect the MIME type (and, maybe, metadata) when 
parsing a database table.

But this leads to one issue: the user may want different behavior for different 
levels of embedded documents, e.g. parse the first level (table) and only extract 
metadata for the second (a blob in some field). For me it'll be a real case in some 
projects. In such a case the user may want to pass some {{ParseContext}}, or a 
factory for it, to {{EmbeddedDocumentExtractor}}. Such an improvement can be done 
afterwards.



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-13 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275605#comment-14275605
 ] 

Luis Filipe Nassif commented on TIKA-1511:
--

I think the jdbc based AbstractClass is a great route!



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-13 Thread Konstantin Gribov (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275703#comment-14275703
 ] 

Konstantin Gribov commented on TIKA-1511:
-

[~lfcnassif], +1. IMHO, ManifoldCF connectors are quite a heavy dependency. 
{{tika-app.jar}} is about 30MiB now.



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275217#comment-14275217
 ] 

Tim Allison commented on TIKA-1511:
---

Thank you, [~grossws]!  

Two questions:

1) On how to exclude the native libs...is it ok to require that people 
re-bundle, that is just get rid of the dependency in the pom and build from 
scratch? Is there a cleaner method?

2) Would it be better to require users who want SQLite3 parsing to add xerial 
to their classpath?  



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-13 Thread Konstantin Gribov (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275286#comment-14275286
 ] 

Konstantin Gribov commented on TIKA-1511:
-

The usual way is to exclude the Maven dependency and check for the presence of 
some {{xerial}} class before using it in the appropriate Tika parser (i.e. call 
{{Class.forName("org.sqlite.JDBC")}} and catch {{ClassNotFoundException}}). I 
don't know how consistently {{tika-parsers}} uses this approach.
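
A minimal sketch of that presence check, assuming the driver class is {{org.sqlite.JDBC}} as named above. OptionalDriverCheck and its method names are hypothetical helpers, not existing Tika code.

```java
/** Hypothetical helper for detecting an optional dependency at runtime. */
final class OptionalDriverCheck {
    private OptionalDriverCheck() {}

    /** Returns true if the named class can be located on the classpath. */
    static boolean isClassPresent(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, OptionalDriverCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Missing class, or a class whose own dependencies are missing
            return false;
        }
    }

    /** True if the xerial SQLite JDBC driver appears to be available. */
    static boolean isSqliteDriverAvailable() {
        return isClassPresent("org.sqlite.JDBC");
    }
}
```

A parser could call isSqliteDriverAvailable() once and report no supported types when it returns false, so that excluding the jar degrades gracefully instead of failing at parse time.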

Native libs are usually stored in the same jar (built for all supported 
platforms), so excluding {{sqlite-jdbc.jar}} prevents the sqlite native library 
from being loaded from it.

E.g. if I don't need, say, the netcdf parsers when invoking Tika, I can add a 
snippet like this to my {{pom.xml}}:

{code:xml}
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.6</version>
  <exclusions>
    <exclusion>
      <groupId>edu.ucar</groupId>
      <artifactId>netcdf</artifactId>
    </exclusion>
  </exclusions>
</dependency>
{code}

So a Tika library user doesn't need to rebuild tika-parsers and store it 
somewhere, and can use the prebuilt Tika release from Maven Central.

The same pattern can be used with other libs, splitting them into two buckets:
- libs with an Apache-compatible license, which can be included in the 
{{tika-parsers.jar}} artifact,
- libs with a license that prevents packaging them with Tika, plus documentation 
saying that the corresponding parsers/detectors become available if the user 
adds them to the classpath.

This approach is generic and not specific to libs with JNI. E.g. it allows 
someone to use a proprietary or copyleft (GNU GPL/LGPL) library if that is 
allowed from the legal side. I'm not a lawyer, so I don't know whether a 
compile-time dependency on some library with an Apache-incompatible license 
would infringe someone's copyright or not.

Disclaimer: I'm not a lawyer. My thoughts above aren't legal advice. I think 
legal advice from the ASF should be formally received before including even 
optional dependencies on Apache License incompatible third-party libs.



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-13 Thread Konstantin Gribov (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275182#comment-14275182
 ] 

Konstantin Gribov commented on TIKA-1511:
-

JNI can potentially cause issues in webapp containers/app servers and in 
environments with a security manager turned on. I think the docs should at 
least mention that Tika uses native libs and document how to exclude them.



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275112#comment-14275112
 ] 

Tim Allison commented on TIKA-1511:
---

Thank you for looking into that.  I like the bundling of native libs so that 
users shouldn't have to worry.  Do you see any potential problems from a 
technical standpoint with xerial's wrapper/jar?

[~gagravarr], [this|http://bitbucket.org/xerial/sqlite-jdbc] looks good to me.  
Do you still recommend checking with Legal?



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275361#comment-14275361
 ] 

Tim Allison commented on TIKA-1511:
---

Completely agree with this...that was the plan, esp for those that are 
explicitly not Apache.
{quote}
 (i. e. call Class.forName("org.sqlite.JDBC") and catch ClassNotFoundException)
{quote}

On ucar, got it, I'll follow that model (the excludes statements in app, 
server and bundle poms) for SQLite if we get a negative decision from LEGAL and 
for any other db drivers/native code that are explicitly not Apache.

Thank you, again!



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-13 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275485#comment-14275485
 ] 

Luis Filipe Nassif commented on TIKA-1511:
--

Another library option is https://code.google.com/p/sqlite4java/. It is not a 
JDBC driver, but it also depends on native libs.

Maybe a JDBC driver like xerial would be better, because we can be database 
independent and reuse the code for other formats (dbf, mdb...)?



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-13 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275508#comment-14275508
 ] 

Nick Burch commented on TIKA-1511:
--

If we're going to do a general jdbc option, maybe we'd be better off having an 
optional module that just wraps Apache ManifoldCF? ManifoldCF provides 
connectors / extractors for JDBC amongst others



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-12 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14274159#comment-14274159
 ] 

Nick Burch commented on TIKA-1511:
--

Just to be sure, since SQLite doesn't show up in the [Apache Legal FAQ 
list|http://www.apache.org/legal/resolved.html], it'd probably be worth raising 
a legal jira (link from [the legal 
page|http://www.apache.org/legal/resolved.html]) just to get confirmation that 
it's fine to use + clarify what (if any) notice entry is needed for it.



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-12 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14274135#comment-14274135
 ] 

Tim Allison commented on TIKA-1511:
---

Agreed on the license.

I'm able to create and write to a sqlite db with just the jar from maven:

{noformat}
<dependency>
  <groupId>org.xerial</groupId>
  <artifactId>sqlite-jdbc</artifactId>
  <version>3.8.7</version>
</dependency>
{noformat}

I don't think I have native libs kicking around my system somewhere, or do I? 

This will add another 4 MB to tika-app/tika-server, but I think that it is 
worth it...


 Create a parser for SQLite3
 ---

 Key: TIKA-1511
 URL: https://issues.apache.org/jira/browse/TIKA-1511
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.6
Reporter: Luis Filipe Nassif
Priority: Minor

 I think it would be very useful, as sqlite is used as data storage by a wide 
 range of applications. Opening the ticket to track it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-12 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14274612#comment-14274612
 ] 

Luis Filipe Nassif commented on TIKA-1511:
--

Yes, there are native libs for Windows, Mac and Linux packed into xerial's 
sqlite-jdbc-3.8.7.jar, but there are other wrappers if that is a problem. The 
license for xerial-jdbc is Apache v2.
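
One way to check whether a given jar carries native libraries is to list its entries and look for typical native-library file extensions. A minimal sketch follows; NativeLibScanner is a hypothetical helper, and the list of extensions is an assumption, not something taken from the xerial build.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical helper that lists likely native libraries inside a jar. */
final class NativeLibScanner {
    private NativeLibScanner() {}

    /** Heuristic: treat common shared-library extensions as native libs. */
    static boolean looksLikeNativeLib(String entryName) {
        String n = entryName.toLowerCase(Locale.ROOT);
        return n.endsWith(".so") || n.endsWith(".dll")
                || n.endsWith(".dylib") || n.endsWith(".jnilib");
    }

    /** Returns the names of all jar entries that look like native libraries. */
    static List<String> findNativeLibs(Path jar) throws IOException {
        List<String> hits = new ArrayList<>();
        try (ZipFile zip = new ZipFile(jar.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (!entry.isDirectory() && looksLikeNativeLib(entry.getName())) {
                    hits.add(entry.getName());
                }
            }
        }
        return hits;
    }
}
```

Running findNativeLibs against the sqlite-jdbc jar in the local Maven repository would show which platforms are bundled.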



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-12 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273869#comment-14273869
 ] 

Tim Allison commented on TIKA-1511:
---

See any licensing problems with bundling sqlite dependency?  It isn't Apache 
v2, but what we'd bundle isn't licensed at all 
([link|https://www.sqlite.org/copyright.html]).

I don't see a problem, but wanted to check to see if anyone has any issues. 

Thank you for opening this issue!



[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-12 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273905#comment-14273905
 ] 

Luis Filipe Nassif commented on TIKA-1511:
--

I don't see any problems either. I think public domain is more liberal than 
Apache v2, because the authors abdicated their copyright.

But sqlite needs native libs. Could it be a problem?
