[jira] [Comment Edited] (TIKA-1511) Create a parser for SQLite3
[ https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385836#comment-14385836 ]

Konstantin Gribov edited comment on TIKA-1511 at 3/29/15 4:05 PM:

The idea of better tika-parsers module separation was discussed some time ago; it's also mentioned in the Tika 2.0 roadmap (https://wiki.apache.org/tika/Tika2_0RoadMap). In that case, a user would get the appropriate {{tika-parsers-\*}} modules with their deps (e.g., via {{mvn dependency:copy}} or something similar), and Solr could depend only on {{tika-core}} and a minimal set of {{tika-parsers-\*}}. Or Solr could depend only on {{tika-core}}, but that will lead to the standard "why doesn't it work" questions, as with {{slf4j}} in solr4.

was (Author: grossws):
Idea of better tika-parsers module separation was dicussed some time ago, it's also mentioned in Tika 2.0 roadmap (https://wiki.apache.org/tika/Tika2_0RoadMap). In such case, user would get appropriate {{tika-parsers-*}} modules with their deps (e. g., via {{mvn dependency:copy}} or something similar) and Solr can depend only on {{tika-core}} and minimal {{tika-parsers-*}}. Or with dependency only on {{tika-core}} but it will lead to statndard questions like why it doesn't work as with {{slf4j}} in solr4.

Create a parser for SQLite3
---------------------------

Key: TIKA-1511
URL: https://issues.apache.org/jira/browse/TIKA-1511
Project: Tika
Issue Type: New Feature
Components: parser
Affects Versions: 1.6
Reporter: Luis Filipe Nassif
Fix For: 1.8
Attachments: TIKA-1511v1.patch, TIKA-1511v2.patch, TIKA-1511v3.patch, TIKA-1511v3bis.patch, testSQLLite3b.db, testSQLLite3b.db

I think it would be very useful, as sqlite is used as data storage by a wide range of applications. Opening the ticket to track it.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1511) Create a parser for SQLite3
[ https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320022#comment-14320022 ]

Konstantin Gribov edited comment on TIKA-1511 at 2/13/15 12:40 PM:

[~talli...@mitre.org], you can also make it {{<optional>true</optional>}} instead of {{provided}}. Also, I can't find the parser itself ({{org.apache.tika.parser.jdbc.SQLite3Parser}}) in trunk rev 1659449.

was (Author: grossws):
[~talli...@mitre.org], you can also make it {{<optional>true</optional>}} instead of {{provided}}. Also, I can't find parser itself ({{org.apache.tika.parser.jdbc.SQLite3Parser}})in trunk rev 1659449.
[jira] [Comment Edited] (TIKA-1511) Create a parser for SQLite3
[ https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14298670#comment-14298670 ]

Tim Allison edited comment on TIKA-1511 at 1/30/15 2:25 PM:

Thank you, Nick, for reviewing this! I'll fix the wildcards (not sure how those crept in) and the assertContains...

I'm not happy with the added complexity of the JDBCInputStream. Bottom line: should we get rid of that option and back off to a zero-byte InputStream and grabbing the table object from the OpenContainer? That would simplify quite a bit, including detection... And it would make this parser behave like the PST parser...I think. If we really want to add it later, we can, but simpler is better... [~lfcnassif], would you be ok with that proposal?

As for another jdbc-based format, I completely agree. Can you recommend another single-file db format? Access comes to mind, but I can't find a pure Java parser that has jdbc: Jackcess (LGPL) has its own api and doesn't support jdbc. I looked briefly at derby, hsqldb, mysql, and they all seem to rely on a directory of files...I very well could have missed a single-file option for those, though... Maybe h2 (MPL and EPL [licenses|http://www.h2database.com/html/license.html])?

was (Author: talli...@mitre.org):
Thank you, Nick, for reviewing this! I'll fix the wildcards...not sure how those crept in and the assertContains... I'm not happy with the added complexity of the JDBCInputStream. Bottom line: should we get rid of that option and back off to a zero-byte InputStream and grabbing the table object from the OpenContainer? That would simplify quite a bit, including detection... And, it would make this parser behave like the PST parser...I think. If we really want to add it later, we can, but simpler is better... [~lfcnassif], would you be ok with that proposal? As for another jdbc-based format, I completely agree. Can you recommend another single-file db format? Access comes to mind, but I can't find a pure Java parser that has jdbc: Jackcess (LGPL) has its own api and doesn't support jdbc. I looked briefly at derby, hsqldb, mysql, and they all seem to rely on a directory of files...I very well could have missed a single file option for those, though...
[jira] [Comment Edited] (TIKA-1511) Create a parser for SQLite3
[ https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14291829#comment-14291829 ]

Tim Allison edited comment on TIKA-1511 at 1/26/15 1:52 PM:

The RecursiveParserWrapper should allow that, no? With the caveat that it caches all output in memory...

was (Author: talli...@mitre.org):
The RecursiveParserWrapper should allow, that, no?
[jira] [Comment Edited] (TIKA-1511) Create a parser for SQLite3
[ https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14291829#comment-14291829 ]

Tim Allison edited comment on TIKA-1511 at 1/26/15 2:36 PM:

The RecursiveParserWrapper should allow that, no? With the caveat that it caches all output in memory...

You should be able to parse the standard recursive XHTML output as well. Right?

If you have a chance (and if you haven't done so already), fork branch 1511 from my github site and take a look at the output of the test cases...throw in some print statements and see if that'll work. For testRecursiveParserWrapper(), change BasicContentHandlerFactory.HANDLER_TYPE.BODY to BasicContentHandlerFactory.HANDLER_TYPE.XML.

was (Author: talli...@mitre.org):
The RecursiveParserWrapper should allow that, no? With the caveat that it caches all output in memory...
[jira] [Comment Edited] (TIKA-1511) Create a parser for SQLite3
[ https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285621#comment-14285621 ]

Luis Filipe Nassif edited comment on TIKA-1511 at 1/21/15 3:14 PM:

No problems, the design looks good!

was (Author: lfcnassif):
No problems, the desing looks good!
[jira] [Comment Edited] (TIKA-1511) Create a parser for SQLite3
[ https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14281086#comment-14281086 ]

Luis Filipe Nassif edited comment on TIKA-1511 at 1/19/15 12:01 PM:

If the inputStream (pseudoInputStream) received by EmbeddedDocExtractor cannot be read, I think using EDE is not useful. How will this approach work with the TikaCli --extract option? My original idea was to support a use case of extracting each table to one file...

Now I think this extraction of tables to files can be done by handling the db as one big doc and using a ContentHandlerDecorator that will split the xhtml output at table boundaries. Each xhtml segment can be converted to a byte[] (if small) and then to a ByteArrayInputStream that can be handled by an EmbeddedDocExtractor, if one is set in the parseContext. If none is set, the ContentHandlerDecorator does not need to split the xhtml output and can fall back to the default behavior. Then a custom EDE can extract tables to files if desired.

So now I think the big doc approach is not bad. What do you think?

was (Author: lfcnassif):
If the inputStream (pseudoInputStream) received by EmbeddedDocExtractor can not be read, I think using EDE is not useful. How will this approach work with TikaCli --extract option? My original idea was to support an use case like TikaCli --extract... Now I think this extraction of tables to files can be done handling the db as one big doc and using a ContentHandlerDecorator that will split the xhtml output at table boundaries. Each xhtml segment can be converted to a byte[] (if small) and then to a ByteArrayInputStream that can be handled by an EmbeddedDocDecorator, if setted into parseContext. If not setted the ContentHandlerDecorator do not need to split tables and can fallback to default behavior. A custom EDE can then extract tables to files if desired. So now I think we could go with the big doc approah. What do you think?
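The "split the xhtml output at table boundaries" idea above could look roughly like the following. This is only a sketch under assumptions: it uses the JDK's own SAX DefaultHandler instead of Tika's ContentHandlerDecorator, it collects the byte[] segments in a list where a real version would hand each one (wrapped in a ByteArrayInputStream) to an embedded-document extractor, and the class name TableSplittingHandler is made up for illustration. Attributes are dropped to keep it short.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: buffers each top-level <table> element as its own byte[] segment. */
class TableSplittingHandler extends DefaultHandler {
    // Stand-in for "pass each segment to an embedded-document extractor".
    private final List<byte[]> tables = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a <table>
    private int depth;             // nesting depth of <table> elements

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("table".equals(qName) && depth++ == 0) {
            current = new StringBuilder();
        }
        if (current != null) {
            current.append('<').append(qName).append('>'); // attributes are ignored in this sketch
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(escape(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null) {
            current.append("</").append(qName).append('>');
        }
        if ("table".equals(qName) && --depth == 0) {
            tables.add(current.toString().getBytes(StandardCharsets.UTF_8));
            current = null;
        }
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Parses an XHTML string and returns one UTF-8 segment per top-level table. */
    static List<byte[]> split(String xhtml) {
        try {
            TableSplittingHandler handler = new TableSplittingHandler();
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.tables;
        } catch (Exception e) {
            throw new IllegalStateException("could not split xhtml", e);
        }
    }
}
```

Buffering whole tables in memory is only reasonable for small tables, which matches the "(if small)" caveat in the comment; a large table would need a streaming hand-off instead.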
[jira] [Comment Edited] (TIKA-1511) Create a parser for SQLite3
[ https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14281086#comment-14281086 ]

Luis Filipe Nassif edited comment on TIKA-1511 at 1/18/15 2:09 PM:

If the inputStream (pseudoInputStream) received by EmbeddedDocExtractor cannot be read, I think using EDE is not useful. How will this approach work with the TikaCli --extract option? My original idea was to support a use case like TikaCli --extract...

Now I think this extraction of tables to files can be done by handling the db as one big doc and using a ContentHandlerDecorator that will split the xhtml output at table boundaries. Each xhtml segment can be converted to a byte[] (if small) and then to a ByteArrayInputStream that can be handled by an EmbeddedDocDecorator, if one is set in the parseContext. If none is set, the ContentHandlerDecorator does not need to split tables and can fall back to the default behavior. A custom EDE can then extract tables to files if desired.

So now I think we could go with the big doc approach. What do you think?

was (Author: lfcnassif):
If the inputStream (pseudoInputStream) received by EmbeddedDocExtractor can not be read, I think using EDE is not useful. How will this approach work with TikaCli --extract option? My original idea was to support an use case like TikaCli --extract... Now I think this extraction of tables to files can be done handling the db as one big doc and using a ContentHandlerDecorator that will split the xhtml output at table bondaries. Each xhtml segment can be converted to a byte[] (if small) and then to a ByteArrayInputStream that can be passed to a EmbeddedDocDecorator, if set on parseContext. If not set the ContentHandlerDecorator do not need to split tables and can fallBack to default behavior. A custom EDE can then extract tables to files if desired. So now I think we could go with the big doc approah. What do you think?
[jira] [Comment Edited] (TIKA-1511) Create a parser for SQLite3
[ https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14280345#comment-14280345 ]

Tim Allison edited comment on TIKA-1511 at 1/16/15 3:08 PM:

First draft of patch attached. Need to build out tests, obviously, and I'll fix spelling of SQLLite in the class names! :)

For the design, I created a public parser that creates a new *DBParser for each call to parse (like many other parsers) to avoid thread safety issues. The *DBParser, in turn, calls the EmbeddedDocumentExtractor for each table, and it specifies, via a special mime type, which *TableParser will be called. The *TableParser ignores the empty InputStream and grabs the StatementTablePair from the ParseContext to parse each table.

Also, as part of the design, the EmbeddedDocumentExtractor is called for each BLOB and each CLOB.

The jdbc wrapper around sqlite is not able to read CLOBs (apparently?), although I could write them without exception (doesn't mean they were actually written), and it does some other stuff that is not standard JDBC, but that is all handled in SQLiteTableParser, a subclass of AbstractTableParser.

Any and all feedback is welcomed. This is still drafty.

was (Author: talli...@mitre.org):
First draft of patch attached. Need to build out tests, obviously, and I'll fix spelling of SQLLite in the class names! :) For the design, I had to create a public parser that called a new *DBParser class for each call to parse (like many other parsers) to avoid thread safety issues. The *DBParser, in turn, calls the EmbeddedDocumentParser for each table, and it specifies via special mime-type, which *TableParser will be called. The *TableParser ignores the InputStream, and grabs the StatementTablePair from the ParseContext to parse each table. The jdbc wrapper around sqlite is not able to read CLOBs (apparently?), although I could write them without exception (doesn't mean they were actually written), and it does some other stuff that is not standard JDBC, but that is all handled in SQLiteTableParser, a subclass of AbstractTableParser. Any and all feedback is welcomed. This is still drafty.
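The hand-off described in the design (a fresh worker per parse call, an empty InputStream, and the table reference passed through the context) might be sketched as below. Nothing here is taken from the patch: SketchContext is a stand-in for Tika's ParseContext, TableHandle stands in for the StatementTablePair, and all class names are hypothetical. It uses only the JDK so that it runs on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a parse context: a map keyed by class. */
class SketchContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    <T> T get(Class<T> key) {
        return key.cast(entries.get(key));
    }
}

/** Hypothetical handle naming the table to read (the draft uses a StatementTablePair). */
class TableHandle {
    final String tableName;

    TableHandle(String tableName) {
        this.tableName = tableName;
    }
}

/** Ignores the (empty) stream and reads what it needs from the context. */
class SketchTableParser {
    String parse(InputStream ignored, SketchContext context) {
        TableHandle handle = context.get(TableHandle.class);
        return "parsed table " + handle.tableName;
    }
}

/** Public, stateless entry point: all per-call state lives in a fresh Worker. */
class SketchDbParser {
    List<String> parse(List<String> tableNames) {
        return new Worker().run(tableNames);
    }

    private static final class Worker {
        private final List<String> out = new ArrayList<>();

        List<String> run(List<String> tableNames) {
            SketchTableParser tableParser = new SketchTableParser();
            for (String name : tableNames) {
                SketchContext context = new SketchContext();
                context.set(TableHandle.class, new TableHandle(name));
                // The stream carries no bytes; the real input is the handle in the context.
                out.add(tableParser.parse(new ByteArrayInputStream(new byte[0]), context));
            }
            return out;
        }
    }
}
```

Because SketchDbParser holds no fields, one instance can be shared across threads; each call gets its own Worker and its own output list.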
[jira] [Comment Edited] (TIKA-1511) Create a parser for SQLite3
[ https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275534#comment-14275534 ]

Tim Allison edited comment on TIKA-1511 at 1/13/15 5:03 PM:

Looks like we're cleared via LEGAL-215 for xerial or anything else that wraps sqlite. We can add some language about the underlying sqlite non-license and we should be good to go.

I think my preference for now would be to have an abstract base class (with at least these abstract methods: getConnection(), getTableNames(), addMetadata(Connection connection, Metadata metadata), close(Connection connection) that we can extend for each db parser. The abstract class would implement the select * from eachtable. This plan would only work for jdbc-compliant dependencies that can return a Connection. It appears that this plan would work for xerial but not for sqlite4java...that said, I'm not above writing a separate parser for db-specific calls as with sqlite4java's SQLiteConnection. :)

I defer to the community on whether to go this route, the ManifoldCF route or another.

As [~grossws] recommended, we can build the parsers and then do a check for whether or not the drivers are available. The user would be responsible for adding any non Apache licensable jars to their classpath.

was (Author: talli...@mitre.org):
Looks like we're cleared via LEGAL-215 for xerial. We can add some language about the underlying sqlite non-license and we should be good to go. I think my preference for now would be to have an abstract base class (with at least these abstract methods: getConnection(), getTableNames(), addMetadata(Connection connection, Metadata metadata), close(Connection connection) that we can extend for each db parser. But I defer to the community. As [~grossws] recommended, we can build the parsers and then do a check for whether or not the drivers are available. The user would be responsible for adding any non Apache licensable jars to their classpath.
[jira] [Comment Edited] (TIKA-1511) Create a parser for SQLite3
[ https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275534#comment-14275534 ]

Tim Allison edited comment on TIKA-1511 at 1/13/15 5:04 PM:

Looks like we're cleared via LEGAL-215 for xerial or anything else that wraps sqlite. We can add some language about the underlying sqlite non-license and we should be good to go.

I think my preference for now would be to have an abstract base class (with at least these abstract methods: getConnection(), getTableNames(), addMetadata(Connection connection, Metadata metadata), close(Connection connection)) that we can extend for each db parser. The abstract class would implement the select * from eachtable. This plan would only work for jdbc-compliant-ish dependencies that can return a Connection. It appears that this plan would work for xerial but not for sqlite4java...that said, I'm not above writing a separate parser for db-specific calls as with sqlite4java's SQLiteConnection. :)

I defer to the community on whether to go this route, the ManifoldCF route or another.

As [~grossws] recommended, we can build the parsers and then do a check for whether or not the drivers are available. The user would be responsible for adding any non-Apache-licensable jars to their classpath.

was (Author: talli...@mitre.org):
Looks like we're cleared via LEGAL-215 for xerial or anything else that wraps sqlite. We can add some language about the underlying sqlite non-license and we should be good to go. I think my preference for now would be to have an abstract base class (with at least these abstract methods: getConnection(), getTableNames(), addMetadata(Connection connection, Metadata metadata), close(Connection connection) that we can extend for each db parser. The abstract class would implement the select * from eachtable. This plan would only work for jdbc-compliant dependencies that can return a Connection. It appears that this plan would work for xerial but not for sqlite4java...that said, I'm not above writing a separate parser for db-specific calls as with sqlite4java's SQLiteConnection. :) I defer to the community on whether to go this route, the ManifoldCF route or another. As [~grossws] recommended, we can build the parsers and then do a check for whether or not the drivers are available. The user would be responsible for adding any non Apache licensable jars to their classpath.
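A rough sketch of the abstract base class proposed above, assuming plain java.sql and nothing else. The four abstract methods follow the names listed in the comment; everything else is an assumption for illustration: the class name AbstractJdbcDump, the use of a Map<String, String> in place of Tika's Metadata, the selectAll() helper, and returning rows as lists of strings instead of writing XHTML.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical base class: subclasses supply the db-specific parts. */
abstract class AbstractJdbcDump {

    protected abstract Connection getConnection() throws SQLException;

    protected abstract List<String> getTableNames(Connection connection) throws SQLException;

    // Map<String, String> is a stand-in for Tika's Metadata class.
    protected abstract void addMetadata(Connection connection, Map<String, String> metadata)
            throws SQLException;

    protected abstract void close(Connection connection) throws SQLException;

    /** Builds "select * from <table>" with the table name quoted as an SQL identifier. */
    static String selectAll(String tableName) {
        return "SELECT * FROM \"" + tableName.replace("\"", "\"\"") + "\"";
    }

    /** Shared logic: run select * against every table and collect the rows as strings. */
    List<List<String>> dump(Map<String, String> metadata) throws SQLException {
        Connection connection = getConnection();
        try {
            addMetadata(connection, metadata);
            List<List<String>> rows = new ArrayList<>();
            for (String table : getTableNames(connection)) {
                try (Statement statement = connection.createStatement();
                     ResultSet results = statement.executeQuery(selectAll(table))) {
                    int columns = results.getMetaData().getColumnCount();
                    while (results.next()) {
                        List<String> row = new ArrayList<>();
                        for (int i = 1; i <= columns; i++) {
                            row.add(results.getString(i)); // JDBC columns are 1-based
                        }
                        rows.add(row);
                    }
                }
            }
            return rows;
        } finally {
            close(connection);
        }
    }
}
```

A driver-specific subclass would only need to open the Connection and list the tables; a library that cannot return a java.sql.Connection (as noted for sqlite4java) would not fit this shape.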
[jira] [Comment Edited] (TIKA-1511) Create a parser for SQLite3
[ https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275217#comment-14275217 ]

Tim Allison edited comment on TIKA-1511 at 1/13/15 1:59 PM:

Thank you, [~grossws]! Two questions:

1) On how to exclude the native libs...is it ok to require that people re-bundle, that is, just get rid of the dependency in the pom and build from scratch? Is there a cleaner method?

2) Would it be better to require users who want SQLLite3 parsing to add xerial to their classpath? We'll probably need to do this for formats that don't have Apache-friendly drivers (afaik: .mdb, .dbf, others?)

was (Author: talli...@mitre.org):
Thank you, [~grossws]! Two questions: 1) On how to exclude the native libs...is it ok to require that people re-bundle, that is just get rid of the dependency in the pom and build from scratch? Is there a cleaner method? 2) Would it be better to require users who want SQLLite3 parsing to add xerial to their classpath?
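If users are asked to add the driver to their own classpath (question 2 above), the parser would need some way to tell whether it is there. One possible check, as a minimal sketch: the class name DriverCheck is hypothetical, and the driver class name is passed in by the caller since this thread does not name one.

```java
/** Hypothetical helper: reports whether a JDBC driver class is on the classpath. */
class DriverCheck {

    static boolean isAvailable(String driverClassName) {
        try {
            // 'false' avoids running the driver's static initializer just to probe for it.
            Class.forName(driverClassName, false, DriverCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }
}
```

A parser could call this once and skip the format (or report an unsupported type) when the driver is missing, instead of failing with a ClassNotFoundException in the middle of a parse.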