[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15302653#comment-15302653 ] Hudson commented on TIKA-1513: -- UNSTABLE: Integrated in tika-2.x #103 (See [https://builds.apache.org/job/tika-2.x/103/]) TIKA-1513 -- update mime type according to Nick Burch's recommendation, (tallison: rev 15ec358c44867adc44ab0431960d565b3d8a3e2c) * tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFParser.java * tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFReader.java * tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/dbf/DBFParserTest.java > Add mime detection and parsing for dbf files > > > Key: TIKA-1513 > URL: https://issues.apache.org/jira/browse/TIKA-1513 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Minor > Fix For: 2.0, 1.14 > > > I just came across an Apache licensed dbf parser that is available on > [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. > Let's add dbf parsing to Tika. > Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15302219#comment-15302219 ] Hudson commented on TIKA-1513: -- FAILURE: Integrated in tika-2.x-windows #7 (See [https://builds.apache.org/job/tika-2.x-windows/7/]) TIKA-1513 -- update mime type according to Nick Burch's recommendation, (tallison: rev 15ec358c44867adc44ab0431960d565b3d8a3e2c) * tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFReader.java * tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFParser.java * tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/dbf/DBFParserTest.java
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15302183#comment-15302183 ] Hudson commented on TIKA-1513: -- SUCCESS: Integrated in tika-trunk-jdk1.7 #1001 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/1001/]) TIKA-1513 -- update mime type according to Nick Burch's recommendation, (tallison: rev dcaeccbab69519811e0cdf349873ce2b51e6ca10) * tika-parsers/src/test/java/org/apache/tika/parser/dbf/DBFParserTest.java * tika-parsers/src/main/java/org/apache/tika/parser/dbf/DBFReader.java * tika-parsers/src/main/java/org/apache/tika/parser/dbf/DBFParser.java
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15302076#comment-15302076 ] Tim Allison commented on TIKA-1513: --- [~iryndin], would you mind if we added your test files (tir_im.dbf, gds_im.dbf, texto*) to our unit tests?
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15300737#comment-15300737 ] Nick Burch commented on TIKA-1513: -- I haven't read much on the format, but I'd be tempted to have that more like `application/x-dbf; vendor=FoxBASE; type=plus_with_memo`. Or, to keep it more in line with the BDB / PE / DITA types, maybe `application/x-dbf; format=FoxBASE; type=plus_with_memo` or `application/x-dbf; format=plus_with_memo; vendor=FoxBASE` (depending on what the actual variances are).
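Editorial sketch (not from the thread): one way the parameterized type suggested above could be derived from the DBF version byte. The class `DbfVariantMimeSketch`, the method `mimeFor`, and the parameter value used for Visual FoxPro are hypothetical; only the 0x83 string is copied from the comment, and the byte meanings (0x03 plain, 0x30 Visual FoxPro, 0x83 FoxBASE+ with memo) are assumed from commonly circulated DBF format descriptions. It is not the mapping Tika actually committed.

```java
import java.util.HashMap;
import java.util.Map;

class DbfVariantMimeSketch {
    static final String BASE = "application/x-dbf";

    // Version byte -> parameter suffix. Values are illustrative assumptions.
    private static final Map<Integer, String> SUFFIX = new HashMap<>();
    static {
        SUFFIX.put(0x03, "");                                       // plain table: base type only
        SUFFIX.put(0x83, "; vendor=FoxBASE; type=plus_with_memo");  // string taken from the comment above
        SUFFIX.put(0x30, "; vendor=Visual_FoxPro");                 // hypothetical parameter value
    }

    /** Returns the base type for unknown version bytes rather than guessing a variant. */
    static String mimeFor(int versionByte) {
        String suffix = SUFFIX.get(versionByte & 0xFF);
        return suffix == null ? BASE : BASE + suffix;
    }

    public static void main(String[] args) {
        System.out.println(mimeFor(0x83));
        System.out.println(mimeFor(0x7F));
    }
}
```

Falling back to the bare `application/x-dbf` keeps detection useful for variants nobody has sample files for yet.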
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15300460#comment-15300460 ] Hudson commented on TIKA-1513: -- FAILURE: Integrated in tika-2.x-windows #6 (See [https://builds.apache.org/job/tika-2.x-windows/6/]) TIKA-1513: add mime detection and parser for DBF files. Thanks to Nick (tallison: rev 8d24e07fb1245de0e151e9ce3fd516651db1d989) * tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFFileHeader.java * tika-parser-bundles/tika-parser-office-bundle/src/test/java/org/apache/tika/module/office/BundleIT.java * tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/dbf/DBFParserTest.java * tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml * CHANGES.txt * tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFRow.java * tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFReader.java * tika-parser-modules/tika-parser-office-module/src/main/resources/META-INF/services/org.apache.tika.parser.Parser * tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFCell.java * tika-test-resources/src/test/resources/test-documents/testDBF_gb18030.dbf * tika-test-resources/src/test/resources/test-documents/testDBF.dbf * tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFParser.java * tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFColumnHeader.java
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15300425#comment-15300425 ] Hudson commented on TIKA-1513: -- SUCCESS: Integrated in tika-trunk-jdk1.7 #999 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/999/]) TIKA-1513 -- add mime detection and parsing for dbf files. Thanks to (tallison: rev e74f66375f20d914f8585597b6d9492586a0caa9) * tika-parsers/src/test/resources/test-documents/testDBF_gb18030.dbf * tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml * tika-parsers/src/main/java/org/apache/tika/parser/dbf/DBFCell.java * tika-parsers/src/test/java/org/apache/tika/parser/dbf/DBFParserTest.java * tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser * tika-parsers/src/main/java/org/apache/tika/parser/dbf/DBFFileHeader.java * tika-parsers/src/main/java/org/apache/tika/parser/dbf/DBFColumnHeader.java * tika-parsers/src/main/java/org/apache/tika/parser/dbf/DBFRow.java * tika-parsers/src/test/resources/test-documents/testDBF.dbf * tika-parsers/src/main/java/org/apache/tika/parser/dbf/DBFReader.java * tika-parsers/src/main/java/org/apache/tika/parser/dbf/DBFParser.java TIKA-1513 -- add mime detection and parsing for dbf files. Thanks to (tallison: rev cb492f4b16ccdd0c0d8129f215e75a14f294cc89) * CHANGES.txt
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15299884#comment-15299884 ] Tim Allison commented on TIKA-1513: --- [~nicholasc], do you, by chance, have any shareable examples of files that don't start with 0x03, e.g. Visual FoxPro, dBase IV, etc.? Any shareable examples of .dbt (memo) files? Thank you again for the mime-detection regex! How do we want to handle detecting the variants?
Option 1: replicate the above regex for each variant and change the first byte, with parent mime-type "application/x-dbf"?
Option 2: send them all to the DBFParser, and that will update the mime type.
How do we want to represent the variants via the mime, e.g. 0x30 Visual FoxPro: "application/x-dbf; Visual FoxPro"?
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15299033#comment-15299033 ] Tim Allison commented on TIKA-1513: --- Rolled our own parser. Will commit tomorrow.
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280148#comment-15280148 ] Tim Allison commented on TIKA-1513: --- [~iryndin], now that 1.13 is in the voting process, I'd like to re-engage on this issue for 1.14. Would you be willing to make the updates that [~nicholasc] recommended and push to maven central? Or, again, as a far less preferable option, would you object to us incorporating your code within Tika?
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15258147#comment-15258147 ] Tim Allison commented on TIKA-1513: --- Great. Frankly, the initial regex looked quite good...small handful of false positives. I look forward to running this on our corpus once 1.13 is released. Once we get feedback from [~iryndin] on the parser, it'll be great to add detection and parsing in one go. Thank you, again.
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15256870#comment-15256870 ] Nick C commented on TIKA-1513: -- Tested more files using the full regex and haven't had any false positives. :D
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249129#comment-15249129 ] Nick C commented on TIKA-1513: -- Sounds good. I'll be running this on more files this week and will report back if I notice any false positives. If you want, you can make the field type check stricter, which might prevent other false positives (replace \[A-Z\@\+\] with \[BCDFGILMNOPQTVWXY\@\+\]).
||Details||Regex
|Enable dotall mode (so dots match new lines)|(?s)
|Signature/Version|^\[\x02\x03\x30\x31\x32\x43\x63\x83\x8B\xCB\xF5\xE5\xFB]
|Year (no check)|.
|Month (1-12)|\[\x01-\x0C]
|Day (1-31)|\[\x01-\x1F]
|Record count (uint32, no check)|.\{4}
|Header length (ushort) of at least 65|(.\[^\x00]\|\[\x41-\xFF].)
|Record length (ushort) greater than 1|(\[\^\x00\x01].\|.\[^\x00])
|Skip to first field header|.\{31}
|Make sure field name is null terminated (regex zero-width lookbehind)|(?<=\[\x00]\[^\x00]\{0,10})
|Field type|\[BCDFGILMNOPQTVWXY@+]
Full Regex
{code}
(?s)^[\x02\x03\x30\x31\x32\x43\x63\x83\x8B\xCB\xF5\xE5\xFB].[\x01-\x0C][\x01-\x1F].{4}(?:.[^\x00]|[\x41-\xFF].)(?:[^\x00\x01].|.[^\x00]).{31}(?<=[\x00][^\x00]{0,10})[BCDFGILMNOPQTVWXY@+]
{code}
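Editorial sketch (not from the thread): the full regex above can be exercised directly with `java.util.regex`. The example assumes the file prefix is decoded as ISO-8859-1 so that each byte maps to exactly one char, which is what makes `\x8B`-style escapes line up with raw bytes. The class `DbfRegexSketch`, its methods, and the 65-byte synthetic header are hypothetical test scaffolding; this is not how Tika's mime magic evaluates the pattern internally.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

class DbfRegexSketch {
    // The regex from the comment above, split by header field for readability.
    static final Pattern DBF_HEADER = Pattern.compile(
        "(?s)^[\\x02\\x03\\x30\\x31\\x32\\x43\\x63\\x83\\x8B\\xCB\\xF5\\xE5\\xFB]" // signature/version
        + "."                                   // year (no check)
        + "[\\x01-\\x0C][\\x01-\\x1F]"          // month, day
        + ".{4}"                                // record count (no check)
        + "(?:.[^\\x00]|[\\x41-\\xFF].)"        // header length, little-endian
        + "(?:[^\\x00\\x01].|.[^\\x00])"        // record length, little-endian
        + ".{31}"                               // skip to the first field's type byte
        + "(?<=[\\x00][^\\x00]{0,10})"          // field name must contain a null byte
        + "[BCDFGILMNOPQTVWXY@+]");             // field type

    static boolean looksLikeDbf(byte[] prefix) {
        // ISO-8859-1 maps bytes 0x00-0xFF to chars U+0000-U+00FF one to one.
        String s = new String(prefix, StandardCharsets.ISO_8859_1);
        return DBF_HEADER.matcher(s).find();
    }

    /** Smallest plausible header: 32-byte file header, one 32-byte field, terminator. */
    static byte[] syntheticHeader() {
        byte[] b = new byte[65];
        b[0] = 0x03;               // version
        b[1] = 116;                // year byte
        b[2] = 5;                  // month
        b[3] = 24;                 // day
        b[4] = 1;                  // record count = 1
        b[8] = 65;                 // header length = 65
        b[10] = 11;                // record length = 11
        b[32] = 'I'; b[33] = 'D';  // field name "ID", null padded to 11 bytes
        b[43] = 'C';               // field type: character
        b[48] = 10;                // field length
        b[64] = 0x0D;              // header terminator
        return b;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeDbf(syntheticHeader()));
    }
}
```

A text file that starts with "2" and three newlines passes the signature, month, and day checks but contains no null byte, so the lookbehind rejects it.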
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248493#comment-15248493 ] Tim Allison commented on TIKA-1513: --- I won't commit this until we get our corpus results back...perhaps I'll redo the run with this if there's time. Coincidentally, on this [comparison|http://162.242.228.174/mimes/mime_comparisons.html], it looks like DROID is identifying ~3k files in our corpus as some version of dbase. In your spare time, if you could document that work of art, that'd be handy. Thank you, again.
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248463#comment-15248463 ] Nick C commented on TIKA-1513: -- I was running this on more data and ran into a text file that matched. It started with a 2 (\\x32) followed by 3 newlines. I had to make a small change that checks for a null byte before the field type (field names are null terminated). {code:xml} {code}
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15247579#comment-15247579 ] Tim Allison commented on TIKA-1513: --- I'll add this before running the final (?) 1.13 regression tests and see what happens. Thank you!
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245081#comment-15245081 ] Nick C commented on TIKA-1513: -- Did some more testing and simplified the rules enough that they could be made into a regex. It's not pretty, but it works. It checks the signature/version, month (1-12), day (1-31), header length >= 65, record length > 1, and the first field's type (could be stricter).
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240982#comment-15240982 ] Tim Allison commented on TIKA-1513: --- Nope. Didn't remove them. There are roughly 3k files that ended with dbf or dbase3 in govdocs1 and an earlier version of our slice of commoncrawl. The files may not actually be dbfs, and they're likely truncated (at least those that came from commoncrawl). Give [this|http://162.242.228.174/share/dbfs.tar.bz2] a shot. Thank you, Rackspace!
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239992#comment-15239992 ] Tim Allison commented on TIKA-1513: ---
bq. At least 200. I would like more to test with though.
I think I rm'd the bz2 I shared with Ivan up above. I'll see what I can dig up.
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239960#comment-15239960 ] Nick C commented on TIKA-1513: --
bq. Well, you know there's still plenty of time to get that into Tika 2.0
Maybe I'll add that to my to-do list. I have been wanting to work on improving the RTF parser to handle tables/html and generate valid xhtml (multiple lists seem to cause issues).
bq. Ballpark, how many dbfs do you have to dev with? Do you want some from our test corpus?
At least 200. I would like more to test with, though.
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239936#comment-15239936 ] Tim Allison commented on TIKA-1513: ---
bq. It'd be nice if Tika's mime definition allowed for more complex matching like the Linux magic db.
Well, you know there's still plenty of time to get that into Tika 2.0. :)
bq. I'll do some testing to see how far in the code the false positives I had stop matching and determine if I can make it simple enough to be a mime definition
Great. Thank you, again. Ballpark, how many dbfs do you have to dev with? Do you want some from our test corpus?
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239836#comment-15239836 ] Nick C commented on TIKA-1513: -- I added the license header. I think some of the checks could be removed. I'll do some testing to see how far into the code the false positives I had stop matching, and determine whether I can make it simple enough to be a mime definition. It'd be nice if Tika's mime definition allowed for more complex matching like the Linux magic db. I also don't mind forking it into Tika or hosting it. A lot of the classes seem to be unused in jdbf v3, so it could be slimmed down to just a couple.
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239067#comment-15239067 ] Tim Allison commented on TIKA-1513: --- Is there any interest in forking jdbf into Tika? Or, [~nicholasc], do you have any interest in hosting it and pushing it to Maven? I'd far, far prefer to update [~iryndin]'s code in place and avoid forking if possible.
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239062#comment-15239062 ] Tim Allison commented on TIKA-1513: --- [~gagravarr], would you mind taking a look at the detector? Is there a way that we can convert this to a mime definition? Or should we add a DBFDetector? [~nicholasc], it looks great to me. I agree that we'll probably want to relax some of the length checks (just make sure they're > 0 or something reasonable)...we wouldn't want this to fail on truncated dbfs, and as you've pointed out, there can be extra bytes at the end of the file. If there's any way to avoid adding the dependency, that'd be great...although I very much appreciate the concern for overflow! In your experience, do we need to validate the field entry, or can we stop sooner? If we do, then I suspect there's no way to convert to a mime definition, but I suspect much of the earlier stuff could easily be translated. Thank you!
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15236171#comment-15236171 ] Nick C commented on TIKA-1513: --

Some of my checks may be a little strict, because you can have extra bytes at the end of the file and after the field headers (I haven't personally seen any files like that, though). I figure that in those cases the file extension glob will hopefully match. I put in some TODOs that can be changed to call jdbf for validating the DBF file type and field type. Feel free to do what you want with the code: https://gist.github.com/fxfixer/e54f86095a548cbfb8aeb948ff77a41b

I used the jdbf v3 branch, and here are the bugs I noticed. If [~iryndin] is interested I'll create a pull request.

Calls to input.read(byte[]…) should use IOUtils.readFully. (Sometimes, if the dbf is in a zip file, the read call returns fewer bytes than requested.)

DBFMetadataReader.readHeader()
- Needs to call IOUtils.readFully when reading headerBytes
- NPE if DbfFileTypeEnum.fromInt returns null (maybe throw an unsupported exception?)
- Reads record count as int instead of unsigned int

DBFRecordIterator
- Unnecessary call to Arrays.fill to set byte[] bytes to 0 (not really a bug)
- Needs to call IOUtils.readFully when reading recordBuffer

Some encoding names are not correct in CharsetHelper.getCharsetByByte:
936 = cp936 // Chinese (PRC, Singapore) Windows
932 = cp932 // Japanese Windows
1255 = Windows-1255 // Hebrew Windows
1256 = Windows-1256 // Arabic Windows
1250 = Windows-1250 // Eastern European Windows
1251 = Windows-1251 // Russian Windows
1254 = Windows-1254 // Turkish Windows
1253 = Windows-1253 // Greek Windows

> Add mime detection and parsing for dbf files > > > Key: TIKA-1513 > URL: https://issues.apache.org/jira/browse/TIKA-1513 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Minor > Fix For: 1.13 > > > I just came across an Apache licensed dbf parser that is available on > [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. > Let's add dbf parsing to Tika. > Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
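Two of the issues in the list above (short reads from input.read and the record count being read as a signed int) can be illustrated with a minimal sketch. This is not the jdbf code and not the proposed pull request; the class and method names (DbfReadSketch, readFully, readUnsignedIntLE, trickle) are hypothetical, and the loop only stands in for what a helper such as IOUtils.readFully does.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper class, for illustration only.
final class DbfReadSketch {

    /** Fills buf completely, looping over short reads; fails if the stream ends first. */
    static void readFully(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) {
                throw new EOFException("stream ended after " + off + " of " + buf.length + " bytes");
            }
            off += n;
        }
    }

    /** Interprets four little-endian bytes as an unsigned 32-bit value, widened to long. */
    static long readUnsignedIntLE(byte[] b, int off) {
        return (b[off] & 0xFFL)
                | ((b[off + 1] & 0xFFL) << 8)
                | ((b[off + 2] & 0xFFL) << 16)
                | ((b[off + 3] & 0xFFL) << 24);
    }

    /** Test stream that hands back at most one byte per read call, mimicking a short read. */
    static InputStream trickle(byte[] data) {
        return new ByteArrayInputStream(data) {
            @Override
            public synchronized int read(byte[] b, int off, int len) {
                return super.read(b, off, Math.min(len, 1));
            }
        };
    }
}
```

A single read(buf, 0, 5) on the trickle stream returns 1, which is the situation described for dbf files inside a zip; readFully keeps reading until the buffer is full. Widening to long keeps a record count above Integer.MAX_VALUE from turning negative.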
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234986#comment-15234986 ] Tim Allison commented on TIKA-1513: --- [~iryndin], any interest in working on this? > Add mime detection and parsing for dbf files > > > Key: TIKA-1513 > URL: https://issues.apache.org/jira/browse/TIKA-1513 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Minor > Fix For: 1.13 > > > I just came across an Apache licensed dbf parser that is available on > [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. > Let's add dbf parsing to Tika. > Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234984#comment-15234984 ] Tim Allison commented on TIKA-1513: --- Great. Thank you! > Add mime detection and parsing for dbf files > > > Key: TIKA-1513 > URL: https://issues.apache.org/jira/browse/TIKA-1513 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Minor > Fix For: 1.13 > > > I just came across an Apache licensed dbf parser that is available on > [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. > Let's add dbf parsing to Tika. > Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234276#comment-15234276 ] Nick C commented on TIKA-1513: --

I wrote the detector from scratch a couple of months ago because 0x03 caused too many false positives. For the parser I ended up using jdbf but found some bugs. One was that the parser would error if inputStream.read(...) returned fewer bytes than required (the code needs to use something like IOUtils.readFully).

The logic I used was:
- Validate the signature
- Validate the header's last update date (is the month between 1 and 12, and is the day valid for that month?)
- Validate the header size by dividing by 32 and making sure there aren't more than 255 fields
- Calculate the file size using the record count, header length and record length from the header, making sure it's less than 4GB. If I can get the input stream length without reading the entire stream (TikaInputStream.hasLength or metadata.content_length), I make sure the calculated size matches (or is within 2 bytes).

I'll put the code up on github tomorrow and get a list of the jdbf bugs.

> Add mime detection and parsing for dbf files > > > Key: TIKA-1513 > URL: https://issues.apache.org/jira/browse/TIKA-1513 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Minor > Fix For: 1.13 > > > I just came across an Apache licensed dbf parser that is available on > [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. > Let's add dbf parsing to Tika. > Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
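The four checks described above could look roughly like the sketch below. It is not the code from the gist. It assumes the common dBASE III header layout (last update date in bytes 1-3 with the year stored as an offset from 1900, record count in bytes 4-7, header length in bytes 8-9, record length in bytes 10-11, all little-endian) and a 32-byte field descriptor followed by a one-byte terminator. The class name DbfHeaderCheck, the method looksLikeDbf, accepting only 0x03 as a signature, and passing -1 for an unknown stream length are all choices made for this illustration.

```java
import java.time.YearMonth;

// Hypothetical detector sketch, for illustration only.
final class DbfHeaderCheck {

    private static final long MAX_FILE_SIZE = 4L * 1024 * 1024 * 1024; // 4GB cap from the comment

    private static int u8(byte[] b, int off) {
        return b[off] & 0xFF;
    }

    private static int u16le(byte[] b, int off) {
        return u8(b, off) | (u8(b, off + 1) << 8);
    }

    private static long u32le(byte[] b, int off) {
        return u16le(b, off) | ((long) u16le(b, off + 2) << 16);
    }

    /**
     * @param header       at least the first 32 bytes of the file
     * @param streamLength total stream length, or -1 if it is not known cheaply
     */
    static boolean looksLikeDbf(byte[] header, long streamLength) {
        if (header == null || header.length < 32) {
            return false;
        }
        // 1. Signature (simplified here to the plain dBASE III value).
        if (u8(header, 0) != 0x03) {
            return false;
        }
        // 2. Last update date: month 1..12 and a day that exists in that month.
        int year = 1900 + u8(header, 1);
        int month = u8(header, 2);
        int day = u8(header, 3);
        if (month < 1 || month > 12) {
            return false;
        }
        if (day < 1 || day > YearMonth.of(year, month).lengthOfMonth()) {
            return false;
        }
        // 3. Header size: 32-byte prefix + 32 bytes per field + 1 terminator byte (assumed).
        int headerLength = u16le(header, 8);
        int recordLength = u16le(header, 10);
        if (headerLength < 33 || recordLength < 1) {
            return false;
        }
        int fieldCount = (headerLength - 33) / 32;
        if (fieldCount < 1 || fieldCount > 255) {
            return false;
        }
        // 4. Calculated file size must stay under 4GB and, if known, be within 2 bytes of the real length.
        long expected = headerLength + u32le(header, 4) * recordLength;
        if (expected >= MAX_FILE_SIZE) {
            return false;
        }
        return streamLength < 0 || Math.abs(streamLength - expected) <= 2;
    }
}
```

The 2-byte tolerance leaves room for an optional end-of-file marker. Because all four checks read only the first 12 bytes, the earlier ones are the kind of thing that might translate to a mime magic definition, while the date and size arithmetic would need a detector class.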
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234007#comment-15234007 ] Nick Burch commented on TIKA-1513: -- Is it based on JDBF, or did you write it from scratch? > Add mime detection and parsing for dbf files > > > Key: TIKA-1513 > URL: https://issues.apache.org/jira/browse/TIKA-1513 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Minor > Fix For: 1.13 > > > I just came across an Apache licensed dbf parser that is available on > [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. > Let's add dbf parsing to Tika. > Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233743#comment-15233743 ] Nick C commented on TIKA-1513: -- I ended up building a detector that tries to validate the dbf header instead of just looking for 0x03 which caused false positives. If you're interested I'll submit a patch. > Add mime detection and parsing for dbf files > > > Key: TIKA-1513 > URL: https://issues.apache.org/jira/browse/TIKA-1513 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Minor > Fix For: 1.13 > > > I just came across an Apache licensed dbf parser that is available on > [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. > Let's add dbf parsing to Tika. > Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14734927#comment-14734927 ] Tim Allison commented on TIKA-1513: --- Hi [~iryndin], I wanted to check in to see if you've had a chance to make any progress on this. I've let it go to the backburner for a bit. Thank you! > Add mime detection and parsing for dbf files > > > Key: TIKA-1513 > URL: https://issues.apache.org/jira/browse/TIKA-1513 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Minor > Fix For: 1.11 > > > I just came across an Apache licensed dbf parser that is available on > [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. > Let's add dbf parsing to Tika. > Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14508932#comment-14508932 ] Tim Allison commented on TIKA-1513: --- Oh, broken files, y, that would explain your concern. And, y, that's pretty bad. Would you be able to run file against a handful of your false positives to see what file says those files are? Add mime detection and parsing for dbf files Key: TIKA-1513 URL: https://issues.apache.org/jira/browse/TIKA-1513 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.9 I just came across an Apache licensed dbf parser that is available on [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. Let's add dbf parsing to Tika. Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14507995#comment-14507995 ] Luis Filipe Nassif commented on TIKA-1513: -- Hi Tim, I've processed a forensic disk copy with 533,949 files. I got 137 files detected as application/x-dbf using the 0x03 signature, all false positives. Not so good. Many of them are deleted/recovered files pointing to binary data. The reference you've posted (http://www.dbf2002.com/dbf-file-format.html) states that byte at offset 0x00 can have other values depending on file version or software vendor. And some of them are supported by jdbf. So I think 0x03 is also too restrictive. Add mime detection and parsing for dbf files Key: TIKA-1513 URL: https://issues.apache.org/jira/browse/TIKA-1513 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.9 I just came across an Apache licensed dbf parser that is available on [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. Let's add dbf parsing to Tika. Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14505057#comment-14505057 ] Luis Filipe Nassif commented on TIKA-1513: -- No, I did not give a try to 0x03. How many files are detected as octet-stream in govdocs1? I wouldn't like to hit an issue similar to TIKA-1554 again (I am indexing ALL desktop files). I will test 0x03 and report the results here. Can we at least decrease the magic priority to 10 or 20 for now? Add mime detection and parsing for dbf files Key: TIKA-1513 URL: https://issues.apache.org/jira/browse/TIKA-1513 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.9 I just came across an Apache licensed dbf parser that is available on [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. Let's add dbf parsing to Tika. Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504996#comment-14504996 ] Luis Filipe Nassif commented on TIKA-1513: -- Hi Tim, I am ok with 1) and 2). But I think a one-byte magic can result in many false positives, especially with binary files. My current approach is detection by extension only. That needed a declaration of text/plain as a supertype. Add mime detection and parsing for dbf files Key: TIKA-1513 URL: https://issues.apache.org/jira/browse/TIKA-1513 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.9 I just came across an Apache licensed dbf parser that is available on [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. Let's add dbf parsing to Tika. Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14505092#comment-14505092 ] Tim Allison commented on TIKA-1513: ---

Completely agree. Only 2,386 files. This is the table of the file extensions for files identified as application/octet-stream.

||File Extension||Count||
|dbase3|1664|
|wp|362|
|unk|285|
|gls|60|
|ileaf|4|
|sys|3|
|chp|2|
|lnk|2|
|mac|2|
|squeak|1|
|bin|1|

Would very much appreciate what you find, and yes, we can certainly decrease the priority...I had my priorities backwards. Sorry. Obviously, if you find false positives, we'll back off to file suffix. I, too, was less than enthusiastic about a single byte mime id'er. Thank you!

Add mime detection and parsing for dbf files Key: TIKA-1513 URL: https://issues.apache.org/jira/browse/TIKA-1513 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.9 I just came across an Apache licensed dbf parser that is available on [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. Let's add dbf parsing to Tika. Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14505006#comment-14505006 ] Tim Allison commented on TIKA-1513: --- Y, I was concerned by that generally. Are you getting false positives with 0x03 specifically? I didn't find any in govdocs1, but I realize that corpus has limitations. Will add text/plain as supertype. Thank you! Add mime detection and parsing for dbf files Key: TIKA-1513 URL: https://issues.apache.org/jira/browse/TIKA-1513 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.9 I just came across an Apache licensed dbf parser that is available on [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. Let's add dbf parsing to Tika. Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504951#comment-14504951 ] Tim Allison commented on TIKA-1513: ---

From govdocs1, it looks like a first byte of 0x03 is a safe way to identify these files. [This|http://www.digitalpreservation.gov/formats/fdd/fdd000325.shtml] was useful.

Two mime type questions:
1) What should we use as the canonical mime type for .dbf files? Proposal: {{application/x-dbf}}.
2) What mimes should the parser accept, or what should we include in the aliases? From [filext.com|http://filext.com/file-extension/DBF]:
* application/dbase
* application/x-dbase
* application/dbf
* application/x-dbf
* zz-application/zz-winassoc-dbf

First attempt at mime definition:
{noformat}
<mime-type type="application/x-dbf">
  <magic priority="100">
    <match value="0x03" type="string" offset="0"/>
  </magic>
  <glob pattern="*.dbf"/>
  <glob pattern="*.dbase"/>
</mime-type>
{noformat}

Add mime detection and parsing for dbf files Key: TIKA-1513 URL: https://issues.apache.org/jira/browse/TIKA-1513 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.9 I just came across an Apache licensed dbf parser that is available on [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. Let's add dbf parsing to Tika. Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14506214#comment-14506214 ] Tim Allison commented on TIKA-1513: --- In looking at [this|http://www.dbf2002.com/dbf-file-format.html], I wonder if we could add 0x00 at 30 and 31? I'm currently grepping the Common Crawl slice from Julien Nioche for files starting with 0x03, and I'm getting a vast majority .dbf, but there are some that end in .dct, .ndx (dbf index?), .tfm, .ctg... Will report findings tomorrow. Add mime detection and parsing for dbf files Key: TIKA-1513 URL: https://issues.apache.org/jira/browse/TIKA-1513 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.9 I just came across an Apache licensed dbf parser that is available on [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. Let's add dbf parsing to Tika. Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14502716#comment-14502716 ] Tim Allison commented on TIKA-1513: --- Hi [~iryndin], I wanted to check in to see how the cleanup/mavenizing is going. Thank you! Add mime detection and parsing for dbf files Key: TIKA-1513 URL: https://issues.apache.org/jira/browse/TIKA-1513 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.9 I just came across an Apache licensed dbf parser that is available on [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. Let's add dbf parsing to Tika. Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293503#comment-14293503 ] Tim Allison commented on TIKA-1513: --- Ah, ok. These are the links that I came across: [general structure|https://msdn.microsoft.com/en-us/library/aa975386(v=vs.71).aspx] (with mention of codepage mark at byte 29) and mappings for the [code page byte|https://msdn.microsoft.com/en-us/library/aa975345(v=vs.71).aspx] and [here|http://support.microsoft.com/kb/129631/en-us]. I realize that there is always a difference between specs and reality. :) On charset detection, y, ngram naive bayes would be fun, but for now we'll use the built in charset detection that comes with Tika. Add mime detection and parsing for dbf files Key: TIKA-1513 URL: https://issues.apache.org/jira/browse/TIKA-1513 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.8 I just came across an Apache licensed dbf parser that is available on [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. Let's add dbf parsing to Tika. Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
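The code page mark at byte 29 mentioned above could be read along the lines of the sketch below, falling back to charset detection when the byte is unset. The class name DbfCodePageMark and the method charsetNameFor are hypothetical. The handful of byte values shown are recalled from the Visual FoxPro code page table and are an assumption here; they should be checked against the linked Microsoft pages before being relied on.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical lookup sketch, for illustration only.
final class DbfCodePageMark {

    static final int CODE_PAGE_MARK_OFFSET = 29;

    // Partial mapping; values are assumptions to verify against the Visual FoxPro documentation.
    private static final Map<Integer, String> NAMES = new HashMap<>();
    static {
        NAMES.put(0x01, "IBM437");        // U.S. MS-DOS
        NAMES.put(0x02, "IBM850");        // International MS-DOS
        NAMES.put(0x03, "windows-1252");  // Windows ANSI
        NAMES.put(0xC8, "windows-1250");  // Eastern European Windows
        NAMES.put(0xC9, "windows-1251");  // Russian Windows
        NAMES.put(0xCA, "windows-1254");  // Turkish Windows
        NAMES.put(0xCB, "windows-1253");  // Greek Windows
    }

    /** Returns a charset name for the code page mark, or null if it is 0x00 or not in the table. */
    static String charsetNameFor(byte[] header) {
        if (header == null || header.length <= CODE_PAGE_MARK_OFFSET) {
            return null;
        }
        return NAMES.get(header[CODE_PAGE_MARK_OFFSET] & 0xFF);
    }
}
```

A null result is the signal to fall back to Tika's built-in charset detection, which matters because, as noted elsewhere in this thread, the byte is often 0x00 in files not written by Visual FoxPro.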
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293842#comment-14293842 ] Ivan Ryndin commented on TIKA-1513: --- Yeah, I saw these articles. This code page byte probably exists only in files produced with Visual FoxPro. I haven't seen this byte set to anything other than 0x00 in the DBF files I work with. Add mime detection and parsing for dbf files Key: TIKA-1513 URL: https://issues.apache.org/jira/browse/TIKA-1513 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.8 I just came across an Apache licensed dbf parser that is available on [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. Let's add dbf parsing to Tika. Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292929#comment-14292929 ] Ivan Ryndin commented on TIKA-1513: --- There are no reliable ways to detect codepage of DBF files. I haven't met DBF specs where codepage is somehow specified with some special byte. The only way to determine codepage is trial and error. --- Possibly there can be one interesting approach to detect codepage similar to that used in language detection. This is statistics based approach. I mean n-gram based language detection methods. I haven't met any ready-to-use framework to detect codepage this way. However, not sure it is worth implementing. Add mime detection and parsing for dbf files Key: TIKA-1513 URL: https://issues.apache.org/jira/browse/TIKA-1513 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.8 I just came across an Apache licensed dbf parser that is available on [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. Let's add dbf parsing to Tika. Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14287980#comment-14287980 ] Tim Allison commented on TIKA-1513: --- [~iryndin], on codepage detection in dbf...in one of the specs I read, it looks like there is a byte in the header that may or may not be set that specifies the codepage for the table. Are you, by chance, parsing that? If we wanted to integrate our charset detector, would we call getBytes() on the first X DbfRecords, run those through our detector and then reprocess the stream with that charset? I installed OpenOffice so that I could create test dbf documents, but the results have been pretty poor. Add mime detection and parsing for dbf files Key: TIKA-1513 URL: https://issues.apache.org/jira/browse/TIKA-1513 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.8 I just came across an Apache licensed dbf parser that is available on [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. Let's add dbf parsing to Tika. Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14280211#comment-14280211 ] Ivan Ryndin commented on TIKA-1513: --- Hi guys! I started working on jdbf push to Maven Central. I think this will take 1-2 weeks for me - review code once more, create javadocs, update POM file according to instructions. I'll drop a note here when it will be ready. Add mime detection and parsing for dbf files Key: TIKA-1513 URL: https://issues.apache.org/jira/browse/TIKA-1513 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.8 I just came across an Apache licensed dbf parser that is available on [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. Let's add dbf parsing to Tika. Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14280367#comment-14280367 ] Tim Allison commented on TIKA-1513: --- [~iryndin], No rush on our side (well, at least mine). :) I look forward to testing jdbf and potentially integrating it into Tika. What's your level of interest in ongoing support? Thank you for pushing it into the public Maven repo! Add mime detection and parsing for dbf files Key: TIKA-1513 URL: https://issues.apache.org/jira/browse/TIKA-1513 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.8 I just came across an Apache licensed dbf parser that is available on [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. Let's add dbf parsing to Tika. Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14280391#comment-14280391 ] Ivan Ryndin commented on TIKA-1513: --- Well, I plan ongoing support of JDBF, though I left the project which it was done for (linux-hosted java webapp where there was need to read/write DBFs as one of exchange formats). What do you mean by rushing onto your side? ;-) Do you invite me to work on some TIKA issues? Add mime detection and parsing for dbf files Key: TIKA-1513 URL: https://issues.apache.org/jira/browse/TIKA-1513 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.8 I just came across an Apache licensed dbf parser that is available on [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. Let's add dbf parsing to Tika. Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14280398#comment-14280398 ] Tim Allison commented on TIKA-1513: --- Great! Well, yes, we're always looking to build the community. Please join in! What I meant, though, was that I probably won't get to this for a few weeks myself, so your estimate of 1-2 weeks is great. Thank you, again! Add mime detection and parsing for dbf files Key: TIKA-1513 URL: https://issues.apache.org/jira/browse/TIKA-1513 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.8 I just came across an Apache licensed dbf parser that is available on [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. Let's add dbf parsing to Tika. Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14280401#comment-14280401 ] Ivan Ryndin commented on TIKA-1513: --- Well, okay, let my first job for the TIKA project be pushing JDBF to Maven Central, and then let's discuss my further steps. Add mime detection and parsing for dbf files Key: TIKA-1513 URL: https://issues.apache.org/jira/browse/TIKA-1513 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.8 I just came across an Apache licensed dbf parser that is available on [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. Let's add dbf parsing to Tika. Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279008#comment-14279008 ] Tim Allison commented on TIKA-1513: --- Thank you, [~lfcnassif] and [~gagravarr]! I think I'll work on the sqlite parser integration first and then turn to this...maybe this will be in maven by then? :) Add mime detection and parsing for dbf files Key: TIKA-1513 URL: https://issues.apache.org/jira/browse/TIKA-1513 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.8 I just came across an Apache licensed dbf parser that is available on [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. Let's add dbf parsing to Tika. Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278678#comment-14278678 ] Nick Burch commented on TIKA-1513: -- If it's the project themselves pushing it to central, then the docs to follow are http://central.sonatype.org/pages/ossrh-guide.html and http://central.sonatype.org/pages/apache-maven.html#performing-a-release-deployment-with-the-maven-release-plugin Add mime detection and parsing for dbf files Key: TIKA-1513 URL: https://issues.apache.org/jira/browse/TIKA-1513 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.8 I just came across an Apache licensed dbf parser that is available on [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. Let's add dbf parsing to Tika. Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278674#comment-14278674 ] Luis Filipe Nassif commented on TIKA-1513: -- I talked to iryndin and he liked the idea to push jdbf to maven central. Can someone with experience on that help him? Add mime detection and parsing for dbf files Key: TIKA-1513 URL: https://issues.apache.org/jira/browse/TIKA-1513 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.8 I just came across an Apache licensed dbf parser that is available on [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. Let's add dbf parsing to Tika. Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276916#comment-14276916 ] Tim Allison commented on TIKA-1513: --- From a brochure-level evaluation :), I'd prefer jdbf. If we want to carry out an evaluation on the 2600 govdocs1 files, we'll have to implement wrappers for both. I propose the following: 1) I'll build a parser with jamel. If basic functional tests look decent on the 2600 files from govdocs1, I'll commit that to Tika, and we'll have basic mavenized dbf support. 2) After that I'll build a parser with jdbf and we can compare output on the govdocs1 files. If jdbf results are equal or better, we can try to persuade iryndin to push to maven. ?? Add mime detection and parsing for dbf files Key: TIKA-1513 URL: https://issues.apache.org/jira/browse/TIKA-1513 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.8 I just came across an Apache licensed dbf parser that is available on [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. Let's add dbf parsing to Tika. Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275553#comment-14275553 ] Tim Allison commented on TIKA-1513: --- Any interest in encouraging iryndin to push to maven? Add mime detection and parsing for dbf files Key: TIKA-1513 URL: https://issues.apache.org/jira/browse/TIKA-1513 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.8 I just came across an Apache licensed dbf parser that is available on [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. Let's add dbf parsing to Tika. Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275636#comment-14275636 ]
Luis Filipe Nassif commented on TIKA-1513:
I can if the community thinks that jdbf is a better option.
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275542#comment-14275542 ]
Luis Filipe Nassif commented on TIKA-1513:
I have found https://github.com/iryndin/jdbf. Seems to be more active and to support more field types (memo, picture, etc) and more dbf formats. Its pom file declares Apache v2 license. But I could not find it on maven.
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275679#comment-14275679 ]
Konstantin Gribov commented on TIKA-1513:
[~talli...@mitre.org], I think it's a good idea, even if Tika won't use it as a dependency.
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275467#comment-14275467 ]
Konstantin Gribov commented on TIKA-1513:
Is this lib alive? The last commits were in mid 2014, some issues are from late 2013, and the PRs are from mid 2014. At the least, extensive testing is needed. Do we have any freely accessible dbf files to use in tests?
[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275472#comment-14275472 ]
Tim Allison commented on TIKA-1513:
I share your concern. There are ~2600 .dbase3 files in govdocs1.