[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2016-05-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15302653#comment-15302653
 ] 

Hudson commented on TIKA-1513:
--

UNSTABLE: Integrated in tika-2.x #103 (See 
[https://builds.apache.org/job/tika-2.x/103/])
TIKA-1513 -- update mime type according to Nick Burch's recommendation, 
(tallison: rev 15ec358c44867adc44ab0431960d565b3d8a3e2c)
* tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFParser.java
* tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFReader.java
* tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/dbf/DBFParserTest.java


> Add mime detection and parsing for dbf files
> 
>
> Key: TIKA-1513
> URL: https://issues.apache.org/jira/browse/TIKA-1513
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 2.0, 1.14
>
>
> I just came across an Apache licensed dbf parser that is available on 
> [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom].
> Let's add dbf parsing to Tika.
> Any other recommendations for alternate parsers?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2016-05-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15302219#comment-15302219
 ] 

Hudson commented on TIKA-1513:
--

FAILURE: Integrated in tika-2.x-windows #7 (See 
[https://builds.apache.org/job/tika-2.x-windows/7/])
TIKA-1513 -- update mime type according to Nick Burch's recommendation, 
(tallison: rev 15ec358c44867adc44ab0431960d565b3d8a3e2c)
* tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFReader.java
* tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFParser.java
* tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/dbf/DBFParserTest.java




[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2016-05-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15302183#comment-15302183
 ] 

Hudson commented on TIKA-1513:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #1001 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/1001/])
TIKA-1513 -- update mime type according to Nick Burch's recommendation, 
(tallison: rev dcaeccbab69519811e0cdf349873ce2b51e6ca10)
* tika-parsers/src/test/java/org/apache/tika/parser/dbf/DBFParserTest.java
* tika-parsers/src/main/java/org/apache/tika/parser/dbf/DBFReader.java
* tika-parsers/src/main/java/org/apache/tika/parser/dbf/DBFParser.java




[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2016-05-26 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15302076#comment-15302076
 ] 

Tim Allison commented on TIKA-1513:
---

[~iryndin], would you mind if we added your test files (tir_im.dbf, gds_im.dbf, 
texto*) to our unit tests?



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2016-05-25 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15300737#comment-15300737
 ] 

Nick Burch commented on TIKA-1513:
--

I haven't read much on the format, but I'd be tempted to maybe have that more 
like `application/x-dbf; vendor=FoxBASE; type=plus_with_memo`, or to have it 
more in keeping with the BDB / PE / DITA types, maybe `application/x-dbf; 
format=FoxBASE; type=plus_with_memo` or `application/x-dbf; 
format=plus_with_memo; vendor=FoxBASE` (depending on what the actual variances 
are)
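
All three spellings above keep `application/x-dbf` as the base type and differ only in their parameters, so a consumer that matches on the base type is unaffected by which one is chosen. A minimal Python sketch of that split follows; Python's standard `email` package is used here only as a convenient media-type parser, `split_media_type` is an illustrative name, and none of this is how Tika itself handles media types:

```python
from email.message import Message


def split_media_type(value):
    """Split a media type string into (base type, parameter dict)."""
    msg = Message()
    msg["Content-Type"] = value
    # get_params() returns the base type first, followed by the parameters.
    params = dict(msg.get_params()[1:])
    return msg.get_content_type(), params


base, params = split_media_type(
    "application/x-dbf; format=FoxBASE; type=plus_with_memo")
print(base, params)
```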



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2016-05-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15300460#comment-15300460
 ] 

Hudson commented on TIKA-1513:
--

FAILURE: Integrated in tika-2.x-windows #6 (See 
[https://builds.apache.org/job/tika-2.x-windows/6/])
TIKA-1513: add mime detection and parser for DBF files.  Thanks to Nick 
(tallison: rev 8d24e07fb1245de0e151e9ce3fd516651db1d989)
* tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFFileHeader.java
* tika-parser-bundles/tika-parser-office-bundle/src/test/java/org/apache/tika/module/office/BundleIT.java
* tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/dbf/DBFParserTest.java
* tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* CHANGES.txt
* tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFRow.java
* tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFReader.java
* tika-parser-modules/tika-parser-office-module/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
* tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFCell.java
* tika-test-resources/src/test/resources/test-documents/testDBF_gb18030.dbf
* tika-test-resources/src/test/resources/test-documents/testDBF.dbf
* tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFParser.java
* tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFColumnHeader.java




[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2016-05-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15300425#comment-15300425
 ] 

Hudson commented on TIKA-1513:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #999 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/999/])
TIKA-1513 -- add mime detection and parsing for dbf files. Thanks to (tallison: 
rev e74f66375f20d914f8585597b6d9492586a0caa9)
* tika-parsers/src/test/resources/test-documents/testDBF_gb18030.dbf
* tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* tika-parsers/src/main/java/org/apache/tika/parser/dbf/DBFCell.java
* tika-parsers/src/test/java/org/apache/tika/parser/dbf/DBFParserTest.java
* tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
* tika-parsers/src/main/java/org/apache/tika/parser/dbf/DBFFileHeader.java
* tika-parsers/src/main/java/org/apache/tika/parser/dbf/DBFColumnHeader.java
* tika-parsers/src/main/java/org/apache/tika/parser/dbf/DBFRow.java
* tika-parsers/src/test/resources/test-documents/testDBF.dbf
* tika-parsers/src/main/java/org/apache/tika/parser/dbf/DBFReader.java
* tika-parsers/src/main/java/org/apache/tika/parser/dbf/DBFParser.java
TIKA-1513 -- add mime detection and parsing for dbf files. Thanks to (tallison: 
rev cb492f4b16ccdd0c0d8129f215e75a14f294cc89)
* CHANGES.txt




[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2016-05-25 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15299884#comment-15299884
 ] 

Tim Allison commented on TIKA-1513:
---

[~nicholasc], do you, by chance, have any shareable examples of files that 
don't start with 0x03, e.g. Visual FoxPro, dBase IV, etc.?  Any shareable 
examples of .dbt (memo) files?  Thank you again for the mime-detection regex!

How do we want to handle detecting the variants?  

Option 1: replicate the above regex for each variant and change the first byte, 
with parent mime type "application/x-dbf"?
Option 2: send them all to the DBFParser, and let the parser update the mime type.

How do we want to represent the variants via the mime, e.g. for 0x30 Visual 
FoxPro: "application/x-dbf; Visual FoxPro"?
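
Option 2 could look roughly like the sketch below. It is written in Python for brevity, and the variant table, the `format` parameter name and its values are made up for illustration; this is not what DBFParser does. It maps the first byte to a refined type and falls back to plain `application/x-dbf` when the byte is unknown:

```python
# Hypothetical variant table keyed by the first (signature) byte.
DBF_VARIANTS = {
    0x03: "dbase3",            # dBase III / FoxBASE+ without memo
    0x30: "visual_foxpro",     # Visual FoxPro
    0x83: "dbase3_with_memo",  # dBase III / FoxBASE+ with memo
}


def refine_dbf_type(first_byte):
    """Return application/x-dbf, plus a format parameter if the variant is known."""
    variant = DBF_VARIANTS.get(first_byte)
    if variant is None:
        return "application/x-dbf"
    return "application/x-dbf; format=" + variant
```

With Option 1 the same table would instead be expanded into one magic entry per signature byte in the mime definitions.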





> Add mime detection and parsing for dbf files
> 
>
> Key: TIKA-1513
> URL: https://issues.apache.org/jira/browse/TIKA-1513
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.14
>
>
> I just came across an Apache licensed dbf parser that is available on 
> [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom].
> Let's add dbf parsing to Tika.
> Any other recommendations for alternate parsers?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2016-05-24 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15299033#comment-15299033
 ] 

Tim Allison commented on TIKA-1513:
---

Rolled our own parser. Will commit tomorrow.



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2016-05-11 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280148#comment-15280148
 ] 

Tim Allison commented on TIKA-1513:
---

[~iryndin], now that 1.13 is in the voting process, I'd like to re-engage on 
this issue for 1.14.  Would you be willing to make the updates that 
[~nicholasc] recommended and push to maven central?  Or, again, as a far less 
preferable option, would you object to us incorporating your code within Tika?



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2016-04-26 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15258147#comment-15258147
 ] 

Tim Allison commented on TIKA-1513:
---

Great.  Frankly, the initial regex looked quite good...small handful of false 
positives.  I look forward to running this on our corpus once 1.13 is released.

Once we get feedback from [~iryndin] on the parser, it'll be great to add 
detection and parsing in one go.

Thank you, again.



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2016-04-25 Thread Nick C (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15256870#comment-15256870
 ] 

Nick C commented on TIKA-1513:
--

Tested more files using the full regex and haven't had any false positives. :D



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2016-04-19 Thread Nick C (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249129#comment-15249129
 ] 

Nick C commented on TIKA-1513:
--

Sounds good. I'll be running this on more files this week and will report back 
if I notice any false positives. If you want, you can make the field type check 
stricter, which might prevent other false positives (replace 
\[A-Z\@\+\] with \[BCDFGILMNOPQTVWXY\@\+\]).

||Details||Regex
|Enable dotall mode (so dots match new lines)|(?s)
|Signature/Version|^\[\x02\x03\x30\x31\x32\x43\x63\x83\x8B\xCB\xF5\xE5\xFB]
|Year (no check)|.
|Month (1-12)|\[\x01-\x0C]
|Day (1-31)|\[\x01-\x1F]
|Record count (uint32, no check)|.\{4}
|Header length (ushort, little-endian) of at least 65 (0x41)|(?:.\[^\x00]\|\[\x41-\xFF].)
|Record length (ushort, little-endian) greater than 1|(?:\[^\x00\x01].\|.\[^\x00])
|Skip the rest of the file header (20 bytes) and the first field's name (11 bytes)|.\{31}
|Make sure field name is null terminated (regex zero-width lookbehind)|(?<=\[\x00]\[^\x00]\{0,10})
|Field type|\[BCDFGILMNOPQTVWXY@+]

Full Regex
{code}
(?s)^[\x02\x03\x30\x31\x32\x43\x63\x83\x8B\xCB\xF5\xE5\xFB].[\x01-\x0C][\x01-\x1F].{4}(?:.[^\x00]|[\x41-\xFF].)(?:[^\x00\x01].|.[^\x00]).{31}(?<=[\x00][^\x00]{0,10})[BCDFGILMNOPQTVWXY@+]
{code}
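
The checks in the table can also be written out field by field, which may be easier to review than the regex. A minimal Python sketch follows, assuming the usual DBF layout (32-byte file header, then 32-byte field descriptors with an 11-byte name and the type at offset 11). `looks_like_dbf` is an illustrative name, and the sketch is meant to mirror the regex above, not the detection that was committed to Tika:

```python
import struct

# Signature bytes and field types taken from the regex above.
DBF_SIGNATURES = {0x02, 0x03, 0x30, 0x31, 0x32, 0x43, 0x63,
                  0x83, 0x8B, 0xCB, 0xF5, 0xE5, 0xFB}
FIELD_TYPES = b"BCDFGILMNOPQTVWXY@+"


def looks_like_dbf(data):
    """Apply the header checks from the table to the first 44 bytes."""
    if len(data) < 44:
        return False
    # version, year, month, day, record count, header length, record length
    version, _year, month, day, _count, header_len, record_len = \
        struct.unpack_from("<4BIHH", data, 0)
    field_name = data[32:43]   # 11-byte, null-padded name of the first field
    field_type = data[43]      # type byte of the first field
    return (version in DBF_SIGNATURES
            and 1 <= month <= 12
            and 1 <= day <= 31
            and header_len >= 65
            and record_len > 1
            and b"\x00" in field_name
            and field_type in FIELD_TYPES)
```

The `b"\x00" in field_name` test plays the role of the lookbehind: a null somewhere in offsets 32 to 42 is the same as a null followed by at most 10 non-null bytes directly before the type byte.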

> Add mime detection and parsing for dbf files
> 
>
> Key: TIKA-1513
> URL: https://issues.apache.org/jira/browse/TIKA-1513
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.13
>
>
> I just came across an Apache licensed dbf parser that is available on 
> [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom].
> Let's add dbf parsing to Tika.
> Any other recommendations for alternate parsers?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2016-04-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248493#comment-15248493
 ] 

Tim Allison commented on TIKA-1513:
---

I won't commit this until we get our corpus results back...perhaps I'll redo 
the run with this if there's time.

Coincidentally, on this 
[comparison|http://162.242.228.174/mimes/mime_comparisons.html], it looks like 
DROID is identifying ~3k files in our corpus as some version of dbase.

In your spare time, if you could document that work of art, that'd be handy.  
Thank you, again.



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2016-04-19 Thread Nick C (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248463#comment-15248463
 ] 

Nick C commented on TIKA-1513:
--

I was running this on more data and ran into a text file that matched. It 
started with a 2 (\x32) followed by 3 newlines. I had to make a small change 
that checks for a null byte before the field type (field names are null 
terminated).

{code:xml}



{code}
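
The false positive is easy to reproduce. In the Python sketch below, the sample bytes are made up, and the pattern is the header regex from this thread with the null-terminator lookbehind left out. Python's `re` does not accept a variable-width lookbehind, so the null check is done as a separate slice test. The three newlines pass as year, month 10 and day 10, and any allowed letter at offset 43 passes as a field type:

```python
import re

# The thread's header regex without the null-terminator lookbehind.
LOOSE_DBF = re.compile(
    rb"^[\x02\x03\x30\x31\x32\x43\x63\x83\x8B\xCB\xF5\xE5\xFB]"  # signature
    rb"."                          # year (no check)
    rb"[\x01-\x0C][\x01-\x1F]"     # month, day
    rb".{4}"                       # record count (no check)
    rb"(?:.[^\x00]|[\x41-\xFF].)"  # header length
    rb"(?:[^\x00\x01].|.[^\x00])"  # record length
    rb".{31}"                      # rest of file header + first field name
    rb"[BCDFGILMNOPQTVWXY@+]",     # first field type
    re.DOTALL,
)


def name_is_null_terminated(data):
    """True if the 11-byte field name slot (offsets 32..42) contains a null."""
    return b"\x00" in data[32:43]


# Made-up text file: "2", three newlines, filler text, and a "C" at offset 43.
text_file = b"2\n\n\n" + b"x" * 39 + b"C and so on"

print(bool(LOOSE_DBF.match(text_file)))    # header checks alone
print(name_is_null_terminated(text_file))  # the added null-byte check
```

The header checks alone accept the text file; the null-byte check rejects it because plain text has no null in the name slot.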



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2016-04-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15247579#comment-15247579
 ] 

Tim Allison commented on TIKA-1513:
---

I'll add this before running the final (?) 1.13 regression tests and see what 
happens.  Thank you!



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2016-04-17 Thread Nick C (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245081#comment-15245081
 ] 

Nick C commented on TIKA-1513:
--

Did some more testing and simplified the rules enough that they could be made 
into a regex. It's not pretty, but it works. It checks the signature/version, 
month (1-12), day (1-31), header length > 65, record length > 1, and the first 
field's type (could be stricter).






[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2016-04-14 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240982#comment-15240982
 ] 

Tim Allison commented on TIKA-1513:
---

Nope.  Didn't remove them.  There are roughly 3k files that ended with dbf or 
dbase3 in govdocs1 and an earlier version of our slice of commoncrawl.
The files may not actually be dbfs, and they're likely truncated (at least 
those that came from commoncrawl).

Give [this|http://162.242.228.174/share/dbfs.tar.bz2] a shot.

Thank you, Rackspace! 



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2016-04-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239992#comment-15239992
 ] 

Tim Allison commented on TIKA-1513:
---

bq. At least 200. I would like more to test with though.

I think I rm'd the bz2 I shared with Ivan up above.  I'll see what I can dig up.



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2016-04-13 Thread Nick C (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239960#comment-15239960
 ] 

Nick C commented on TIKA-1513:
--

bq. Well, you know there's still plenty of time to get that into Tika 2.0

Maybe I'll add that to my to do list. I have been wanting to work on improving 
the RTF parser to handle tables/html and generate valid xhtml (multiple lists 
seem to cause issues)

bq. Ballpark, how many dbfs do you have to dev with? Do you want some from our 
test corpus?

At least 200. I would like more to test with though.



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2016-04-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239936#comment-15239936
 ] 

Tim Allison commented on TIKA-1513:
---

bq. It'd be nice if Tika's mime definition allowed for more complex matching 
like the Linux magic db.

Well, you know there's still plenty of time to get that into Tika 2.0. :)

bq. I'll do some testing to see how far in the code the false positives I had 
stop matching and determine if I can make it simple enough to be a mime 
definition

Great.  Thank you, again.  Ballpark, how many dbfs do you have to dev with?  Do 
you want some from our test corpus?



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2016-04-13 Thread Nick C (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239836#comment-15239836
 ] 

Nick C commented on TIKA-1513:
--

I added the license header. I think some of the checks could be removed. I'll 
do some testing to see how far in the code the false positives I had stop 
matching and determine if I can make it simple enough to be a mime definition. 
It'd be nice if Tika's mime definition allowed for more complex matching like 
the Linux magic db.

I also don't mind forking it into Tika or hosting it. A lot of the classes seem 
to be unused in jdbf v3 so it could be slimmed down to just a couple.



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2016-04-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239067#comment-15239067
 ] 

Tim Allison commented on TIKA-1513:
---

Is there any interest in forking jdbf into Tika? Or, [~nicholasc], do you 
have any interest in hosting it/pushing it to Maven?  I'd far, far prefer to 
update [~iryndin]'s code in place and avoid forking if possible.



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2016-04-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239062#comment-15239062
 ] 

Tim Allison commented on TIKA-1513:
---

[~gagravarr], would you mind taking a look at the detector?  Is there a way 
that we can convert this to a mime definition?  Or should we add a DBFDetector?

[~nicholasc], it looks great to me.  I agree that we'll probably want to relax 
some of the length checks (just make sure they're > 0 or something 
reasonable)...we wouldn't want this to fail on truncated dbfs, and as you've 
pointed out, there can be extra bytes at the end of the file.  If there's any 
way to avoid adding the dependency, that'd be great...although, I very much 
appreciate the concern for overflow!

In your experience, do we need to validate the fieldentry or can we stop 
sooner?  If we do, then I suspect there's no way to convert to a mime 
definition, but I suspect much of the earlier stuff could easily be translated.

Thank you!



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2016-04-11 Thread Nick C (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15236171#comment-15236171
 ] 

Nick C commented on TIKA-1513:
--

Some of my checks may be a little strict, because a file can have extra bytes 
at the end and after the field headers (I haven't personally seen any files 
like that, though). In those cases, I figure the file extension glob will 
hopefully match. I put in some TODOs that can be changed to call jdbf for 
validating the DBF file type and field type. Feel free to do what you want 
with the code:
https://gist.github.com/fxfixer/e54f86095a548cbfb8aeb948ff77a41b

I used the jdbf v3 branch and here are the bugs I noticed. If [~iryndin] is 
interested I'll create a pull request.
Calls to input.read(byte[]…) should use IOUtils.readFully. (Sometimes, if the 
dbf is in a zip file, the read call returns fewer than the requested bytes.)

DBFMetadataReader.readHeader()
- Needs to call IOUtils.readFully when reading headerBytes
- NPE if DbfFileTypeEnum.fromInt returns null (maybe throw an 
unsupported exception?)
- Reads record count as int instead of unsigned int

DBFRecordIterator
- Unnecessary call to Arrays.fill to set byte[] bytes to 0 (not really 
a bug)
- Needs to call IOUtils.readFully when reading recordBuffer

Some encoding names are not correct in CharsetHelper.getCharsetByByte:
936 = cp936 // Chinese (PRC, Singapore) Windows
932 = cp932 // Japanese Windows
1255 = Windows-1255 // Hebrew Windows
1256 = Windows-1256 // Arabic Windows
1250 = Windows-1250 // Eastern European Windows
1251 = Windows-1251 // Russian Windows
1254 = Windows-1254 // Turkish Windows
1253 = Windows-1253 // Greek Windows
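
The two header-reading problems listed above (short reads and a signed record 
count) can be sketched in plain Java. This is only an illustration, not jdbf's 
or Tika's code: the class DbfReadSketch and its method names are hypothetical, 
and the loop merely stands in for what IOUtils.readFully does. It assumes the 
record count is the 32-bit little-endian value at bytes 4-7 of the header.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper for illustration; not part of jdbf or Tika. */
final class DbfReadSketch {

    /** Keeps reading until buf is full, because one read() may return fewer bytes. */
    static void readFully(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) {
                throw new EOFException(
                        "stream ended after " + off + " of " + buf.length + " bytes");
            }
            off += n;
        }
    }

    /** Decodes 4 little-endian bytes as an unsigned 32-bit value held in a long. */
    static long readUnsignedIntLE(byte[] buf, int offset) {
        return (buf[offset] & 0xFFL)
                | (buf[offset + 1] & 0xFFL) << 8
                | (buf[offset + 2] & 0xFFL) << 16
                | (buf[offset + 3] & 0xFFL) << 24;
    }
}
```

Holding the count in a long keeps values above Integer.MAX_VALUE from turning 
negative, which is what reading the field as a signed int would do.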



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2016-04-11 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234986#comment-15234986
 ] 

Tim Allison commented on TIKA-1513:
---

[~iryndin], any interest in working on this?



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2016-04-11 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234984#comment-15234984
 ] 

Tim Allison commented on TIKA-1513:
---

Great.  Thank you!



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2016-04-10 Thread Nick C (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234276#comment-15234276
 ] 

Nick C commented on TIKA-1513:
--

I wrote the detector from scratch a couple of months ago because 0x03 caused 
too many false positives. For the parser I ended up using jdbf but found some 
bugs. One was that the parser would error if inputStream.read(...) returned 
fewer than the number of required bytes (the code needs to use something like 
IOUtils.readFully).

The logic I used was:
- Validate the signature
- Validate the header's last update date (is the month between 1 and 12, and 
is the day valid for that month?)
- Validate the header size by dividing by 32 and making sure there aren't more 
than 255 fields
- Calculate the file size using the record count, header length and record 
length from the header, making sure it's less than 4GB. If I can get the input 
stream length without reading the entire stream (TikaInputStream.hasLength or 
metadata.content_length), I make sure the calculated size matches (or is within 
2 bytes).
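
The steps above can be sketched roughly as follows. This is not the code from 
the gist: DbfHeaderCheckSketch, looksLikeDbf and the knownLength parameter are 
hypothetical names. The sketch assumes the common dBASE header layout (date in 
bytes 1-3 as YY MM DD with a 1900 base, record count in bytes 4-7, header 
length in bytes 8-9, record length in bytes 10-11, all little-endian), accepts 
only the 0x03 signature, and assumes the header is a 32-byte block followed by 
32-byte field descriptors and a terminator byte.

```java
import java.time.DateTimeException;
import java.time.LocalDate;

/** Hypothetical sketch of the listed checks; not the actual detector. */
final class DbfHeaderCheckSketch {

    static final long MAX_FILE_SIZE = 4L * 1024 * 1024 * 1024; // 4GB limit from the list

    static int u8(byte[] b, int i) {
        return b[i] & 0xFF;
    }

    static int u16le(byte[] b, int i) {
        return u8(b, i) | u8(b, i + 1) << 8;
    }

    static long u32le(byte[] b, int i) {
        return (u16le(b, i) & 0xFFFFL) | ((long) u16le(b, i + 2)) << 16;
    }

    /**
     * @param h           at least the first 32 bytes of the file
     * @param knownLength stream length if it is cheaply available, otherwise -1
     */
    static boolean looksLikeDbf(byte[] h, long knownLength) {
        if (h.length < 32) {
            return false;
        }
        // 1. Signature (assumption: only plain 0x03 tables are accepted here).
        if (u8(h, 0) != 0x03) {
            return false;
        }
        // 2. Last update date: LocalDate rejects month 0/13 and days like Feb 30.
        try {
            LocalDate.of(1900 + u8(h, 1), u8(h, 2), u8(h, 3));
        } catch (DateTimeException e) {
            return false;
        }
        // 3. Header size: one 32-byte block per field after the fixed 32 bytes.
        int headerLen = u16le(h, 8);
        int recordLen = u16le(h, 10);
        int fieldCount = headerLen / 32 - 1;
        if (fieldCount < 1 || fieldCount > 255) {
            return false;
        }
        // 4. Expected file size, computed in a long so it cannot overflow.
        long expected = headerLen + u32le(h, 4) * recordLen;
        if (expected >= MAX_FILE_SIZE) {
            return false;
        }
        // Compare with the real length only when it is known (within 2 bytes).
        return knownLength < 0 || Math.abs(knownLength - expected) <= 2;
    }
}
```

Checks 1 to 3 only look at fixed offsets, while the size comparison needs 
arithmetic on header fields.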

I'll put the code up on github tomorrow and get a list of the jdbf bugs.



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2016-04-10 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234007#comment-15234007
 ] 

Nick Burch commented on TIKA-1513:
--

Is it based on JDBF, or did you write it from scratch?



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2016-04-09 Thread Nick C (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233743#comment-15233743
 ] 

Nick C commented on TIKA-1513:
--

I ended up building a detector that tries to validate the dbf header instead of 
just looking for 0x03 which caused false positives. If you're interested I'll 
submit a patch.



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-09-08 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14734927#comment-14734927
 ] 

Tim Allison commented on TIKA-1513:
---

Hi [~iryndin], I wanted to check in to see if you've had a chance to make any 
progress on this.  I've let it go to the backburner for a bit.

Thank you!



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-04-23 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14508932#comment-14508932
 ] 

Tim Allison commented on TIKA-1513:
---

Oh, broken files, y, that would explain your concern.  And, y, that's pretty 
bad. 

Would you be able to run the file command against a handful of your false 
positives to see what it says those files are?



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-04-22 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14507995#comment-14507995
 ] 

Luis Filipe Nassif commented on TIKA-1513:
--

Hi Tim,

I've processed a forensic disk copy with 533,949 files. I got 137 files 
detected as application/x-dbf using the 0x03 signature, all false positives. 
Not so good. Many of them are deleted/recovered files pointing to binary data.

The reference you've posted (http://www.dbf2002.com/dbf-file-format.html) 
states that the byte at offset 0x00 can have other values depending on the 
file version or software vendor, and some of them are supported by jdbf. So I 
think 0x03 is also too restrictive.



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-04-21 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14505057#comment-14505057
 ] 

Luis Filipe Nassif commented on TIKA-1513:
--

No, I have not tried 0x03. How many files are detected as octet-stream 
in govdocs1? I wouldn't like to hit an issue similar to TIKA-1554 again (I am 
indexing ALL desktop files). I will test 0x03 and report the results here. Can 
we at least decrease the magic priority to 10 or 20 for now?



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-04-21 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504996#comment-14504996
 ] 

Luis Filipe Nassif commented on TIKA-1513:
--

Hi Tim,

I am ok with 1) and 2). But I think a one-byte magic can result in many false 
positives, especially for binary files. My current approach is detection by 
extension only. That required declaring text/plain as a supertype.



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-04-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14505092#comment-14505092
 ] 

Tim Allison commented on TIKA-1513:
---

Completely agree.  

Only 2,386 files.

This is the table of the file extensions for files identified as 
application/octet-stream.

||File Extension||Count||
|dbase3|1664|
|wp|362|
|unk|285|
|gls|60|
|ileaf|4|
|sys|3|
|chp|2|
|lnk|2|
|mac|2|
|squeak|1|
|bin|1|

Would very much appreciate hearing what you find, and yes, we can certainly 
decrease the priority...I had my priorities backwards.  Sorry.

Obviously, if you find false positives, we'll back off to file suffix.  I, too, 
was less than enthusiastic about a single byte mime id'er.

Thank you!



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-04-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14505006#comment-14505006
 ] 

Tim Allison commented on TIKA-1513:
---

Y, I was concerned by that generally.  Are you getting false positives with 
0x03 specifically?  I didn't find any in govdocs1, but I realize that corpus 
has limitations.

Will add text/plain as supertype.  Thank you!



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-04-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504951#comment-14504951
 ] 

Tim Allison commented on TIKA-1513:
---

From govdocs1, it looks like a first byte of 0x03 is a safe way to identify 
these files.  

[This|http://www.digitalpreservation.gov/formats/fdd/fdd000325.shtml] was 
useful.

Two mime type questions:
1)  What should we use as the canonical mime type for .dbf files?  Proposal: 
{{application/x-dbf}}.

2)  What mimes should the parser accept, or what should we include in the 
aliases?
From [filext.com|http://filext.com/file-extension/DBF]:
* application/dbase
* application/x-dbase
* application/dbf
* application/x-dbf
* zz-application/zz-winassoc-dbf

First attempt at mime definition:
{noformat}
  <mime-type type="application/x-dbf">
    <magic priority="100">
      <match value="0x03" type="string" offset="0"/>
    </magic>
    <glob pattern="*.dbf"/>
    <glob pattern="*.dbase"/>
  </mime-type>
{noformat}



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-04-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14506214#comment-14506214
 ] 

Tim Allison commented on TIKA-1513:
---

In looking at [this|http://www.dbf2002.com/dbf-file-format.html], I wonder if 
we could also require 0x00 at offsets 30 and 31?
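
As a rough illustration of that tighter check, the sketch below tests the 
first byte together with bytes 30 and 31. The class and method names are 
hypothetical, and it assumes (per the format description linked above) that 
those two bytes are reserved and zero in well-formed files; it is not a 
replacement for a real mime definition.

```java
/** Hypothetical illustration of a three-byte magic check; not Tika code. */
final class DbfMagicSketch {

    /** True if the prefix has 0x03 at offset 0 and 0x00 at offsets 30 and 31. */
    static boolean matches(byte[] prefix) {
        return prefix.length >= 32
                && prefix[0] == 0x03
                && prefix[30] == 0x00
                && prefix[31] == 0x00;
    }
}
```

Three fixed bytes are still a weak signature, but they should reject more 
random binary data than a single byte does.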

I'm currently grepping the Common Crawl slice from Julien Nioche for files 
starting with 0x03, and I'm getting a vast majority .dbf, but there are some 
that end in .dct, .ndx (dbf index?), .tfm, .ctg...  Will report findings 
tomorrow.



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-04-20 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14502716#comment-14502716
 ] 

Tim Allison commented on TIKA-1513:
---

Hi [~iryndin], I wanted to check in to see how the cleanup/mavenizing is going. 
 Thank you! 



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-01-27 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293503#comment-14293503
 ] 

Tim Allison commented on TIKA-1513:
---

Ah, ok.  These are the links that I came across: [general 
structure|https://msdn.microsoft.com/en-us/library/aa975386(v=vs.71).aspx] 
(with mention of codepage mark at byte 29) and mappings for the [code page 
byte|https://msdn.microsoft.com/en-us/library/aa975345(v=vs.71).aspx] and 
[here|http://support.microsoft.com/kb/129631/en-us].  I realize that there is 
always a difference between specs and reality. :)
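
For illustration, reading that code page mark could look like the sketch 
below. The class and method names are hypothetical, and the table is 
deliberately partial: it holds only a few values commonly documented for 
Visual FoxPro tables, which should be checked against the linked pages before 
being relied on. A mark of 0x00, or any unlisted value, yields no charset.

```java
import java.nio.charset.Charset;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of mapping the byte at offset 29 to a Java charset. */
final class DbfCodePageSketch {

    // Partial table (assumption: values as commonly documented for Visual FoxPro).
    private static final Map<Integer, String> CODE_PAGE_MARKS = Map.of(
            0x01, "IBM437",        // U.S. MS-DOS
            0x02, "IBM850",        // International MS-DOS
            0x03, "windows-1252",  // Windows ANSI
            0xC8, "windows-1250",  // Eastern European Windows
            0xC9, "windows-1251"); // Russian Windows

    /** Returns the charset for the code page mark, or empty if unknown or unsupported. */
    static Optional<Charset> charsetFor(byte[] header) {
        if (header.length < 30) {
            return Optional.empty();
        }
        String name = CODE_PAGE_MARKS.get(header[29] & 0xFF);
        if (name == null || !Charset.isSupported(name)) {
            return Optional.empty();
        }
        return Optional.of(Charset.forName(name));
    }
}
```

An empty result is the cue to fall back to charset detection on the record 
bytes.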

On charset detection, y, ngram naive bayes would be fun, but for now we'll use 
the built in charset detection that comes with Tika.



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-01-27 Thread Ivan Ryndin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293842#comment-14293842
 ] 

Ivan Ryndin commented on TIKA-1513:
---

Yeah, I saw these articles. Probably this code page byte exists only in files 
produced with Visual FoxPro. 
In the DBF files I work with, I haven't seen this byte set to anything other 
than 0x00. 



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-01-26 Thread Ivan Ryndin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292929#comment-14292929
 ] 

Ivan Ryndin commented on TIKA-1513:
---

There are no reliable ways to detect the codepage of DBF files. I haven't come 
across DBF specs where the codepage is specified with some special byte.
The only way to determine the codepage is trial and error.
---
One possibly interesting approach would be to detect the codepage the way 
language detection is done, i.e. a statistics-based approach using n-gram 
based detection methods. I haven't come across any ready-to-use framework that 
detects the codepage this way. However, I'm not sure it is worth 
implementing.
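
A very small version of the trial-and-error idea can be sketched with strict 
decoders from the JDK. This is a stand-in, not the n-gram approach and not 
Tika's charset detector: the class and method names are hypothetical, and the 
caller supplies the candidate list and its order.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch: pick the first candidate that decodes a sample cleanly. */
final class CharsetTrialSketch {

    static Optional<Charset> firstStrictMatch(byte[] sample, List<Charset> candidates) {
        for (Charset cs : candidates) {
            try {
                // REPORT makes the decoder throw instead of substituting U+FFFD.
                cs.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(sample));
                return Optional.of(cs);
            } catch (CharacterCodingException e) {
                // This candidate cannot represent the sample; try the next one.
            }
        }
        return Optional.empty();
    }
}
```

Most single-byte codepages accept nearly any byte sequence, so this only rules 
out candidates such as US-ASCII or UTF-8; choosing between legacy codepages 
would still need the statistical approach described above.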



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-01-22 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14287980#comment-14287980
 ] 

Tim Allison commented on TIKA-1513:
---

[~iryndin], on codepage detection in dbf...in one of the specs I read, it looks 
like there is a byte in the header that may or may not be set that specifies 
the codepage for the table.  Are you, by chance, parsing that?

If we wanted to integrate our charset detector, would we call getBytes() on the 
first X DbfRecords, run those through our detector and then reprocess the 
stream with that charset?

I installed OpenOffice so that I could create test dbf documents, but the 
results have been pretty poor.



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-01-16 Thread Ivan Ryndin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14280211#comment-14280211
 ] 

Ivan Ryndin commented on TIKA-1513:
---

Hi guys! 
I've started working on pushing jdbf to Maven Central. 
I think this will take me 1-2 weeks: review the code once more, create 
javadocs, update the POM file according to the instructions. 
I'll drop a note here when it is ready. 



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-01-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14280367#comment-14280367
 ] 

Tim Allison commented on TIKA-1513:
---

[~iryndin], No rush on our side (well, at least mine). :)  I look forward to 
testing jdbf and potentially integrating it into Tika.  What's your level of 
interest in ongoing support?  Thank you for pushing it into the public Maven 
repo!



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-01-16 Thread Ivan Ryndin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14280391#comment-14280391
 ] 

Ivan Ryndin commented on TIKA-1513:
---

Well, I plan ongoing support of JDBF, though I left the project it was 
written for (a Linux-hosted Java webapp that needed to read/write DBFs as 
one of its exchange formats).

What do you mean by rushing onto your side? ;-) Are you inviting me to work on 
some TIKA issues? 





[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-01-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14280398#comment-14280398
 ] 

Tim Allison commented on TIKA-1513:
---

Great!  Well, yes, we're always looking to build the community.  Please join 
in!  

What I meant, though, was that I probably won't get to this for a few weeks 
myself, so your estimate of 1-2 weeks is great.

Thank you, again!



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-01-16 Thread Ivan Ryndin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14280401#comment-14280401
 ] 

Ivan Ryndin commented on TIKA-1513:
---

Well, okay, let my first job for the TIKA project be pushing JDBF to Maven 
Central, and then let's discuss my further steps. 



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-01-15 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279008#comment-14279008
 ] 

Tim Allison commented on TIKA-1513:
---

Thank you, [~lfcnassif] and [~gagravarr]!

I think I'll work on the sqlite parser integration first and then turn to 
this...maybe this will be in maven by then? :)



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-01-15 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278678#comment-14278678
 ] 

Nick Burch commented on TIKA-1513:
--

If it's the project themselves pushing it to central, then the docs to follow 
are http://central.sonatype.org/pages/ossrh-guide.html and 
http://central.sonatype.org/pages/apache-maven.html#performing-a-release-deployment-with-the-maven-release-plugin



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-01-15 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278674#comment-14278674
 ] 

Luis Filipe Nassif commented on TIKA-1513:
--

I talked to iryndin and he liked the idea of pushing jdbf to Maven Central. Can 
someone with experience in that help him?



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-01-14 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276916#comment-14276916
 ] 

Tim Allison commented on TIKA-1513:
---

From a brochure-level evaluation :), I'd prefer jdbf.  If we want to carry out 
an evaluation on the 2600 govdocs1 files, we'll have to implement wrappers for 
both.  I propose the following:

1) I'll build a parser with jamel.  If basic functional tests look decent on 
the 2600 files from govdocs1, I'll commit that to Tika, and we'll have basic 
mavenized dbf support.

2) After that I'll build a parser with jdbf and we can compare output on the 
govdocs1 files.  If jdbf results are equal or better, we can try to persuade 
iryndin to push to maven.

??



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-01-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275553#comment-14275553
 ] 

Tim Allison commented on TIKA-1513:
---

Any interest in encouraging iryndin to push to maven?



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-01-13 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275636#comment-14275636
 ] 

Luis Filipe Nassif commented on TIKA-1513:
--

I can if the community thinks that jdbf is a better option.



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-01-13 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275542#comment-14275542
 ] 

Luis Filipe Nassif commented on TIKA-1513:
--

I have found https://github.com/iryndin/jdbf. It seems to be more active and to 
support more field types (memo, picture, etc.) and more dbf formats.
Its pom file declares the Apache v2 license, but I could not find it on Maven.



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-01-13 Thread Konstantin Gribov (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275679#comment-14275679
 ] 

Konstantin Gribov commented on TIKA-1513:
-

[~talli...@mitre.org], I think it's a good idea, even if Tika won't use it as a 
dependency.



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-01-13 Thread Konstantin Gribov (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275467#comment-14275467
 ] 

Konstantin Gribov commented on TIKA-1513:
-

Is this lib alive? The last commits were in mid-2014, some issues are from late 
2013, and the PRs are from mid-2014.

At the least, extensive testing is needed. Do we have any freely accessible dbf 
files to use in tests?



[jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files

2015-01-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275472#comment-14275472
 ] 

Tim Allison commented on TIKA-1513:
---

I share your concern.  There are ~2600 .dbase3 files in govdocs1.
