Hi Nick,
As I commented on TIKA-2419, I fixed the original issue of eml/emlx being
detected as html locally by increasing the magic priority of eml/emlx
instead of decreasing the html priority. Maybe that is an alternative to
dropping the xml priority in the future, but it can impact other things too.
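To illustrate the trade-off, here is a toy sketch in Java of priority-based
magic matching. It is not Tika's detector: the class MagicPriorityDemo, the
simplified patterns ("<html", "Subject:") and the priority numbers are all
assumptions made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class MagicPriorityDemo {

    // Hypothetical rule: a MIME type, a priority, and a marker string
    // looked for in the first characters of the document.
    static final class MagicRule {
        final String mimeType;
        final int priority;
        final String pattern;

        MagicRule(String mimeType, int priority, String pattern) {
            this.mimeType = mimeType;
            this.priority = priority;
            this.pattern = pattern;
        }
    }

    // Among all rules whose pattern occurs in the head, the one with the
    // highest priority wins; on a tie the rule listed first is kept.
    static String detect(List<MagicRule> rules, String head) {
        MagicRule best = null;
        for (MagicRule rule : rules) {
            boolean matches = head.contains(rule.pattern);
            if (matches && (best == null || rule.priority > best.priority)) {
                best = rule;
            }
        }
        return best == null ? "application/octet-stream" : best.mimeType;
    }

    // Detects with a fixed (illustrative) html priority of 50 and a
    // caller-supplied eml priority.
    static String detectWithEmlPriority(int emlPriority, String head) {
        List<MagicRule> rules = new ArrayList<>();
        rules.add(new MagicRule("text/html", 50, "<html"));
        rules.add(new MagicRule("message/rfc822", emlPriority, "Subject:"));
        return detect(rules, head);
    }

    public static void main(String[] args) {
        // An email whose body contains HTML, so both rules match.
        String eml = "From: a@example.org\nSubject: hi\n\n<html><body>hello</body></html>";
        System.out.println(detectWithEmlPriority(40, eml)); // text/html
        System.out.println(detectWithEmlPriority(60, eml)); // message/rfc822
    }
}
```

Raising the eml priority leaves plain html files untouched in this toy model,
since only the html rule matches them. Any other format that also matches the
eml markers could now lose to eml, which is the "impact other things" risk.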
Totally agree, thank you Common Crawl for running Tika!
On 7/5/17, 5:09 AM, "Allison, Timothy B." wrote:
This is FANTASTIC!!! Thank you, Sebastian!
I suspect that we should try to fix these at the Tika level. We'll never
be 100%, but most of the problems you describe _should_ b
Having taken a "quick" look over lunch at some of the "programming
language" ones, and gone down a rabbit hole... I think at least some of
them are as described in TIKA-2419, where our change to the HTML magic
priority, made to fix detection of HTML-containing formats like email, had
broken some things.
I'v
Why, yes, please! JIRA with small samples would be fantastic. I think working
in descending order, from most common to least, would be best...php, asp,
coldfusion.
I'm about to cut 1.16, but I look forward to improving Tika with this
tremendously useful data.
Again, many thanks!
Cheers,
Ti
Hi Tim,
thanks! Let me know if I should take any actions (e.g., open issue(s) on Jira)
or whether I can help by compiling smaller test sets.
Best,
Sebastian
On 07/05/2017 02:09 PM, Allison, Timothy B. wrote:
> This is FANTASTIC!!! Thank you, Sebastian!
>
> I suspect that we should try to fix t
This is FANTASTIC!!! Thank you, Sebastian!
I suspect that we should try to fix these at the Tika level. We'll never be
100%, but most of the problems you describe _should_ be fixable.
> If anyone is interested in using the detected MIME types or anything else
> from Common Crawl - I'm happy