Re: Tika content detection and crawled "remote" content

2017-07-05 Thread Luís Filipe Nassif
Hi Nick, As commented on TIKA-2419, I fixed the original issue of eml/emlx being detected as html locally by increasing the magic priority of eml/emlx instead of decreasing the html priority. Maybe that is an alternative to dropping the xml priority in the future, but it can impact other things too.
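
To illustrate the priority trade-off discussed in this message, here is a toy sketch in Python. It is not Tika's implementation: the MagicRule class, the detect function, the byte patterns, and the priority numbers are all hypothetical and chosen only to show how raising one type's magic priority changes which match wins when a file matches several rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MagicRule:
    """Hypothetical, simplified magic rule (not Tika's data model)."""
    mime_type: str
    priority: int
    pattern: bytes      # byte sequence that must occur near the start
    window: int = 1024  # how many leading bytes are searched


def detect(data: bytes, rules) -> str:
    """Return the MIME type of the highest-priority matching rule."""
    matches = [r for r in rules if r.pattern in data[:r.window]]
    if not matches:
        return "application/octet-stream"
    # On equal priority, max() keeps the first rule in list order.
    return max(matches, key=lambda r: r.priority).mime_type


def rules_with_eml_priority(eml_priority: int):
    """Build a rule set where only the email priority varies (values are made up)."""
    return [
        MagicRule("text/html", 50, b"<html"),
        MagicRule("message/rfc822", eml_priority, b"Received:"),
    ]


# An email whose body contains HTML matches both rules.
EMAIL_WITH_HTML = (
    b"Received: from mail.example.org\r\n"
    b"Subject: hi\r\n\r\n"
    b"<html><body>hello</body></html>"
)

if __name__ == "__main__":
    print(detect(EMAIL_WITH_HTML, rules_with_eml_priority(40)))
    print(detect(EMAIL_WITH_HTML, rules_with_eml_priority(60)))
```

With the email rule below the HTML rule, the sample is reported as text/html; raising the email priority above it flips the result to message/rfc822. The same mechanism is why such a change "can impact other things too": any other input that matches both patterns also changes its detected type.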

Re: Tika content detection and crawled "remote" content

2017-07-05 Thread Chris Mattmann
Totally agree, thank you Common Crawl for running Tika! On 7/5/17, 5:09 AM, "Allison, Timothy B." wrote: This is FANTASTIC!!! Thank you, Sebastian! I suspect that we should try to fix these at the Tika level. We'll never be 100%, but most of the problems you describe _should_ b

Re: Tika content detection and crawled "remote" content

2017-07-05 Thread Nick Burch
Having taken a "quick" look over lunch at some of the "programming language" ones, and gone down a rabbit hole... I think at least some of them are as described in TIKA-2419, where our change to the HTML magic priority to fix detection for HTML-containing formats like email had broken some things. I'v

RE: Tika content detection and crawled "remote" content

2017-07-05 Thread Allison, Timothy B.
Why, yes, please! JIRA with small samples would be fantastic. I think working in desc order of most common to least would be best...php, asp, coldfusion. I'm about to cut 1.16, but I look forward to improving Tika with this tremendously useful data. Again, many thanks! Cheers, Ti

Re: Tika content detection and crawled "remote" content

2017-07-05 Thread Sebastian Nagel
Hi Tim, thanks! Let me know if I should take any actions (e.g., open issue(s) on Jira) or whether I can help by compiling smaller test sets. Best, Sebastian On 07/05/2017 02:09 PM, Allison, Timothy B. wrote: > This is FANTASTIC!!! Thank you, Sebastian! > > I suspect that we should try to fix t

RE: Tika content detection and crawled "remote" content

2017-07-05 Thread Allison, Timothy B.
This is FANTASTIC!!! Thank you, Sebastian! I suspect that we should try to fix these at the Tika level. We'll never be 100%, but most of the problems you describe _should_ be fixable. > If anyone is interested in using the detected MIME types or anything else > from Common Crawl - I'm happy