Hi,
a follow-up based on Tika 1.16 for the July crawl:
#         Tika-1.16          HTTP-Content-Type
4580525   text/x-php         text/html
 842698   text/x-coldfusion  text/html
 579128   text/asp           text/html
 510323
>which have a pretty heavy/messy dependency tree
You've seen our pom, right? We have you covered!
...
From: Jackson, Andy [mailto:andrew.jack...@bl.uk]
Sent: Friday, July 7, 2017 7:19 AM
To: user@tika.apache.org
Subject: Re: Tika content detection and crawled "remote" content
Should we add a WARC parser? ☺
From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com]
Sent: Friday, July 7, 2017 3:43 AM
To: user@tika.apache.org
Subject: Re: Tika content detection and crawled "remote" content
Is anyone aware of a tool to run Tika on a WARC file? Everythin
s!
>>
>> Cheers,
>>
>> Tim
>>
>> -----Original Message-----
>> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
>> Sent: Wednesday, July 5, 2017 9:03 AM
>> To: user@tika.apache.org
>> Subject: Re: Tika content detection and crawled "remote" content
>>
>>
Sent: Tuesday, July 4, 2017 6:18 AM
To: user@tika.apache.org
Subject: Tika content detection and crawled "remote" content
Hi,
recently I've plugged in Tika's content detection into Common Crawl's
crawler (modified Nutch) with the target to get clean and correct
l data.
Again, many thanks!
Cheers,
Tim
-----Original Message-----
From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
Sent: Wednesday, July 5, 2017 9:03 AM
To: user@tika.apache.org
Subject: Re: Tika content detection and crawled "remote" content
Hi Tim,
thanks! Le
> -----Original Message-----
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> Sent: Tuesday, July 4, 2017 6:18 AM
> To: user@tika.apache.org
> Subject: Tika content detection and crawled "remote" content
>
> Hi,
>
> recently I've plugged in Tik
Hi,
recently I've plugged Tika's content detection into Common Crawl's crawler
(a modified Nutch) with the goal of getting a clean and correct MIME type:
the HTTP Content-Type may contain garbage and isn't always correct [1].
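
To make the comparison concrete, here is a minimal sketch in Python of how a declared HTTP Content-Type could be cleaned up and tallied against a detected type. It is not the code used in the crawler: the function names `normalize_content_type` and `tally_pairs` and the sample records are made up for illustration, and the detected type is simply passed in as a string rather than computed by Tika.

```python
from collections import Counter


def normalize_content_type(header_value):
    """Reduce a raw Content-Type header to a bare lowercase type/subtype.

    Returns None when the value is empty or not shaped like type/subtype.
    (Hypothetical helper, not part of Tika or Nutch.)
    """
    if not header_value:
        return None
    # Drop parameters such as "; charset=utf-8", then trim whitespace and quotes.
    media_type = header_value.split(";", 1)[0].strip().strip("\"'").lower()
    parts = media_type.split("/")
    # Require exactly one slash, non-empty halves, and no embedded whitespace.
    if len(parts) != 2 or not all(parts) or any(c.isspace() for c in media_type):
        return None
    return media_type


def tally_pairs(records):
    """Count (detected type, normalized declared type) pairs.

    `records` is an iterable of (detected_type, raw_header) tuples.
    """
    counts = Counter()
    for detected, raw_header in records:
        counts[(detected, normalize_content_type(raw_header))] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        ("text/x-php", "text/html; charset=UTF-8"),
        ("text/x-php", "Text/HTML"),
        ("text/html", "garbage"),
    ]
    for (detected, declared), n in tally_pairs(sample).most_common():
        print(n, detected, declared)
```

Unparseable header values are mapped to None here so that they end up in a bucket of their own instead of being counted as a mismatch against some real type.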
For the June 2017 crawl I've prepared a comparison of content types