Re: Tika content detection and crawled "remote" content

2017-08-10 Thread Sebastian Nagel
Hi, a follow up based on Tika 1.16 for the July crawl:
      # Tika-1.16          HTTP-Content-Type
4580525 text/x-php         text/html
 842698 text/x-coldfusion  text/html
 579128 text/asp           text/html
 510323

RE: Tika content detection and crawled "remote" content

2017-07-07 Thread Allison, Timothy B.
>which have a pretty heavy/messy dependency tree You've seen our pom, right? We have you covered! ... * * * ... From: Jackson, Andy [mailto:andrew.jack...@bl.uk] Sent: Friday, July 7, 2017 7:19 AM To: user@tika.apache.org Subject: Re: Tika content detection and crawled "remote&

RE: Tika content detection and crawled "remote" content

2017-07-07 Thread Allison, Timothy B.
Should we add a WARC parser? ☺ From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com] Sent: Friday, July 7, 2017 3:43 AM To: user@tika.apache.org Subject: Re: Tika content detection and crawled "remote" content Is anyone aware of a tool to run Tika on a WARC file? Everythin

Re: Tika content detection and crawled "remote" content

2017-07-06 Thread Sebastian Nagel
s! >> >> Cheers, >> >> Tim >> >> -----Original Message- >> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com] >> Sent: Wednesday, July 5, 2017 9:03 AM >> To: user@tika.apache.org >> Subject: Re: Tika content detect

Re: Tika content detection and crawled "remote" content

2017-07-05 Thread Luís Filipe Nassif
Tim >> >> -----Original Message- >> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com] >> Sent: Wednesday, July 5, 2017 9:03 AM >> To: user@tika.apache.org >> Subject: Re: Tika content detection and crawled "remote" content >> >&

Re: Tika content detection and crawled "remote" content

2017-07-05 Thread Chris Mattmann
t: Tuesday, July 4, 2017 6:18 AM To: user@tika.apache.org Subject: Tika content detection and crawled "remote" content Hi, recently I've plugged in Tika's content detection into Common Crawl's crawler (modified Nutch) with the target to get clean and correct

Re: Tika content detection and crawled "remote" content

2017-07-05 Thread Nick Burch
l data. Again, many thanks! Cheers, Tim -Original Message- From: Sebastian Nagel [mailto:wastl.na...@googlemail.com] Sent: Wednesday, July 5, 2017 9:03 AM To: user@tika.apache.org Subject: Re: Tika content detection and crawled "remote" content Hi Tim, thanks! Le

Re: Tika content detection and crawled "remote" content

2017-07-05 Thread Sebastian Nagel
; > -Original Message- > From: Sebastian Nagel [mailto:wastl.na...@googlemail.com] > Sent: Tuesday, July 4, 2017 6:18 AM > To: user@tika.apache.org > Subject: Tika content detection and crawled "remote" content > > Hi, > > recently I've plugged in Tik

Tika content detection and crawled "remote" content

2017-07-04 Thread Sebastian Nagel
Hi, recently I've plugged in Tika's content detection into Common Crawl's crawler (modified Nutch) with the target to get clean and correct MIME type - the HTTP Content-Type may contain garbage and isn't always correct [1]. For the June 2017 crawl I've prepared a comparison of content types
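A comparison of this kind could be tabulated roughly as in the sketch below. It is only an illustration under stated assumptions: the function names `normalize_content_type` and `tabulate` are hypothetical, the "detected" values are hand-written stand-ins rather than real Tika output, and the clean-up of the HTTP header (drop parameters, trim, lowercase) is one plausible choice, not necessarily what the crawler did.

```python
from collections import Counter


def normalize_content_type(header_value):
    """Reduce a raw HTTP Content-Type header to a bare lowercase media type.

    Hypothetical helper: the original message does not say how the
    header was cleaned before comparison.
    """
    if not header_value:
        return "(none)"
    # Drop parameters such as "; charset=utf-8", then trim and lowercase.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type or "(none)"


def tabulate(pairs):
    """Count (detected type, declared type) combinations.

    `pairs` yields (detected, raw_header) tuples; `detected` is assumed
    to come from a content detector such as Tika.
    """
    counts = Counter()
    for detected, raw_header in pairs:
        counts[(detected, normalize_content_type(raw_header))] += 1
    return counts


# Made-up sample records, for illustration only.
sample = [
    ("text/x-php", "text/html; charset=UTF-8"),
    ("text/x-php", "Text/HTML"),
    ("text/asp", "text/html"),
    ("text/html", None),
]

# Print one row per combination, most frequent first.
for (detected, declared), n in tabulate(sample).most_common():
    print(f"{n:>8} {detected:<20} {declared}")
```

Normalizing the header before counting keeps variants such as `Text/HTML` and `text/html; charset=UTF-8` in the same row, so the table shows disagreements between detection and declaration rather than differences in spelling.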