I haven't had the time to dig into the problem: neither how to
pass a tika-config file, nor why parse-html actually detects the
encoding correctly although it also looks only at the first 8192
characters (see CHUNK_SIZE).

Just one point: for the MIME detection we also
pass the Content-Type sent by the web server to Tika.
Could it also help to pass it as an additional clue
for the encoding detection?
In the concrete example the server sends
  Content-Type: text/html; charset=utf-8
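
A minimal Java sketch of that idea, using only the JDK. It is an
illustration under assumptions, not Nutch or Tika code: the class
ContentTypeCharsetHint and its methods fromContentType and decode are
hypothetical names. It reads the charset parameter from the header value
and uses it as a hint, falling back to a default charset only when the
header gives none.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Nutch or Tika.
class ContentTypeCharsetHint {

    // Matches the charset parameter, e.g. in "text/html; charset=utf-8".
    private static final Pattern CHARSET_PARAM =
        Pattern.compile("(?i);\\s*charset\\s*=\\s*\"?([^\\s;\"]+)\"?");

    /** Returns the charset named in the header value, if present and supported. */
    static Optional<Charset> fromContentType(String headerValue) {
        if (headerValue == null) {
            return Optional.empty();
        }
        Matcher m = CHARSET_PARAM.matcher(headerValue);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(m.group(1)));
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name: ignore the hint.
            return Optional.empty();
        }
    }

    /** Decodes with the header charset when available, else with the fallback. */
    static String decode(byte[] content, String headerValue, Charset fallback) {
        return new String(content, fromContentType(headerValue).orElse(fallback));
    }

    public static void main(String[] args) {
        // UTF-8 bytes of "få", written with a Unicode escape.
        byte[] bytes = "f\u00e5".getBytes(StandardCharsets.UTF_8);
        Charset cp1252 = Charset.forName("windows-1252");
        // With the server's header, the hint wins over the fallback.
        System.out.println(decode(bytes, "text/html; charset=utf-8", cp1252));
        // Without a charset parameter, the windows-1252 fallback is used.
        System.out.println(decode(bytes, "text/html", cp1252));
    }
}
```

With the header from the concrete example, decode returns the original
two-character string; without a charset parameter it falls back to
windows-1252 and returns the three-character mojibake reported at the
start of this thread.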

Sebastian

On 11/01/2017 07:06 PM, Markus Jelsma wrote:
> Any ideas?
> 
> Thanks!
> 
>  
>  
> -----Original message-----
>> From:Markus Jelsma <[email protected]>
>> Sent: Tuesday 31st October 2017 13:14
>> To: User <[email protected]>
>> Subject: FW: Incorrect encoding detected
>>
>> I actually don't know, can we specify a tika-config file in Nutch?
>>
>> Thanks,
>> Markus
>>  
>> -----Original message-----
>>> From:Allison, Timothy B. <[email protected]>
>>> Sent: Tuesday 31st October 2017 13:11
>>> To: [email protected]
>>> Subject: RE: Incorrect encoding detected
>>>
>>> For 1.17, the simplest solution, I think, is to allow users to configure 
>>> extending the detection limit via our @Field config methods, that is, via 
>>> tika-config.xml.
>>>
>>> To confirm, Nutch will allow users to specify a tika-config file?  Will 
>>> this work for you and Nutch?
>>>
>>> -----Original Message-----
>>> From: Markus Jelsma [mailto:[email protected]] 
>>> Sent: Tuesday, October 31, 2017 5:47 AM
>>> To: [email protected]
>>> Subject: RE: Incorrect encoding detected
>>>
>>> Hello Timothy - what would be your preferred solution? Increase detection 
>>> limit or skip inline styles and possibly other useless head information?
>>>
>>> Thanks,
>>> Markus
>>>
>>>  
>>>  
>>> -----Original message-----
>>>> From:Markus Jelsma <[email protected]>
>>>> Sent: Friday 27th October 2017 15:37
>>>> To: [email protected]
>>>> Subject: RE: Incorrect encoding detected
>>>>
>>>> Hi Tim,
>>>>
>>>> I have opened TIKA-2485 to track the problem. 
>>>>
>>>> Thank you very very much!
>>>> Markus
>>>>
>>>>  
>>>>  
>>>> -----Original message-----
>>>>> From:Allison, Timothy B. <[email protected]>
>>>>> Sent: Friday 27th October 2017 15:33
>>>>> To: [email protected]
>>>>> Subject: RE: Incorrect encoding detected
>>>>>
>>>>> Unfortunately there is no way to do this now.  _I think_ we could make 
>>>>> this configurable, though, fairly easily.  Please open a ticket.
>>>>>
>>>>> The next RC for PDFBox might be out next week, and we'll try to release 
>>>>> Tika 1.17 shortly after that...so there should be time to get this in.
>>>>>
>>>>> -----Original Message-----
>>>>> From: Markus Jelsma [mailto:[email protected]] 
>>>>> Sent: Friday, October 27, 2017 9:12 AM
>>>>> To: [email protected]
>>>>> Subject: RE: Incorrect encoding detected
>>>>>
>>>>> Hello Tim,
>>>>>
>>>>> Getting rid of script and style contents sounds plausible indeed. But to 
>>>>> work around the problem for now, can I instruct HTMLEncodingDetector from 
>>>>> within Nutch to look beyond the limit?
>>>>>
>>>>> Thanks!
>>>>> Markus
>>>>>
>>>>>  
>>>>>  
>>>>> -----Original message-----
>>>>>> From:Allison, Timothy B. <[email protected]>
>>>>>> Sent: Friday 27th October 2017 14:53
>>>>>> To: [email protected]
>>>>>> Subject: RE: Incorrect encoding detected
>>>>>>
>>>>>> Hi Markus,
>>>>>>   
>>>>>> My guess is that the ~32,000 characters of mostly ascii-ish <script/> 
>>>>>> are what is actually being used for encoding detection.  The 
>>>>>> HTMLEncodingDetector only looks in the first 8,192 characters, and the 
>>>>>> other encoding detectors have similar (but longer?) restrictions.
>>>>>>  
>>>>>> At some point, I had a dev version of a stripper that removed contents 
>>>>>> of <script/> and <style/> before trying to detect the 
>>>>>> encoding[0]...perhaps it is time to resurrect that code and integrate it?
>>>>>>
>>>>>> Or, given that HTML has been, um, blossoming, perhaps, more simply, we 
>>>>>> should expand how far we look into a stream for detection?
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>>                Tim
>>>>>>
>>>>>> [0] https://issues.apache.org/jira/browse/TIKA-2038
>>>>>>    
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Markus Jelsma [mailto:[email protected]] 
>>>>>> Sent: Friday, October 27, 2017 8:39 AM
>>>>>> To: [email protected]
>>>>>> Subject: Incorrect encoding detected
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> We have a problem with Tika, encoding and pages on this website: 
>>>>>> https://www.aarstiderne.com/frugt-groent-og-mere/mixkasser
>>>>>>
>>>>>> Using Nutch and Tika 1.12, but also using Tika 1.16, we found out that 
>>>>>> the regular HTML parser does a fine job, but our TikaParser has a tough 
>>>>>> job dealing with this HTML. For some reason Tika thinks 
>>>>>> Content-Encoding=windows-1252 is what this webpage says it is, whereas 
>>>>>> the page identifies itself properly as UTF-8.
>>>>>>
>>>>>> Of all websites we index, this is so far the only one giving trouble 
>>>>>> indexing accents, getting fÃ¥ instead of a regular få.
>>>>>>
>>>>>> Any tips to spare? 
>>>>>>
>>>>>> Many many thanks!
>>>>>> Markus
>>>>>>
>>>>>
>>>>
>>>
>>
