[ 
https://issues.apache.org/jira/browse/TIKA-357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801829#action_12801829
 ] 

Ken Krugler commented on TIKA-357:
----------------------------------

Chris - yes, I was using HtmlParser directly in my Bixo web crawling code.

I'm attaching an additional patch that adds a test case to HtmlParserTest (and 
a test file). It also improves the regex used to find the meta tag, so that it 
copes better with broken HTML. You'll need to apply this on top of the first 
patch, and also add the attached file (big-preamble.html) to the 
src/test/resources/test-documents/ directory.

See https://issues.apache.org/jira/browse/TIKA-332 for the original issue 
with using meta http-equiv tags for charset detection.
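
To make the buffer-size problem concrete, here is a minimal, self-contained 
sketch of this kind of sniffing. It is not the code from the patch: the class 
name MetaCharsetSniffer, the sniff() method and the regex are all made up for 
illustration, and Tika's real pattern and buffer handling may differ. It reads 
at most `limit` bytes and looks for a charset inside a meta tag, so a page 
with a large script preamble is missed at 4K but found at 8K.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not Tika's HtmlParser implementation.
class MetaCharsetSniffer {

    // Assumed pattern: case-insensitive, tolerates attributes before
    // "charset" and a missing or stray quote around the value.
    private static final Pattern META_CHARSET = Pattern.compile(
            "(?is)<meta\\s+[^>]*?charset\\s*=\\s*[\"']?([a-z0-9_.:-]+)");

    // Reads up to 'limit' bytes and returns the charset name, or null.
    static String sniff(InputStream in, int limit) throws IOException {
        byte[] buf = new byte[limit];
        int total = 0;
        while (total < limit) {
            int n = in.read(buf, total, limit - total);
            if (n < 0) {
                break; // end of stream before the limit was reached
            }
            total += n;
        }
        // ISO-8859-1 maps every byte to a char, so decoding cannot fail
        // and ASCII markup such as "<meta" is preserved as-is.
        String head = new String(buf, 0, total, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) throws IOException {
        // Build a page whose meta tag sits after about 5000 bytes of script.
        StringBuilder page = new StringBuilder("<html><head><script>");
        for (int i = 0; i < 5000; i++) {
            page.append('x');
        }
        page.append("</script><meta http-equiv=\"Content-Type\" ")
            .append("content=\"text/html; charset=windows-1251\">")
            .append("</head></html>");
        byte[] bytes = page.toString().getBytes(StandardCharsets.US_ASCII);

        System.out.println(sniff(new ByteArrayInputStream(bytes), 4096)); // null
        System.out.println(sniff(new ByteArrayInputStream(bytes), 8192)); // windows-1251
    }
}
```

A fixed limit is always a trade-off: a larger buffer catches more pages with 
big preambles, at the cost of reading more of every document before parsing.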

One odd thing is that HtmlParserTest currently fails for me, not in anything 
I've touched, but in the testXhtmlParsing() test: the auto-detected MIME type 
comes back as text/html, not application/xhtml+xml. This is with the most 
recent trunk (via the Git repo).


> Increase buffer size for meta tag sniffing
> ------------------------------------------
>
>                 Key: TIKA-357
>                 URL: https://issues.apache.org/jira/browse/TIKA-357
>             Project: Tika
>          Issue Type: Improvement
>    Affects Versions: 0.5
>            Reporter: Ken Krugler
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 0.6
>
>         Attachments: big-preamble.html, makler.html, TIKA-357-2.patch, 
> TIKA-357.patch
>
>
> Some web pages (such as makler.su, see attached) have lots of script data 
> before the body of the HTML.
> When this happens, the sniffing code fails to find the charset info in the 
> meta tag, because it currently only sniffs the first 4K.
> Bumping it to 8K would cover all of the cases that I (Ken) have seen during a 
> test crawl.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
