[jira] Updated: (NUTCH-369) StringUtil.resolveEncodingAlias is unuseful.
[ https://issues.apache.org/jira/browse/NUTCH-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Renaud Richardet updated NUTCH-369:
-----------------------------------

    Attachment: remover.diff

Just FYI, you can further filter which elements neko should keep and remove. See the patch for an example, and
http://people.apache.org/~andyc/neko/doc/html/settings.html

> StringUtil.resolveEncodingAlias is unuseful.
> --------------------------------------------
>
>                 Key: NUTCH-369
>                 URL: https://issues.apache.org/jira/browse/NUTCH-369
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.9.0
>         Environment: all
>            Reporter: King Kong
>            Priority: Minor
>         Attachments: patch.diff, remover.diff
>
>
> After we define an encoding alias map in StringUtil, HTML parsing still uses the original encoding.
> I found that nekohtml, which HtmlParser uses, reads the charset from the meta tag.
> We can set its feature
> "http://cyberneko.org/html/features/scanner/ignore-specified-charset" to true
> so that nekohtml uses the encoding we set.
> Concretely:
>   private DocumentFragment parseNeko(InputSource input) throws Exception {
>     DOMFragmentParser parser = new DOMFragmentParser();
>     // some plugins, e.g., creativecommons, need to examine html comments
>     try {
> +     parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset", true);
>       parser.setFeature("http://apache.org/xml/features/include-comments",
>                         true);
>
> BTW, it must be added at the front of the try block, because the following statement
> (parser.setFeature("http://apache.org/xml/features/include-comments", true);)
> will throw an exception.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
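The ordering point in the description (the new setFeature call has to come first inside the try block) can be sketched with plain JDK code. This is only an illustration under stated assumptions: StubParser is a hypothetical stand-in, not NekoHTML's DOMFragmentParser, and it is assumed to accept the charset feature and to reject include-comments, as the report says the real parser does.

```java
import java.util.ArrayList;
import java.util.List;

public class FeatureOrderSketch {

    static final String IGNORE_CHARSET =
        "http://cyberneko.org/html/features/scanner/ignore-specified-charset";
    static final String INCLUDE_COMMENTS =
        "http://apache.org/xml/features/include-comments";

    /** Hypothetical stand-in for a parser; NOT NekoHTML's DOMFragmentParser. */
    static class StubParser {
        final List<String> enabled = new ArrayList<>();

        void setFeature(String name, boolean value) {
            // Assumption for illustration: only the charset feature is recognized.
            if (!IGNORE_CHARSET.equals(name)) {
                throw new IllegalArgumentException("unrecognized feature: " + name);
            }
            if (value) {
                enabled.add(name);
            }
        }
    }

    /** Sets the features in order inside one try block; returns those that took effect. */
    static List<String> configure(String... features) {
        StubParser parser = new StubParser();
        try {
            for (String feature : features) {
                parser.setFeature(feature, true);
            }
        } catch (IllegalArgumentException e) {
            // Once one call throws, the remaining calls in the try block never run.
        }
        return parser.enabled;
    }

    public static void main(String[] args) {
        // Charset feature first: it is applied before the failing call.
        System.out.println(configure(IGNORE_CHARSET, INCLUDE_COMMENTS).size()); // prints 1
        // Charset feature second: the earlier exception skips it.
        System.out.println(configure(INCLUDE_COMMENTS, IGNORE_CHARSET).size()); // prints 0
    }
}
```

With the charset feature placed after the failing call, it is silently never applied, which matches the behaviour the reporter describes.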
[jira] Updated: (NUTCH-369) StringUtil.resolveEncodingAlias is unuseful.
[ https://issues.apache.org/jira/browse/NUTCH-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Renaud Richardet updated NUTCH-369:
-----------------------------------

             Priority: Minor  (was: Major)
    Affects Version/s: 0.9.0  (was: 0.8)

> StringUtil.resolveEncodingAlias is unuseful.
> --------------------------------------------
>
>                 Key: NUTCH-369
>                 URL: https://issues.apache.org/jira/browse/NUTCH-369
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.9.0
>         Environment: all
>            Reporter: King Kong
>            Priority: Minor
>         Attachments: patch.diff
[jira] Updated: (NUTCH-369) StringUtil.resolveEncodingAlias is unuseful.
[ https://issues.apache.org/jira/browse/NUTCH-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Renaud Richardet updated NUTCH-369:
-----------------------------------

    Attachment: patch.diff

Unified diff against head.
- fixes encoding, as described by King Kong
- removes non-valid features
- fixes logging

> StringUtil.resolveEncodingAlias is unuseful.
> --------------------------------------------
>
>                 Key: NUTCH-369
>                 URL: https://issues.apache.org/jira/browse/NUTCH-369
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8
>         Environment: all
>            Reporter: King Kong
>         Attachments: patch.diff