Better parsed text
--
Key: NUTCH-624
URL: https://issues.apache.org/jira/browse/NUTCH-624
Project: Nutch
Issue Type: Improvement
Reporter: Vinci
I found that the text parsed by the default parser, Neko, is not easy to
Non-ascii character broken in dumped content for mixed encoding (utf-8 and
multi-byte)
--
Key: NUTCH-625
URL: https://issues.apache.org/jira/browse/NUTCH-625
[ https://issues.apache.org/jira/browse/NUTCH-624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinci updated NUTCH-624:
Description:
I found that the text parsed by the default parser, Neko, in the 1.0 nightly is
not easy to process - it just
[ https://issues.apache.org/jira/browse/NUTCH-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinci updated NUTCH-625:
Description:
If the crawl db contains both UTF-8 non-ASCII characters and non-UTF-8 non-ASCII
characters (i.e.