[jira] Created: (NUTCH-624) Better parsed text

2008-03-30 Thread Vinci (JIRA)
Better parsed text
Key: NUTCH-624
URL: https://issues.apache.org/jira/browse/NUTCH-624
Project: Nutch
Issue Type: Improvement
Reporter: Vinci

I found that the text parsed by the default parser, Neko, is not easy to

[jira] Created: (NUTCH-625) Non-ascii character broken in dumped content for mixed encoding (utf-8 and multi-byte)

2008-03-30 Thread Vinci (JIRA)
Non-ascii character broken in dumped content for mixed encoding (utf-8 and multi-byte)
Key: NUTCH-625
URL: https://issues.apache.org/jira/browse/NUTCH-625

[jira] Updated: (NUTCH-624) Better parsed text by default parser

2008-04-01 Thread Vinci (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinci updated NUTCH-624:

Description: I found that the text parsed by the default parser, Neko, in the 1.0 nightly is not easy to process - it just

[jira] Updated: (NUTCH-625) Non-ascii character broken in dumped content for mixed encoding (utf-8 and multi-byte)

2008-04-01 Thread Vinci (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinci updated NUTCH-625:

Description: If the crawl db contains both UTF-8 non-ASCII characters and non-UTF-8 non-ASCII characters (i.e.