[jira] Updated: (NUTCH-624) Better parsed text by default parser

2008-04-01 Thread Vinci (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinci updated NUTCH-624: Description: I found that the text parsed by the default parser, Neko, in the 1.0 nightly is not easy to process - it just

Re: [jira] Created: (NUTCH-624) Better parsed text

2008-04-01 Thread Vinci
Hi, Thank you for your feedback. The default parsed text dumped by the readseg utility is just the parsed text separated by spaces, which is not easy to process: I need to process the text sentence by sentence. However, in most of the pages I crawled, there is no full stop or comma at the end of

[jira] Updated: (NUTCH-625) Non-ascii character broken in dumped content for mixed encoding (utf-8 and multi-byte)

2008-04-01 Thread Vinci (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinci updated NUTCH-625: Description: If the crawl db contains both UTF-8 non-ASCII characters and non-UTF-8 non-ASCII characters (i.e.

Is there any LSI implementation?

2008-04-01 Thread Edward J. Yoon
Hi, I'm a newbie here. When I read Better Search with Apache Lucene and Solr, I found an LSI approach. Is there any LSI implementation? I'm interested in the problem of scalability and parallel matrix operations. Thanks. -- B. Regards, Edward J. Yoon