[Nutch-dev] Re: hi all

2005-07-11 Thread Jack Tang
Hi Bin, the simplest way is to create cjk-index-basic and cjk-query-basic plugins and replace index-basic and query-basic with them. This is quite simple: you can use CJKTokenizer and CJKAnalyzer from the Lucene project. Also take care of the query syntax characters in Nutch. // query syntax c
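
For readers unfamiliar with what a CJK tokenizer does, here is a minimal, self-contained sketch of the overlapping-bigram idea that Lucene's CJKTokenizer is based on. It is not the Lucene implementation and does not touch any Nutch plugin API. The class name `CjkBigrams`, the `isCjk` script check, and the choice to emit a lone CJK character as its own token are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class CjkBigrams {
    // Assumption: treat Han, Hiragana, Katakana, and Hangul code points as "CJK".
    static boolean isCjk(int cp) {
        Character.UnicodeScript s = Character.UnicodeScript.of(cp);
        return s == Character.UnicodeScript.HAN
            || s == Character.UnicodeScript.HIRAGANA
            || s == Character.UnicodeScript.KATAKANA
            || s == Character.UnicodeScript.HANGUL;
    }

    // Splits text into lowercase non-CJK words and overlapping CJK bigrams.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int[] cps = text.codePoints().toArray();
        int i = 0;
        while (i < cps.length) {
            if (isCjk(cps[i])) {
                int start = i;
                while (i < cps.length && isCjk(cps[i])) i++;
                if (i - start == 1) {
                    // A single isolated CJK character becomes its own token.
                    tokens.add(new String(cps, start, 1));
                } else {
                    // Each adjacent pair of characters in the run becomes a token.
                    for (int j = start; j + 1 < i; j++) {
                        tokens.add(new String(cps, j, 2));
                    }
                }
            } else if (Character.isLetterOrDigit(cps[i])) {
                int start = i;
                while (i < cps.length && Character.isLetterOrDigit(cps[i]) && !isCjk(cps[i])) i++;
                tokens.add(new String(cps, start, i - start).toLowerCase(Locale.ROOT));
            } else {
                i++; // skip whitespace and punctuation
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [nutch, 中文, 文搜, 搜索]
        System.out.println(tokenize("Nutch 中文搜索"));
    }
}
```

Whatever tokenization is used, it has to be the same at index time and at query time. That is presumably why the suggestion above replaces both the index and the query plugin. If the two sides tokenize differently, the query terms will not match the indexed terms and no results come back.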

[Nutch-dev] hi all

2005-07-11 Thread Bin Shi
Hi all; I have just followed Mr. Jack Tang's solution to adopt the CJK analyzer into Nutch 0.6. I know that solution is not perfect. In fact, I cannot get any results returned. Can anyone help me adopt the CJK analyzer into Nutch? Any response is greatly appreciated! Best Regards

[Nutch-dev] Re: Website Visualization Questions

2005-07-11 Thread Fredrik Andersson
Hi Nils! If I am not totally off track, the 0.7 version (currently 0.7-dev, in the CVS trunk) runs as a daemon process. I.e., it will poll the file with the URLs when it has nothing else to do, so that will solve your problem. Regarding the duplicate content, as you can see in the tutorial there i

[Nutch-dev] Re: [jira] Created: (NUTCH-70) duplicate pages - virtual hosts in db.

2005-07-11 Thread Piotr Kosiorowski
Hello Ferenc, If the pages are really identical they can be removed using the "nutch dedup" command. If not (sometimes such pages differ by some date, counter or advertisement) - currently there is no tool that makes it possible to remove them. I am working on a simple tool to remove duplicates lik
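
To illustrate the distinction drawn above between identical and nearly identical pages, here is a minimal sketch of exact-duplicate detection by content hash. It is not the code behind "nutch dedup". The class name `ExactDedup`, the use of MD5, and the example URLs are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

public class ExactDedup {
    // Hex-encoded MD5 of the page content (assumed hash choice for this sketch).
    static String md5Hex(String content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                .digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Maps each distinct content hash to the first URL seen with that content.
    static Map<String, String> dedup(Map<String, String> contentByUrl) {
        Map<String, String> firstUrlByHash = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : contentByUrl.entrySet()) {
            firstUrlByHash.putIfAbsent(md5Hex(e.getValue()), e.getKey());
        }
        return firstUrlByHash;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("http://a.example/", "<html>same page</html>");
        pages.put("http://www.a.example/", "<html>same page</html>");        // virtual-host duplicate
        pages.put("http://b.example/", "<html>same page, hits: 42</html>");  // differs by a counter
        // Two of the three URLs survive: the counter changes the hash.
        System.out.println(dedup(pages).values());
    }
}
```

A hash comparison like this only catches byte-identical content. A page that differs by a date, counter, or advertisement hashes differently, so removing it needs some form of near-duplicate comparison.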

[Nutch-dev] confirm unsubscribe from nutch-dev@lucene.apache.org

2005-07-11 Thread nutch-dev-help
Hi! This is the ezmlm program. I'm managing the nutch-dev@lucene.apache.org mailing list. I'm working for my owner, who can be reached at [EMAIL PROTECTED] To confirm that you would like nutch-developers@lists.sourceforge.net removed from the nutch-dev mailing list, please send an empty repl

[Nutch-dev] Re: Website Visualization Questions

2005-07-11 Thread Nils Höller
Hi Fredrik, thanks for that information. That sounds really good to me. I mean it would be perfect to handle just one product instead of different ones for every single task. Anyway, can you tell me if it is possible that Clients will insert their "ask for a url" into a url list, out of which

[Nutch-dev] Re: Website Visualization Questions

2005-07-11 Thread Fredrik Andersson
Hi! The crawler and link-structure information comes "free" with Nutch. Once you have crawled a site, you can use the WebDBReader class to extract the link information for further processing in a visualization step. Simply put: Iterate crawled pages with the SegmentReader class (open the segment y

[Nutch-dev] Website Visualization Questions

2005-07-11 Thread Nils Hoeller
Hi, I'm actually working on a "service" that gives you the ability to enter a URL and visualizes this domain (only inner links). Then there'll be some kind of adaptive behaviour so that the graph will be adapted to your wishes (searches, ranks ...) I have a prototype that uses: 1. Arachnid as a

[Nutch-dev] Possible race condition while loading plugins

2005-07-11 Thread Diego Basch
I'm seeing the following behavior, already reported in the past. The scenario that causes it is one where the nutch application is starting up and we have queries coming in already. in catalina.out: 050709 132712 11 query request from 211.30.2.xx 050709 132712 12 query request from 82.58.45.xx

[Nutch-dev] [jira] Created: (NUTCH-70) duplicate pages - virtual hosts in db.

2005-07-11 Thread JIRA
duplicate pages - virtual hosts in db. -- Key: NUTCH-70 URL: http://issues.apache.org/jira/browse/NUTCH-70 Project: Nutch Type: Bug Environment: 0,7 dev Reporter: Lutischán Ferenc Dear Developers, I have a problem with nutc