Hi Bin
The simplest way is to create cjk-index-basic and cjk-query-basic plugins
and replace index-basic and query-basic with them. Writing them
is quite simple: you can use CJKTokenizer and CJKAnalyzer from the Lucene
project. Also take care with the query syntax characters in Nutch.
// query syntax c
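For readers unfamiliar with what a CJK tokenizer does, here is a minimal, self-contained sketch of overlapping-bigram tokenization, the general technique behind Lucene's CJKTokenizer. It is not Lucene's code and not a Nutch plugin: the class name CjkBigramSketch, the isCjk check, and the lowercasing of non-CJK words are assumptions made only for illustration, and it handles BMP characters only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: a hypothetical stand-alone bigram tokenizer,
// NOT Lucene's CJKTokenizer and NOT a Nutch plugin.
public class CjkBigramSketch {

    // Rough CJK test (assumption): a handful of Unicode blocks, BMP only.
    static boolean isCjk(char c) {
        Character.UnicodeBlock b = Character.UnicodeBlock.of(c);
        return b == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || b == Character.UnicodeBlock.HIRAGANA
            || b == Character.UnicodeBlock.KATAKANA
            || b == Character.UnicodeBlock.HANGUL_SYLLABLES;
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (isCjk(c)) {
                // Collect the whole run of CJK characters [start, i).
                int start = i;
                while (i < text.length() && isCjk(text.charAt(i))) {
                    i++;
                }
                if (i - start == 1) {
                    // A lone CJK character becomes a single-character token.
                    tokens.add(text.substring(start, i));
                } else {
                    // Emit overlapping pairs: AB, BC, CD, ...
                    for (int j = start; j + 1 < i; j++) {
                        tokens.add(text.substring(j, j + 2));
                    }
                }
            } else if (Character.isLetterOrDigit(c)) {
                // Non-CJK word: keep it whole, lowercased.
                int start = i;
                while (i < text.length()
                        && Character.isLetterOrDigit(text.charAt(i))
                        && !isCjk(text.charAt(i))) {
                    i++;
                }
                tokens.add(text.substring(start, i).toLowerCase(Locale.ROOT));
            } else {
                // Whitespace and punctuation separate tokens.
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "Nutch" followed by four Chinese characters (written as escapes
        // so the source file stays plain ASCII).
        System.out.println(tokenize("Nutch \u4e2d\u6587\u68c0\u7d22"));
    }
}
```

A run of four ideographs yields three overlapping two-character tokens, so a query tokenized the same way can match without any word segmentation. That is why the index-side and query-side plugins need to use the same analyzer.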
Hi all;
I have just followed Mr. Jack Tang's solution to adopt the CJK
analyzer into Nutch 0.6. I know that solution is not perfect. In fact,
I cannot get any results returned. Can anyone help me adopt the CJK analyzer
into Nutch?
Any response is greatly appreciated!
Best Regards
Hi Nils!
If I am not totally off track, the 0.7 version (currently 0.7-dev, in
the CVS trunk) runs as a daemon process. I.e., it will poll the file
with the URLs when it has nothing else to do, so that will solve your
problem.
Regarding the duplicate content, as you can see in the tutorial there
i
Hello Ferenc,
If the pages are really identical, they can be removed using the "nutch dedup"
command. If not (sometimes such pages differ by some date, counter or
advertisement), there is currently no tool that makes it possible
to remove them. I am working on a simple tool to remove duplicates lik
Hi! This is the ezmlm program. I'm managing the
nutch-dev@lucene.apache.org mailing list.
I'm working for my owner, who can be reached
at [EMAIL PROTECTED]
To confirm that you would like
nutch-developers@lists.sourceforge.net
removed from the nutch-dev mailing list, please send an empty repl
Hi Fredrik,
thanks for that information.
That sounds really good to me.
I mean it would be perfect to
handle just one product instead
of different ones for every single task.
Anyway, can you tell me if it is possible that
Clients will insert their "ask for a url" into a url list,
out of which
Hi!
The crawler and link-structure information comes "free" with Nutch.
Once you have crawled a site, you can use the WebDBReader class to
extract the link information for further processing in a visualization
step. Simply put: Iterate crawled pages with the SegmentReader class
(open the segment y
Hi,
I'm actually working on a "service" that gives
you the ability to enter a URL and visualizes this domain
(only inner links).
Then there'll be some kind of adaptive behaviour
so that the graph will be adapted to your wishes
(searches, ranks ...)
I have a prototype that uses:
1. Arachnid as a
I'm seeing the following behavior, already reported in the past. The
scenario that causes it is one where the Nutch application is starting up
and we have queries coming in already.
in catalina.out:
050709 132712 11 query request from 211.30.2.xx
050709 132712 12 query request from 82.58.45.xx
duplicate pages - virtual hosts in db.
--
Key: NUTCH-70
URL: http://issues.apache.org/jira/browse/NUTCH-70
Project: Nutch
Type: Bug
Environment: 0,7 dev
Reporter: Lutischán Ferenc
Dear Developers,
I have a problem with nutc
10 matches