I came across a languageidentifier plugin at PluginCentral while trying to
figure out something else. Maybe this could be a starting point for you.
http://wiki.apache.org/nutch/PluginCentral
2008/1/16 Volkan Ebil <[EMAIL PROTECTED]>:
> The URL filter will solve the URL limitation problem, thanks. Does anyone
> know how I can add a check to the crawl process that allows only the sites
> that contain special characters like "ç", "ü", "ğ"? Should I study the
> parse algorithm?
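For the character check asked about above, here is a minimal sketch in plain Java. It only tests whether a piece of already-extracted page text contains any of the three characters named in the question. The class and method names are made up for illustration, and it is not wired into Nutch; where such a check would hook into the crawl (for example a parse or indexing plugin) is an assumption left open here. Note that "ü" and "ç" also occur in other languages, so this is a rough heuristic, not language detection.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: checks text for the characters
// mentioned in the question (c-cedilla, u-umlaut, g-breve).
final class SpecialCharCheck {

    // Written as Unicode escapes so the source file encoding does not matter:
    // \u00E7 = c-cedilla, \u00FC = u-umlaut, \u011F = g-breve.
    private static final Pattern SPECIAL_CHARS =
            Pattern.compile("[\u00E7\u00FC\u011F]");

    // Returns true if the text contains at least one of the characters.
    static boolean containsSpecialChars(String text) {
        return text != null && SPECIAL_CHARS.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsSpecialChars("Merhaba d\u00FCnya")); // true
        System.out.println(containsSpecialChars("Hello world"));        // false
    }
}
```

A stricter variant could count the matches and require a minimum number per page before accepting a site; the threshold would have to be tuned by hand.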
Hi,
I don't know about the special character part, but you can limit the URLs
using conf/crawl-urlfilter.txt.
Thanks,
kishore
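To illustrate the URL filter suggestion: the regex URL filter file holds one rule per line, each a `+` (accept) or `-` (reject) followed by a Java regular expression, and the first rule that matches a URL decides. The fragment below is only a sketch, assuming the goal is to keep hosts ending in ".com"; the exact pattern is an assumption and would need adjusting for other extensions.

```
# accept URLs whose host name ends in .com (illustrative pattern)
+^http://([a-z0-9-]+\.)+com(/|$)

# skip everything else
-.
```

Because the final `-.` rule rejects anything not accepted earlier, further accept rules have to be placed above it.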
-----Original Message-----
From: Volkan Ebil [mailto:[EMAIL PROTECTED]
Sent: Tuesday, January 15, 2008 6:13 PM
To: nutch-user@lucene.apache.org
Subject: Customize Crawling..
Hi,
I am a new Nutch user. My problem is customizing the crawl process. My aim
is to detect and crawl web sites written in my language. I want to crawl only
the sites that contain special characters like "ğ" or "ç", and I also
want to limit the URLs to those that end with particular extensions like
"com