Where to put plugin specific parameters / configurations

2009-03-18 Thread MyD
Hi all, where can I set parameters / configuration specific to my own plugin? Thanks in advance. Regards, MyD

Nutch 1.0 trunk Fetch Schedule

2009-03-18 Thread MyD
Hi all, is it possible to set the next fetch schedule for a URL in another crawl dir? Example: crawl.dir.A retrieves the links and sets the fetch schedule, but the schedule should go into crawl.dir.B. Thanks in advance. Regards, MyD

Re: Nutch 1.0 trunk Fetch Schedule

2009-03-18 Thread Stevan Kovacevic
Well, you can always write a bash script or a Java class that does this. Writing a Java class is probably better and easier. There is a manual for importing Nutch into Eclipse in case you don't know how. I needed a similar thing done and it turned out that using Java really is easier... On Wed,

Re: Nutch 1.0 trunk Fetch Schedule

2009-03-18 Thread MyD
Hi ripper, thanks. Do you know how to do it in Java? I tried, but haven't found the suitable classes. Thanks in advance. Cheers, MyD ripper07 wrote: Well, you can always write a bash script or a Java class that does this. Writing a Java class is probably better and easier. you have a

Re: Nutch 1.0 trunk Fetch Schedule

2009-03-18 Thread n_developer
"you have a manual for importing nutch into eclipse in case you don't know how" Can you please post the link? Thanks in advance. ripper07 wrote: Well, you can always write a bash script or a Java class that does this. Writing a Java class is probably better and easier. you have a manual for

Re: Nutch 1.0 trunk Fetch Schedule

2009-03-18 Thread Stevan Kovacevic
OK, this is how I did it. I created a class in the org.apache.nutch.crawl package, the same package that contains the Crawl class (Nutch's main class, which is called by the crawl command). In that class, you create the Crawl class with the appropriate parameters. Just look at the code once you import it

Re: Nutch 1.0 trunk Fetch Schedule

2009-03-18 Thread n_developer
ripper07 wrote: OK, this is how I did it. I created a class in the org.apache.nutch.crawl package, the same package that contains the Crawl class (Nutch's main class, which is called by the crawl command). In that class, you create the Crawl class with the appropriate parameters. just look at the

Incremental index update

2009-03-18 Thread Huang, Zijian(Victor)
Hi, I heard from my friends that doing an incremental index update in Nutch is not easy. Is there a way to configure the Nutch crawler to crawl only the changed websites and then update the existing index? Thanks, Victor

Re: The Future of Nutch

2009-03-18 Thread Alex Basa
I actually use Nutch as a large-scale search engine on two products. I think a few things that would be nice to have are built-in options to produce an incremental index and maybe a Quartz scheduler to automate it completely. One thing that would be nice is when one of us figures something

Cleaning after job failed

2009-03-18 Thread Bartosz Gadzimski
Hi, while testing a crawl (with the crawl command) of a big website (1 million URLs), I ran out of disk space. I now have: a crawldb with 1 112 000 URLs (112 000 URLs were tested before); segments with 40GB of data; an index with partial data; /tmp/hadoop-root with 173GB of temporary Hadoop data. After looking at mailing

index web

2009-03-18 Thread 陈琛
Hi all, I can index a URL like http://app02.laopdr.gov.la/ePortal/news/detail.action?id=10110&from=ePortal_NewsDetail_FromHome but cannot index one like http://app02.laopdr.gov.la/ePortal/news/detail.action?request_locale=en_US&id=10110&from=ePortal_NewsDetail_FromHome and

MergeSegments Error.

2009-03-18 Thread Armando Gonçalves
When I try to merge the segments of two crawls, 2GB and 1GB each, I get a very bizarre error: Exception in thread "main" java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232) at

Re: embed nutch crawl in an application

2009-03-18 Thread yanky young
Hi, you can look at the source code of the Crawl class, which can be used to start Nutch with the java command, without Cygwin: java -D... -classpath ... org.apache.nutch.crawl.Crawl urls -depth 10 -topN 1000 Good luck, yanky 2009/3/18 MyD myd.ro...@googlemail.com This is an interesting question. If you know
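
A minimal sketch of how an application might assemble the java command quoted in this reply before launching it as a child process. The class name CrawlCommandSketch, the method buildCommand, and the classpath value are hypothetical and only for illustration; the elided -D... options from the original command are left out because the source does not say what they were.

```java
import java.util.ArrayList;
import java.util.List;

final class CrawlCommandSketch {

    // Builds: java -classpath <classpath> org.apache.nutch.crawl.Crawl <urlDir> -depth <n> -topN <n>
    // The classpath is an assumption; it must point at the Nutch jars and conf directory.
    static List<String> buildCommand(String classpath, String urlDir, int depth, int topN) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-classpath");
        cmd.add(classpath);
        cmd.add("org.apache.nutch.crawl.Crawl");
        cmd.add(urlDir);
        cmd.add("-depth");
        cmd.add(Integer.toString(depth));
        cmd.add("-topN");
        cmd.add(Integer.toString(topN));
        return cmd;
    }

    public static void main(String[] args) {
        // "conf:lib/*" is a placeholder classpath, not a path taken from the thread.
        List<String> cmd = buildCommand("conf:lib/*", "urls", 10, 1000);
        // Only prints the command; it does not start a crawl.
        System.out.println(String.join(" ", cmd));
    }
}
```

To actually run the crawl, the list could be handed to new ProcessBuilder(cmd).inheritIO().start(). Building the arguments as a list rather than one string avoids quoting problems when a path contains spaces.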

Re: Where to put plugin specific parameters / configurations

2009-03-18 Thread yanky young
Hi, you can put any parameters in nutch-site.xml as property settings, and read a property from your plugin class with conf.get(your property name). Good luck, yanky 2009/3/18 MyD myd.ro...@googlemail.com Hi @ all, where is it possible to set plugin (my own plugin) specific parameters /
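
A rough sketch of the idea in this reply: a property is declared in the name/value format used by nutch-site.xml and then looked up by name with a fallback default. To stay self-contained, the sketch parses the XML with the JDK's DOM parser instead of using the Hadoop Configuration object a real plugin would receive, so it only imitates conf.get. The property name myplugin.max.items and the class PluginConfigSketch are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class PluginConfigSketch {

    // Sample content in the nutch-site.xml property format.
    // "myplugin.max.items" is a made-up property name for illustration.
    static final String SAMPLE_SITE_XML =
        "<configuration>"
      + "<property><name>myplugin.max.items</name><value>42</value></property>"
      + "</configuration>";

    // Reads every <property> element into a name -> value map.
    static Map<String, String> parse(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> props = new HashMap<>();
        NodeList nodes = doc.getElementsByTagName("property");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element p = (Element) nodes.item(i);
            String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
            String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
            props.put(name, value);
        }
        return props;
    }

    // Stand-in for a lookup with a default value when the property is not set.
    static String get(Map<String, String> props, String name, String defaultValue) {
        return props.getOrDefault(name, defaultValue);
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> props = parse(SAMPLE_SITE_XML);
        System.out.println(get(props, "myplugin.max.items", "10"));
        System.out.println(get(props, "myplugin.not.set", "10"));
    }
}
```

Prefixing property names with the plugin name, as in the made-up example, is one way to avoid clashing with properties that Nutch itself defines.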

Re: Incremental index update

2009-03-18 Thread yanky young
Hi, as far as I understand, in Nutch 1.0 you can configure Nutch to recrawl on a specific schedule; see this issue: http://issues.apache.org/jira/browse/NUTCH-61 and this class: AdaptiveFetchSchedule. By the way, there is no way to configure Nutch to only recrawl changed websites,
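
A simplified illustration of the general idea behind an adaptive fetch schedule: shorten the re-fetch interval when a page was found changed, lengthen it when it was not, and keep the result within bounds. This is not the code of AdaptiveFetchSchedule; the class AdaptiveIntervalSketch, the rates, and the bounds below are assumptions chosen for the example.

```java
final class AdaptiveIntervalSketch {

    // Illustrative values only; they are not read from any Nutch configuration.
    static final double INC_RATE = 0.4;                      // growth when the page is unchanged
    static final double DEC_RATE = 0.2;                      // shrink when the page changed
    static final double MIN_INTERVAL = 60.0;                 // seconds
    static final double MAX_INTERVAL = 365.0 * 24 * 3600;    // seconds

    // Returns the next re-fetch interval in seconds, clamped to [MIN_INTERVAL, MAX_INTERVAL].
    static double nextInterval(double currentSeconds, boolean pageChanged) {
        double next = pageChanged
            ? currentSeconds * (1.0 - DEC_RATE)
            : currentSeconds * (1.0 + INC_RATE);
        return Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, next));
    }

    public static void main(String[] args) {
        double interval = 1000.0;
        // Simulate three fetches: unchanged, unchanged, changed.
        boolean[] changed = {false, false, true};
        for (boolean c : changed) {
            interval = nextInterval(interval, c);
            System.out.println(interval);
        }
    }
}
```

With a rule of this kind, pages that rarely change drift toward the maximum interval and frequently changing pages toward the minimum, which reduces unnecessary fetches without skipping any page entirely.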

Fwd: index web

2009-03-18 Thread 陈琛
Please help me, it is urgent and important. Thanks. -- Forwarded message -- From: 陈琛 kylin.chc...@gmail.com Date: 2009/3/19 Subject: index web To: nutch-user@lucene.apache.org hi, all: i can get index url like