Re: Customize Crawling..

2008-01-16 Thread Manoj Bist
I came across a languageidentifier plugin at PluginCentral while trying to figure out something else. *Maybe* this could be a starting point for you. http://wiki.apache.org/nutch/PluginCentral 2008/1/16 Volkan Ebil <[EMAIL PROTECTED]>: > url filter will solve the url limitation problem, thanks. Is
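A minimal sketch of enabling that plugin, assuming a Nutch 0.9-era install; the plugin id (language-identifier) and the other ids in the regex are illustrative, so check them against the defaults listed in conf/nutch-default.xml. The override goes in conf/nutch-site.xml:

    <!-- conf/nutch-site.xml: add the plugin id to the plugin.includes regex -->
    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|language-identifier</value>
    </property>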

Need pointers regarding accessing crawled data/plugin etc.

2008-01-15 Thread Manoj Bist
Hi, I would really appreciate it if someone could provide pointers on doing the following (via plugins or otherwise). I have gone through Plugin Central on the Nutch wiki. 1.) Is it possible to control the 'policy' that decides how soon a URL is fetched? E.g., if a document does not change fr
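On point 1, the knob that usually controls this in 0.9-era Nutch is the default fetch interval; a sketch, assuming the property name db.default.fetch.interval (in days) matches your version — later versions renamed it db.fetch.interval.default (in seconds). The override goes in conf/nutch-site.xml:

    <property>
      <name>db.default.fetch.interval</name>
      <!-- re-fetch a page no sooner than 30 days after the last fetch -->
      <value>30</value>
    </property>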

Re: Exception in DeleteDuplicates.java

2008-01-15 Thread Manoj Bist
it updating nutch with the following patch > > http://www.mail-archive.com/[EMAIL PROTECTED]/msg06705.html > > I hope this will help you, good luck! > > 2008/1/13, Manoj Bist <[EMAIL PROTECTED]>: > > Hi, > > > > I am getting the following exception when I
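For anyone hitting the same exception, applying the patch from that thread is the usual route; a sketch, assuming the diff was saved locally as dedup-fix.patch (hypothetical name) and that it applies at the source root — adjust the -p strip level if the paths in the diff differ:

    cd nutch-0.9
    patch -p0 --dry-run < dedup-fix.patch   # verify it applies cleanly first
    patch -p0 < dedup-fix.patch
    ant                                     # rebuild Nutch after patching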

Re: Exception in DeleteDuplicates.java

2008-01-14 Thread Manoj Bist
a similar problem when trying to Dedup, I > solved it updating nutch with the following patch > > http://www.mail-archive.com/[EMAIL PROTECTED]/msg06705.html > > I hope this will help you, good luck! > > 2008/1/13, Manoj Bist <[EMAIL PROTECTED]>: > > Hi,

Re: 'crawled already exists' - how do I recrawl?

2008-01-12 Thread Manoj Bist
Regards, > Susam Pal > > On Jan 13, 2008 8:36 AM, Manoj Bist <[EMAIL PROTECTED]> wrote: > > Hi, > > > > When I run crawl the second time, it always complains that 'crawled' > already > > exists. I always need to remove this directory using 'hadoo

Exception in DeleteDuplicates.java

2008-01-12 Thread Manoj Bist
Hi, I am getting the following exception when I do a crawl using Nutch, and I am stuck because of it. I would really appreciate any pointers to resolving this. I found a related mail thread, but it doesn't describe a solut

'crawled already exists' - how do I recrawl?

2008-01-12 Thread Manoj Bist
Hi, When I run crawl the second time, it always complains that 'crawled' already exists. I always need to remove this directory using 'hadoop dfs -rm crawled' to get going. Is there some way to avoid this error and tell Nutch that it's a recrawl? bin/nutch crawl urls -dir crawled -depth 1 2>&1 |
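A sketch of the usual workarounds, assuming the 0.9-era bin/hadoop and bin/nutch scripts: either clear the old output before re-running, or write each crawl to a fresh directory so nothing collides:

    # option 1: remove the previous crawl output first (-rmr is the recursive remove)
    bin/hadoop dfs -rmr crawled
    bin/nutch crawl urls -dir crawled -depth 1

    # option 2: a timestamped output dir per run
    bin/nutch crawl urls -dir crawled-`date +%Y%m%d%H%M` -depth 1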

Using Nutch for crawling + storing RSS feeds.

2008-01-06 Thread Manoj Bist
Hi, I posted this earlier to the hadoop-user mailing list and was told I should post it to nutch-user. I would really appreciate any response to this. I need to build a system that crawls a given set of RSS feed URLs periodically. For each RSS feed, the system needs to maintain a master RSS
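A starting point for the feed-parsing half of this, assuming a stock install that ships the parse-rss plugin (the ids shown are illustrative; verify against conf/nutch-default.xml). Enable it alongside the usual plugins in conf/nutch-site.xml, then crawl the feed URLs on a short cycle:

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(text|html|rss)|index-basic|query-(basic|site|url)</value>
    </property>

The per-feed "master RSS" aggregation isn't something stock Nutch provides; that part would need custom code (e.g. a parse or indexing plugin) on top.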