I sent my response before the rest of the thread developed, but my idea was not to store the hash in the LinkDB; it was to look up the URL currently being filtered in the LinkDB to see whether it had been fetched before. The first portion of Madan's email stated that he was implementing a URL filter which acts solely on the URL (string), hence my partial recommendation. A URL filter cannot be used to access the content and compute the hash, mainly because at this stage the URL hasn't been fetched yet and no content is available, as you explained in your email.
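For what it's worth, the URL-string-only contract described above can be sketched as follows. This is a minimal, standalone illustration, not actual Nutch code: the real extension point is `org.apache.nutch.net.URLFilter`, whose `filter` method returns the URL to accept it or `null` to reject it; here a plain in-memory `Set` stands in for the "was this fetched before" lookup that would really go through the LinkDB.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the contract Nutch URL filters follow: return the URL string to
// accept it, or null to reject it. The class name and the in-memory set are
// stand-ins for this example; a real plugin would implement
// org.apache.nutch.net.URLFilter and consult the LinkDB instead of a Set.
public class SeenUrlFilter {
    private final Set<String> alreadyFetched;

    public SeenUrlFilter(Set<String> alreadyFetched) {
        this.alreadyFetched = new HashSet<>(alreadyFetched);
    }

    /** Reject URLs that were fetched before; accept everything else. */
    public String filter(String urlString) {
        return alreadyFetched.contains(urlString) ? null : urlString;
    }
}
```

Note that this only catches exact URL duplicates; as discussed above, a filter working on the string alone cannot see content, so near-duplicate detection has to happen elsewhere.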
Sorry if my out-of-time email caused some confusion :)

Regards,

----- Original Message -----
From: "Markus Jelsma" <[email protected]>
To: [email protected]
Sent: Wednesday, February 18, 2015 6:31:51 PM
Subject: [MASSMAIL]RE: [MASSMAIL]URL filter plugins for nutch

Hi Jorge - perhaps I am missing something, but the LinkDB cannot hold content-derived information such as similarity hashes, nor does it cluster similar URLs as you would want when detecting spider traps. What do you think?

Markus

-----Original message-----
> From: Jorge Luis Betancourt González <[email protected]>
> Sent: Wednesday 18th February 2015 23:05
> To: [email protected]
> Subject: Re: [MASSMAIL]URL filter plugins for nutch
>
> The idea behind the URL filter plugins is to decide whether the current URL
> (string) should be allowed to be fetched or not. In your particular case I
> think you could try to read the LinkDB and then decide whether you want to
> fetch this particular URL. Keep in mind that this logic should be fast,
> because it is going to be executed many times (once for each URL).
>
> I don't know of any plugin that does this; typically this is kind of hard to
> do right (if possible at all), but you can check out the LinkDbReader for a
> way to read from the LinkDB to do your check. One more detail: if you only
> filter by the URL, you can find resources on the Web where the content has
> changed, and in that case you would discard fetching the resource.
>
> Regards,
>
> ----- Original Message -----
> From: "Madan Patil" <[email protected]>
> To: "user" <[email protected]>
> Sent: Wednesday, February 18, 2015 3:09:00 PM
> Subject: [MASSMAIL]URL filter plugins for nutch
>
> Hi,
>
> I am working on an assignment where I am supposed to use Nutch to crawl
> Antarctic data.
> I am writing a plugin which extends URLFilter to avoid crawling duplicate
> (exact and near-duplicate) URLs. All the plugins, the default ones and
> others on the web, take only one URL. They decide what to do or not to do
> based on the content of one URL.
>
> Could anyone point me to resources which would help me compare the content
> of one URL with the ones already crawled?
>
> Thanks in advance.
>
> Regards,
> Madan Patil
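On the near-duplicate part of Madan's question: one common technique for comparing content is a 64-bit simhash, where similar documents produce hashes with a small Hamming distance. The sketch below is a self-contained illustration under my own assumptions (FNV-1a token hashing, whitespace tokenization); it is not Nutch code, and as discussed in the thread the content to hash would only be available after fetching and parsing, not inside a URL filter.

```java
// Illustrative standalone sketch (not Nutch code): a 64-bit simhash for
// near-duplicate detection. Each token votes on every bit position; similar
// texts end up with hashes that differ in only a few bits.
public class SimHash {

    /** Compute a 64-bit simhash over whitespace/punctuation-split tokens. */
    public static long simhash(String text) {
        int[] counts = new int[64];
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            long h = fnv1a64(token);
            for (int i = 0; i < 64; i++) {
                counts[i] += ((h >>> i) & 1L) == 1L ? 1 : -1;
            }
        }
        long result = 0L;
        for (int i = 0; i < 64; i++) {
            if (counts[i] > 0) result |= 1L << i;
        }
        return result;
    }

    /** Number of differing bits between two simhashes. */
    public static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    // 64-bit FNV-1a hash of a single token.
    private static long fnv1a64(String s) {
        long hash = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            hash ^= s.charAt(i);
            hash *= 0x100000001b3L;
        }
        return hash;
    }
}
```

In a crawl pipeline, one would store the simhash of each fetched page and treat a new page as a near duplicate when its Hamming distance to some stored hash falls below a chosen threshold (commonly a single-digit number of bits for 64-bit hashes).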

