RE: URL filter plugins for nutch
Easiest is to set the signature to TextProfileSignature and delete duplicates from the index, but you will still crawl and waste resources on them. Or are you by any chance trying to prevent spider traps from being crawled?

Markus
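The signature switch Markus describes is a nutch-site.xml change; a minimal sketch, assuming the stock Nutch 1.x property name db.signature.class (check nutch-default.xml in your release for the exact default):

```xml
<!-- nutch-site.xml: use a text-profile signature so near-identical pages
     produce the same signature and can be deduplicated after indexing. -->
<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
</property>
```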
RE: URL filter plugins for nutch
Hi - this is not going to work. The URLFilter interface operates on single URLs only; it is not aware of content, nor of any metadata (such as a simhash) attached to the CrawlDatum. It would be more straightforward to implement Signature and calculate the simhash there.

Now, Nutch has a DeduplicationJob, but it operates on equal signatures as the MapReduce key, and that is not going to work with simhashes. I remember there was a trick to get similar hashes into the same key buckets by emitting them to multiple buckets from the mapper, so that in the reducer you can compute a Sørensen similarity on the hashes. This is really tricky stuff, especially getting the hashes into the same bucket.

Are you doing this to remove duplicates from search results? Then it might be easier to implement the Sørensen similarity in a custom Lucene collector: because the top docs contain the duplicates, they all pass through the same collector implementation, which gives you a single point to remove them. The problem is that this won't really work with distributed search unless you hash similar URLs to the same shard, but then the cluster becomes unbalanced and difficult to manage, and IDF and norms become skewed.

Good luck - we have tried many different approaches to this problem, especially online deduplication. But offline is also hard because of the reducer keys.

Markus
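Markus's suggestion to compute the simhash in a Signature implementation can be sketched outside Nutch. Below is a minimal, self-contained illustration of Charikar's simhash; the whitespace tokenization and the FNV-1a token hash are simplifications chosen here for the example, not Nutch code. Near-duplicate documents then show up as signatures a small Hamming distance apart.

```java
public class SimhashSketch {

    // 64-bit simhash: each token's hash casts a +1/-1 vote per bit position;
    // the final signature takes the sign of each bit's vote total.
    static long simhash(String text) {
        int[] votes = new int[64];
        for (String token : text.toLowerCase().split("\\s+")) {
            long h = fnv1a64(token);
            for (int bit = 0; bit < 64; bit++) {
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long sig = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) sig |= (1L << bit);
        }
        return sig;
    }

    // FNV-1a: a simple stand-in for whatever token hash a real implementation uses.
    static long fnv1a64(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // Number of differing bits between two signatures.
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }
}
```

Texts with heavily overlapping tokens end up a few bits apart; a commonly cited near-duplicate threshold is 3 or fewer differing bits out of 64, though the right cutoff depends on the corpus.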
Re: URL filter plugins for nutch
This is really tricky given the familiarity I have with Nutch. I will try the Sørensen approach as you suggested. Thanks for the input, Markus.

Regards,
Madan Patil
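The Sørensen (Dice) similarity discussed in this thread is computed over two token sets as 2|A∩B| / (|A| + |B|). This standalone sketch shows the basic calculation; the whitespace tokenization is an illustrative choice, not from Nutch:

```java
import java.util.HashSet;
import java.util.Set;

public class DiceSimilarity {

    // Sørensen-Dice coefficient: 2|A∩B| / (|A| + |B|), in [0, 1].
    static double dice(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return 2.0 * intersection.size() / (a.size() + b.size());
    }

    // Naive whitespace tokenizer, sufficient for this illustration.
    static Set<String> tokens(String text) {
        return new HashSet<>(java.util.Arrays.asList(text.toLowerCase().split("\\s+")));
    }
}
```

For example, "a b c" versus "b c d" share 2 of their 3 tokens each, giving 2·2 / (3+3) = 2/3.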
Re: about indexing to multiple solr servers
Hi Eyeris,

On Wed, Feb 18, 2015 at 12:10 PM, user-digest-h...@nutch.apache.org wrote:

I have a question, and sorry if it is a trivial thing. Is there any way to index into multiple Solr servers (at least 2) using Nutch 1.9? I have configured Solr with one master and 2 slaves, but I need 2 masters and 2 slaves; the problem is how Nutch can index into more than one Solr. If I have only one Solr master and it fails, that is a problem. Any advice or post is welcome.

There are a couple of open issues with patches you can reference:

https://issues.apache.org/jira/browse/NUTCH-1480
https://issues.apache.org/jira/browse/NUTCH-945

I think actually re-writing some of these patches to apply to trunk 1.10 would be really fantastic. If you make any progress on this then please comment on these issues.

Thanks
Lewis
URL filter plugins for nutch
Hi,

I am working on an assignment where I am supposed to use Nutch to crawl Antarctic data. I am writing a plugin which extends URLFilter so that duplicate (exact and near-duplicate) URLs are not crawled. All the plugins, the default ones and others on the web, take only one URL: they decide what to do, or not to do, based on the content of that one URL. Could anyone point me to resources that would help me compare the content of one URL with the ones already crawled? Thanks in advance.

Regards,
Madan Patil
Re: URL filter plugins for nutch
Hi Markus,

I am looking for the ones with similar content.

Regards,
Madan Patil

On Wed, Feb 18, 2015 at 12:53 PM, Markus Jelsma markus.jel...@openindex.io wrote: By near-duplicate you mean similar URLs, or URLs with similar content?
Re: URL filter plugins for nutch
I am not sure I understand you right, but here is what I am trying to implement: I have implemented Charikar's simhash and now want to use it to detect near-duplicates/duplicates. I would like to make it a plugin (one that implements the URLFilter interface), and hence filter out all those URLs whose content is nearly the same as ones which have already been fetched. Would this be possible, or am I heading in the wrong direction? Thanks for your patience, Markus.

Regards,
Madan Patil
RE: URL filter plugins for nutch
By near-duplicate do you mean similar URLs, or URLs with similar content?

Markus
Re: [MASSMAIL]URL filter plugins for nutch
The idea behind the URL filter plugins is to decide whether the current URL (string) should be allowed to be fetched or not. In your particular case, I think you could try to read the LinkDB and then decide whether you want to fetch this particular URL. Keep in mind that this logic should be fast, because it is going to be executed many times (once for each URL). I don't know of any plugin that does this; typically this is kind of hard to do right (if possible at all), but you can check out the LinkDbReader for a way to read from the LinkDB to do your check.

One more detail: if you filter only by the URL, you can hit resources on the Web whose content has changed, and in that case you will wrongly discard the fetching of the resource.

Regards,
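To make the point concrete that a URL filter sees only the URL string, here is a simplified stand-in: the interface below mirrors the shape of Nutch's URLFilter contract but is not the real org.apache.nutch.net.URLFilter class, and the fetched-URL set is a hypothetical placeholder for a LinkDB lookup.

```java
import java.util.Set;

public class SeenUrlFilter {

    // Simplified mirror of Nutch's URL filter contract: return the URL to
    // accept it, or null to reject it. Only the URL string is available here.
    interface UrlFilter {
        String filter(String url);
    }

    // Hypothetical "was this URL fetched before?" check - in a real plugin
    // this set would be replaced by a (fast) LinkDB or CrawlDb lookup.
    static UrlFilter rejectAlreadyFetched(Set<String> fetched) {
        return url -> fetched.contains(url) ? null : url;
    }
}
```

Note that even this exact-match version works only per URL string; page content is simply not observable at this stage, which is why a content hash such as a simhash belongs in a Signature implementation instead.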
NUTCH-762 Generate Multiple Segments
Hi Folks,

I am facing the exact same problem that is described in JIRA NUTCH-762, i.e. the generate/update takes an excessive amount of time while the actual fetch takes much less time by comparison. The JIRA issue commits a patch to allow generating multiple segments in a single generate phase; however, I was not able to do so. How can I generate multiple segments in a single generate phase? I am using Nutch 1.7 on YARN 2.3.0; any help would be greatly appreciated.

Thanks.
Re: [MASSMAIL]RE: [MASSMAIL]URL filter plugins for nutch
Well, I sent my response before the rest of the thread developed, but my idea was not to store the hash in the LinkDB; it was to look up the current URL being filtered in the LinkDB, to see whether it was fetched before. The first portion of Madan's email stated that he was implementing a URL filter, which acts solely on the URL (string), hence my kind of partial recommendation. As for accessing the content to compute the hash, a URL filter cannot be used, mainly because at this stage the URL hasn't been fetched and no content is available yet, as you explained in your email. Sorry if my out-of-time email caused some confusion :)

Regards,
RE: [MASSMAIL]URL filter plugins for nutch
Hi Jorge - perhaps I am missing something, but the LinkDB cannot hold content-derived information such as similarity hashes, nor does it cluster similar URLs as you would want when detecting spider traps. What do you think?

Markus