Re: Crawling same domain URLs
Thanks Sebastian, this makes sense. I will override the URLPartitioner accordingly.

Regards
Prateek

On Tue, May 11, 2021 at 4:16 PM Sebastian Nagel wrote:

> Hi Prateek,
>
> Alternatively, you could modify the URLPartitioner [1], so that during
> the "generate" step the URLs of a specific host or domain are distributed
> over more partitions. One partition is the fetch list of one fetcher map
> task. At Common Crawl we partition by domain and made the number of
> partitions configurable to assign more fetcher tasks to certain
> super-domains, e.g. wordpress.com or blogspot.com, see [2].
>
> Best,
> Sebastian
>
> [1] https://github.com/apache/nutch/blob/6c02da053d8ce65e0283a144ab59586e563608b8/src/java/org/apache/nutch/crawl/URLPartitioner.java#L75
> [2] https://github.com/commoncrawl/nutch/blob/98a137910aa30dcb4fa1acd720fb4a4b7d9c520f/src/java/org/apache/nutch/crawl/URLPartitioner.java#L131
>     (used by Generator2)
Re: Crawling same domain URLs
Hi Prateek,

Alternatively, you could modify the URLPartitioner [1], so that during the "generate" step the URLs of a specific host or domain are distributed over more partitions. One partition is the fetch list of one fetcher map task. At Common Crawl we partition by domain and made the number of partitions configurable to assign more fetcher tasks to certain super-domains, e.g. wordpress.com or blogspot.com, see [2].

Best,
Sebastian

[1] https://github.com/apache/nutch/blob/6c02da053d8ce65e0283a144ab59586e563608b8/src/java/org/apache/nutch/crawl/URLPartitioner.java#L75
[2] https://github.com/commoncrawl/nutch/blob/98a137910aa30dcb4fa1acd720fb4a4b7d9c520f/src/java/org/apache/nutch/crawl/URLPartitioner.java#L131
    (used by Generator2)

On 5/11/21 3:07 PM, Markus Jelsma wrote:

> Hello Prateek,
>
> You are right, it is limited by the number of CPU cores and how many
> threads they can handle, but you can still process a million records per
> day if you have a few cores. If you parse as a separate step, it can run
> even faster.
>
> Indeed, it won't work if you need to process 10 million records of the
> same host every day. If you want to use Hadoop for this, you can opt for
> a custom YARN application [1]. We have done that too for some of our
> distributed tools, and it works very well.
>
> Regards,
> Markus
>
> [1] https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html
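Sebastian's idea of spreading one domain's URLs over several partitions can be sketched as follows. This is a standalone illustration, not Nutch's actual URLPartitioner (which implements Hadoop's `Partitioner` interface); the class name, the `maxPerDomain` parameter, and the salting scheme are made up for the example. The Common Crawl fork linked in [2] takes the per-domain partition count from configuration instead.

```java
// Hypothetical sketch of domain-salted partitioning: URLs of one host are
// spread over up to `maxPerDomain` partitions instead of always landing in
// the single partition derived from the host hash.
public class DomainSpreadPartitioner {

    private final int maxPerDomain; // partitions allotted to a single host

    public DomainSpreadPartitioner(int maxPerDomain) {
        this.maxPerDomain = maxPerDomain;
    }

    /** Crude host extraction, good enough for this sketch. */
    static String hostOf(String url) {
        int start = url.indexOf("//");
        start = (start < 0) ? 0 : start + 2;
        int end = url.indexOf('/', start);
        return (end < 0) ? url.substring(start) : url.substring(start, end);
    }

    /**
     * The base partition comes from the host, as in plain host/domain
     * partitioning; a salt derived from the full URL spreads the host's
     * URLs over up to maxPerDomain consecutive partitions.
     */
    public int getPartition(String url, int numPartitions) {
        String host = hostOf(url);
        int base = (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
        int salt = (url.hashCode() & Integer.MAX_VALUE) % maxPerDomain;
        return (base + salt) % numPartitions;
    }
}
```

Note that, as discussed later in the thread, each fetcher task then enforces the crawl delay only against its own queue, so the effective politeness delay for the host shrinks by roughly the spread factor.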
Re: Crawling same domain URLs
Hello Prateek,

You are right, it is limited by the number of CPU cores and how many threads they can handle, but you can still process a million records per day if you have a few cores. If you parse as a separate step, it can run even faster.

Indeed, it won't work if you need to process 10 million records of the same host every day. If you want to use Hadoop for this, you can opt for a custom YARN application [1]. We have done that too for some of our distributed tools, and it works very well.

Regards,
Markus

[1] https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html

On Tue, May 11, 2021 at 14:54, prateek wrote:

> Hi Markus,
>
> Depending on the number of cores on the machine, I can only increase the
> number of threads up to a limit; after that, performance degradation
> comes into the picture. So running a single mapper will still be a
> bottleneck in this case. I am looking for options to distribute the
> same-domain URLs across various mappers. I am not sure whether that is
> even possible with Nutch.
>
> Regards
> Prateek
Re: Crawling same domain URLs
Hi Markus,

Depending on the number of cores on the machine, I can only increase the number of threads up to a limit; after that, performance degradation comes into the picture. So running a single mapper will still be a bottleneck in this case. I am looking for options to distribute the same-domain URLs across various mappers. I am not sure whether that is even possible with Nutch.

Regards
Prateek

On Tue, May 11, 2021 at 11:58 AM Markus Jelsma wrote:

> Hello Prateek,
>
> If you want to fetch from the same host/domain as fast as you want,
> increase the number of threads and the number of threads per queue, then
> decrease all the fetch delays.
>
> Regards,
> Markus
Re: Crawling same domain URLs
Hello Prateek,

If you want to fetch from the same host/domain as fast as you want, increase the number of threads and the number of threads per queue, then decrease all the fetch delays.

Regards,
Markus

On Tue, May 11, 2021 at 12:48, prateek wrote:

> Hi Lewis,
>
> As mentioned earlier, it does not matter how many mappers I assign to the
> fetch tasks. Since all the URLs are of the same domain, everything will
> be assigned to the same mapper and all the other mappers will have no
> task to execute. So I am looking for ways to crawl same-domain URLs
> quickly.
>
> Regards
> Prateek
Re: Crawling same domain URLs
Hi Lewis,

As mentioned earlier, it does not matter how many mappers I assign to the fetch tasks. Since all the URLs are of the same domain, everything will be assigned to the same mapper and all the other mappers will have no task to execute. So I am looking for ways to crawl same-domain URLs quickly.

Regards
Prateek

On Mon, May 10, 2021 at 1:02 AM Lewis John McGibbney wrote:

> Hi Prateek,
>
> mapred.map.tasks --> mapreduce.job.maps
> mapred.reduce.tasks --> mapreduce.job.reduces
>
> You should be able to override these in nutch-site.xml and then publish
> to your Hadoop cluster.
>
> lewismc
Re: Crawling same domain URLs
Hi Prateek,

mapred.map.tasks --> mapreduce.job.maps
mapred.reduce.tasks --> mapreduce.job.reduces

You should be able to override these in nutch-site.xml and then publish to your Hadoop cluster.

lewismc

On 2021/05/07 15:18:38, prateek wrote:

> Hi,
>
> I am trying to crawl URLs that all belong to the same domain (around
> 140k). Because all same-domain URLs go to the same mapper, only one
> mapper is used for fetching; all the others are just a waste of
> resources. These are the configurations I have tried so far, but it is
> still very slow.
>
> Attempt 1 -
> fetcher.threads.fetch : 10
> fetcher.server.delay : 1
> fetcher.threads.per.queue : 1
> fetcher.server.min.delay : 0.0
>
> Attempt 2 -
> fetcher.threads.fetch : 10
> fetcher.server.delay : 1
> fetcher.threads.per.queue : 3
> fetcher.server.min.delay : 0.5
>
> Is there a way to distribute the same-domain URLs across all the
> fetcher.threads.fetch threads? I understand that in this case the crawl
> delay cannot be enforced across different mappers, but for my use case
> it is OK to crawl aggressively. So any suggestions?
>
> Regards
> Prateek
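The renamed Hadoop properties can be overridden in nutch-site.xml with the usual Hadoop configuration syntax; a minimal fragment might look like this (the values 8 and 4 are placeholders, pick counts that match your cluster):

```xml
<!-- nutch-site.xml: example task counts, tune to your cluster -->
<property>
  <name>mapreduce.job.maps</name>
  <value>8</value>
</property>
<property>
  <name>mapreduce.job.reduces</name>
  <value>4</value>
</property>
```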
Crawling same domain URLs
Hi,

I am trying to crawl URLs that all belong to the same domain (around 140k). Because all same-domain URLs go to the same mapper, only one mapper is used for fetching; all the others are just a waste of resources. These are the configurations I have tried so far, but it is still very slow.

Attempt 1 -
fetcher.threads.fetch : 10
fetcher.server.delay : 1
fetcher.threads.per.queue : 1
fetcher.server.min.delay : 0.0

Attempt 2 -
fetcher.threads.fetch : 10
fetcher.server.delay : 1
fetcher.threads.per.queue : 3
fetcher.server.min.delay : 0.5

Is there a way to distribute the same-domain URLs across all the fetcher.threads.fetch threads? I understand that in this case the crawl delay cannot be enforced across different mappers, but for my use case it is OK to crawl aggressively. So any suggestions?

Regards
Prateek
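For reference, Attempt 2 expressed as a nutch-site.xml fragment (property names and values copied from the attempt above; check them against the nutch-default.xml of your Nutch version, as defaults and accepted value types can differ between releases):

```xml
<!-- nutch-site.xml: the settings from Attempt 2 -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>1</value>
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>3</value>
</property>
<property>
  <name>fetcher.server.min.delay</name>
  <value>0.5</value>
</property>
```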