Re: [MASSMAIL]Re: [MASSMAIL]How to make Nutch 1.7 request mimic a browser?
Jorge, I think I spoke too soon. If I use the protocol-httpclient plugin, I am unable to fetch any page using the parsechecker; I get a "[Fatal Error] :1:1: Content is not allowed in prolog." error. Are there any known issues with using protocol-httpclient? I am using Nutch 1.7 and have the following settings in my nutch-site.xml:

<!-- Added based on the suggestion from nutch mailing list -->
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <name>http.useHttp11</name>
  <value>true</value>
  <description>NOTE: at the moment this works only for protocol-httpclient. If true, use HTTP 1.1, if false use HTTP 1.0.</description>
</property>

Thanks.

On Sun, Mar 1, 2015 at 10:05 PM, Jorge Luis Betancourt González jlbetanco...@uci.cu wrote:

The general answer is: it depends. Usually it is polite to present your robot to the website so the webmaster knows what is accessing the site; this is why Google and a lot of other search engines (big and small) use a distinctive name for their crawlers/bots. That being said, the first site that you mention works fine for a quick parsechecker that I've executed:

➜ local bin/nutch parsechecker http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
fetching: http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
parsing: http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
contentType: text/html
signature: 8e90c6d581f27c36828d433f746e4d7a
- Url --- http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod - ParseData - Version: 5 Status: success(1,0) Title: Dressing for the Dark Outlinks: 151
outlink: toUrl: http://www.neimanmarcus.com/cssbundle/1468949595/bundles/product_rwd.css anchor:
outlink: toUrl: http://www.neimanmarcus.com/category/templates/css/r_rBrand.css anchor:
outlink: toUrl: http://www.neimanmarcus.com/category/templates/css/r_rProduct.css anchor:
outlink: toUrl: http://www.neimanmarcus.com/jsbundle/2144966094/bundles/general_rwd.js anchor:
... (trimmed due to length)

As for the second one I wasn't able to do a test; the provided URL blocks access from my IP/country: "This request is blocked by the SonicWALL Gateway Geo IP Service. Country Name: Cuba." Reading your experience with this website, it looks like an error in the website programming; basically I'm assuming they are saying "if your User-Agent is not X, Y or Z then serve the mobile version", and this could be worth reporting. Trying to fool the website by tweaking the user agent to give the impression that your bot is a regular user could work for now, but could draw the webmaster's attention and could be a cause for blocking your access; this depends a lot on the webmaster :). But for your particular case it could be your only solution, if the webmaster doesn't have a problem with the increase in traffic.

Regards,

- Original Message -
From: Meraj A. Khan mera...@gmail.com
To: user@nutch.apache.org
Sent: Saturday, February 28, 2015 12:09:47 AM
Subject: [MASSMAIL]Re: [MASSMAIL]How to make Nutch 1.7 request mimic a browser?

Hi Jorge, Yes, I was exploring changing the http.agent.name property value in cases where the sites either serve the mobile version or outright deny the request if no agent is specified. For example, the following URL will give a "Request Rejected" response if the User-Agent is not specified.
http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod

And the following URL will serve a mobile version: http://www.techforless.com/cgi-bin/tech4less/60PN5000. So is it good practice to set http.agent.name to something like the value below, to mimic a Chrome browser?

Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36

On Fri, Feb 27, 2015 at 3:21 PM, Jorge Luis Betancourt González jlbetanco...@uci.cu wrote:

Hi Meraj, Can you provide an example URL and explain exactly what you're after? If the page you're trying to fetch has a lot of JavaScript/AJAX, keep in mind that browsers do a lot of work on the downloaded page: when you visit a page, the HTML is downloaded, the referenced CSS files are also fetched and applied to the HTML (also inline styles, etc.), and any referenced JavaScript is downloaded and executed on top of the loaded DOM (also inline script tags). The same applies to fonts, etc. The browser knows how to deal with all these resources, and the CSS is applied depending on which browser you're using. The Nutch crawler only knows
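Returning to the "[Fatal Error] Content is not allowed in prolog" issue raised at the top of this thread: a quick way to rule out protocol-httpclient as the culprit is to switch the protocol plugin back to the default protocol-http and re-run the parsechecker. This is only a sketch of the relevant nutch-site.xml fragment, keeping the rest of the plugin list exactly as posted above:

<!-- Fallback sketch: use the default protocol-http plugin instead of protocol-httpclient -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

If the same page then parses cleanly, the error is likely specific to the httpclient-based fetch rather than to the page itself.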
Re: [MASSMAIL]Re: [MASSMAIL]How to make Nutch 1.7 request mimic a browser?
Thanks Jorge, I appreciate your help. On Sun, Mar 1, 2015 at 10:05 PM, Jorge Luis Betancourt González jlbetanco...@uci.cu wrote: The general answer is: it dependes, usually is polite to present your robot to the website so the webmaster knows what is accessing the site, this is why google and a lot of other search engines (big and small) use a distinctive name for their crawlers/bots. That being said, the first site that you mention works fine for a quick parsechecker that I've executed: ➜ local bin/nutch parsechecker http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod fetching: http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod parsing: http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod contentType: text/html signature: 8e90c6d581f27c36828d433f746e4d7a - Url --- http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod - ParseData - Version: 5 Status: success(1,0) Title: Dressing for the Dark Outlinks: 151 outlink: toUrl: http://www.neimanmarcus.com/cssbundle/1468949595/bundles/product_rwd.css anchor: outlink: toUrl: http://www.neimanmarcus.com/category/templates/css/r_rBrand.css anchor: outlink: toUrl: http://www.neimanmarcus.com/category/templates/css/r_rProduct.css anchor: outlink: toUrl: http://www.neimanmarcus.com/jsbundle/2144966094/bundles/general_rwd.js anchor: ... (trimmed due length) As for the second one I wasn't able to do a test, the provided blocks access from my IP/country: This request is blocked by the SonicWALL Gateway Geo IP Service. Country Name:Cuba. Reading your experience with this website, looks like an error in the website programming, basically I'm assuming they are saying if your User Agent is not X,Y or Z then serve the mobile version, this could worth reporting. Trying to fool the website giving the impression that your bot is a regular user by tweaking the user agent could work for now, but could draw in webmaster's attention and could be a cause for blocking your access, this depends a lot on the webmaster :). But for your particular case could be your only solution if the webmaster doesn't have a problem with the increase in traffic. Regards, - Original Message - From: Meraj A. Khan mera...@gmail.com To: user@nutch.apache.org Sent: Saturday, February 28, 2015 12:09:47 AM Subject: [MASSMAIL]Re: [MASSMAIL]How to make Nutch 1.7 request mimic a browser? Hi Jorge, Yes, I was exploring changing the http.agent.name property value in case where the sites either serve the mobile version or outright deny the request if no agent is specified. For example the following URL will give Request Rejected response if the User-Agent is not specified. http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod And the following URL will server a mobile version. http://www.techforless.com/cgi-bin/tech4less/60PN5000. So is it a good practice to set the http.agent.name to something like the below , to mimic a Chrome browser? Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36 On Fri, Feb 27, 2015 at 3:21 PM, Jorge Luis Betancourt González jlbetanco...@uci.cu wrote: Hi Meraj, Can you provide an example URL? explain exactly what you're after? 
If the page you're trying to fetch has a lot of JavaScript/AJAX, keep in mind that browsers do a lot of work on the downloaded page: when you visit a page, the HTML is downloaded, the referenced CSS files are also fetched and applied to the HTML (also inline styles, etc.), and any referenced JavaScript is downloaded and executed on top of the loaded DOM (also inline script tags). The same applies to fonts, etc. The browser knows how to deal with all these resources, and the CSS is applied depending on which browser you're using. The Nutch crawler only knows about the downloaded HTML (similar to what you see when you view the source code of a webpage); it doesn't know what a CSS style is. Basically the crawler is only interested in the links and the textual/binary content of the webpage, so when a page is fetched by Nutch the HTML is downloaded but the other resources (fonts, styles, JavaScript) are not applied to the fetched page. Tweaking the http.agent.name property in nutch-site.xml will only help with sites that vary their response based on the user agent (one version for mobile and a different one for desktop browsers). This approach is being replaced by responsive design, meaning that the user agent is no longer important for how the page is rendered. In the current trunk of the upcoming 1.10 version a plugin has been merged that could
Re: Can anyone fetch this page?
Can you please set the user agent to something that resembles a browser like Chrome for example and test? I just posted a query yesterday for a similar issue where the mobile version of the site gets served up instead of 500. On Fri, Feb 27, 2015 at 1:08 PM, Iain Lopata ilopa...@hotmail.com wrote: I get a 500. Have tried removing Nutch from my user-agent string and still get the same result. -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Friday, February 27, 2015 12:05 PM To: user@nutch.apache.org Subject: RE: Can anyone fetch this page? Seems fine to me http://oldservice.openindex.io/extract.php?url=http%3A%2F%2Fwww.nature.com%2Fnature%2Fjournal%2Fv518%2Fn7540%2Ffull%2Fnature14236.html -Original message- From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com Sent: Friday 27th February 2015 18:56 To: user@nutch.apache.org Subject: Can anyone fetch this page? Hi Folks, I was getting 500 internal server error using Nutch trunk when attempting to fetch content from this domain. http://www.nature.com Just for detail, Nature.com is a catalogue of journals and science resources, including the journal *Nature*. Publishes science news and articles across a wide range of scientific fields. So it is nothing malicious or sensitive/offending content-wise. Can anyone else fetch this URL? I can get it with curl and wget but not Nutch. Thanks Lewis -- *Lewis*
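One way to isolate the User-Agent variable outside of Nutch, given that plain curl reportedly succeeds here, is to repeat the curl request while pretending to be a crawler. A rough sketch (the "MyNutchCrawler/1.7" agent string is purely hypothetical; substitute whatever value your http.agent.name actually produces):

# Works with curl's default agent, per the report above
curl -sI http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html

# Same request with a crawler-looking agent string
curl -sI -A "MyNutchCrawler/1.7" http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html

If the second request also comes back with a 500, the server is keying off the agent string; if both succeed, the problem is elsewhere in the Nutch fetch (headers, robots handling, redirects).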
How to make Nutch 1.7 request mimic a browser?
In some instances the content that is downloaded in the Fetch phase from an HTTP URL is not what you would get if you accessed the same URL from a well-known browser such as Google Chrome; that is because the server expects a user agent value that represents a browser. There is an http.agent.name property in nutch-site.xml; is that the property that should be used to set the user agent, so that the server responds to a Nutch GET request the same way it would to a request from a browser? Or is there another configurable property? For example, the user agent value for a Chrome browser is below.

Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36

Thanks.
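For reference, the agent is indeed set through http.agent.name in nutch-site.xml. A minimal sketch, assuming the Chrome string above is the value to send (Nutch composes the final User-Agent header from http.agent.name plus the other http.agent.* fields such as version, description, url and email, so check what actually goes out on the wire; the replies in this thread also discuss whether impersonating a browser is advisable at all):

<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36</value>
  <description>Sketch: agent name used to build the HTTP 'User-Agent' request header.</description>
</property>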
NUTCH-762 Generate Multiple Segments
Hi Folks, I am facing the exact same problem described in JIRA NUTCH-762, i.e. the generate/update steps take an excessive amount of time while the actual fetch takes very little time in comparison. The JIRA issue commits a patch to allow generating multiple segments in a single generate phase, however I was not able to make that work. How can I generate multiple segments in a single generate phase? I am using Nutch 1.7 on YARN 2.3.0; any help would be greatly appreciated. Thanks.
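For reference, the Generator option the later threads in this archive converge on is -maxNumSegments. A hedged sketch of the command line (paths and numbers are made up for illustration):

# generate up to 4 segments from a fetchlist capped at 200000 URLs in one pass
bin/nutch generate crawl/crawldb crawl/segments -topN 200000 -maxNumSegments 4 -numFetchers 4

As discussed further down in the archive, whether more than one segment actually comes out also depends on topN, generate.max.count / generate.count.mode, and on running in a real distributed mode (the local job runner always produces a single partition).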
Re: Depth option
Shadi, I am not sure what the case will be if example.com itself has external links; I think it will fetch those at depth 1. But if you want to disable the fetching of external links, just set the db.ignore.external.links property to true; you don't need any URL filter set up if you do so.

On Jan 4, 2015 10:37 AM, Shadi Saleh propat...@gmail.com wrote:

Thanks Adil, crawldb is not empty, it now contains old and current folders. Should I clean it before I start a new crawl? What is the proper way? Best

On Sun, Jan 4, 2015 at 4:28 PM, Adil Ishaque Abbasi aiabb...@gmail.com wrote:

Yes, you are correct, no need to use the URL filter. But this will work only if your crawldb remains empty. Regards Adil I. Abbasi

On Sun, Jan 4, 2015 at 8:22 PM, Shadi Saleh propat...@gmail.com wrote:

Hello, I want to check this point please. I am using crawl to crawl www.example.com with the depth=1 option. So if that website contains a URL to another website, e.g. www.example2.com, Nutch will not crawl it; is it enough to use the depth option or should I use a URL filter? Best

-- *Shadi Saleh, Ph.D Student, Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague, 16017 Prague 6 - Czech Republic, Mob +420773515578*
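The property referenced in the reply at the top of this thread appears elsewhere in this archive; in nutch-site.xml it looks like this:

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts will be ignored. This is an effective way to limit the crawl to include only initially injected hosts, without creating complex URLFilters.</description>
</property>

With this set, the seed hosts are crawled without any regex-urlfilter maintenance.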
Re: Nutch running time
Shani, what is your Nutch version and which Hadoop version are you using? I was able to get this running using Nutch 1.7 on Hadoop YARN, for which I needed to make minor tweaks in the code.

On Fri, Jan 2, 2015 at 12:37 PM, Chaushu, Shani shani.chau...@intel.com wrote:

I'm running nutch distributed, on 3 nodes... I thought there was more configuration that I missed..

-Original Message-
From: S.L [mailto:simpleliving...@gmail.com]
Sent: Thursday, January 01, 2015 18:28
To: user@nutch.apache.org
Subject: Re: Nutch running time

You need to run Nutch as a MapReduce job/application on Hadoop; there is a lot of info on the Wiki on making it run in distributed mode. But if you can live with the pseudo-distributed/local mode for the 20K pages that you need to fetch, it would save you a lot of work.

On Thu, Jan 1, 2015 at 8:32 AM, Chaushu, Shani shani.chau...@intel.com wrote:

How can I configure the number of map and reduce tasks? Which parameter is it? Will more map and reduce tasks make it slower or faster? Thanks

-Original Message-
From: Meraj A. Khan [mailto:mera...@gmail.com]
Sent: Thursday, January 01, 2015 15:17
To: user@nutch.apache.org
Subject: Re: Nutch running time

It seems kind of slow for 20k links; how many map and reduce tasks have you configured for each one of the phases in a Nutch crawl?

On Jan 1, 2015 6:00 AM, Chaushu, Shani shani.chau...@intel.com wrote:

Hi all, I wanted to know how long Nutch should run. I changed the configuration and ran distributed - one master node and 3 slaves - and it ran for 20k links for about a day (depth 15). Is that normal? Or should it take less? This is my configuration:

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts will be ignored. This is an effective way to limit the crawl to include only initially injected hosts, without creating complex URLFilters.</description>
</property>
<property>
  <name>db.max.outlinks.per.page</name>
  <value>1000</value>
  <description>The maximum number of outlinks that we'll process for a page. If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks will be processed for a page; otherwise, all outlinks will be processed.</description>
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>100</value>
  <description>The number of FetcherThreads the fetcher should use. This also determines the maximum number of requests that are made at once (each FetcherThread handles one connection). The total number of threads running in distributed mode will be the number of fetcher threads * number of nodes, as the fetcher has one map task per node.</description>
</property>
<property>
  <name>fetcher.queue.depth.multiplier</name>
  <value>150</value>
  <description>(EXPERT) The fetcher buffers the incoming URLs into queues based on the [host|domain|IP] (see param fetcher.queue.mode). The depth of the queue is the number of threads times the value of this parameter. A large value requires more memory but can improve the performance of the fetch when the order of the URLs in the fetch list is not optimal.</description>
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>10</value>
  <description>This number is the maximum number of threads that should be allowed to access a queue at one time. Setting it to a value > 1 will cause the Crawl-Delay value from robots.txt to be ignored and the value of fetcher.server.min.delay to be used as a delay between successive requests to the same server instead of fetcher.server.delay.</description>
</property>
<property>
  <name>fetcher.server.min.delay</name>
  <value>0.0</value>
  <description>The minimum number of seconds the fetcher will delay between successive requests to the same server. This value is applicable ONLY if fetcher.threads.per.queue is greater than 1 (i.e. the host blocking is turned off).</description>
</property>
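To make the "which parameter is it" question above concrete: in Nutch 1.x the parallelism mostly comes from Hadoop job settings rather than from Nutch itself. A sketch, assuming an MRv1-compatible setup (these are standard Hadoop property names, not Nutch-specific; mapred.map.tasks is only a hint, since the number of maps is driven by input splits and, for the fetch step, by the -numFetchers argument to generate; double-check the names against your Hadoop version):

<!-- nutch-site.xml or mapred-site.xml: map-task hint and reduce-task count for Nutch's MapReduce jobs -->
<property>
  <name>mapred.map.tasks</name>
  <value>12</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>6</value>
</property>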
Re: Question about db.default.fetch.interval.
Reposting my question.

Hi All, I have a quick question regarding the db.default.fetch.interval parameter. I have currently set it to 15 days, however my crawl cycle itself is going beyond 15 days and up to 30 days. Since I have set db.default.fetch.interval to only 15 days, is there a possibility that, even before a complete crawl has finished, an already fetched page will be re-fetched before an un-fetched page is fetched, thereby fetching a smaller number of distinct pages? I guess I am trying to find out whether setting db.default.fetch.interval to a value less than the time it takes to do one complete crawl of the web will lead to some kind of infinite loop, where recently fetched pages are re-fetched before the completely un-fetched ones because the value of the interval is less than the total crawl time. Thanks.

On Sun, Dec 28, 2014 at 11:18 AM, Meraj A. Khan mera...@gmail.com wrote:

Hi All, I have a quick question regarding the db.default.fetch.interval parameter. I have currently set it to 15 days, however my crawl cycle itself is going beyond 15 days and up to 30 days. Since I have set db.default.fetch.interval to only 15 days, is there a possibility that, even before a complete crawl has finished, an already fetched page will be re-fetched before an un-fetched page is fetched, thereby fetching a smaller number of distinct pages? I guess I am trying to find out whether db.default.fetch.interval should be set to at least be greater than one comprehensive crawl cycle time. Thanks.
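For context, the parameter being discussed is set in nutch-site.xml. A minimal sketch matching the 15-day value mentioned above (the newer, seconds-based equivalent is db.fetch.interval.default, so check which one your Nutch version actually reads):

<property>
  <name>db.default.fetch.interval</name>
  <value>15</value>
  <description>Sketch: default number of days before an already-fetched page becomes due for re-fetch.</description>
</property>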
Re: Nutch running time
It seems kind of slower for 20k links, how many map and reduce tasks ,have you configured for each one of the pahses in a Nutch crawl. On Jan 1, 2015 6:00 AM, Chaushu, Shani shani.chau...@intel.com wrote: Hi all, I wanted to know how long nutch should run. I change the configurations, and ran distributed - one master node and 3 slaves, and it for 20k links for about a day (depth 15). Is it normal? Or it should take less? This is my configurations: property namedb.ignore.external.links/name valuetrue/value descriptionIf true, outlinks leading from a page to external hosts will be ignored. This is an effective way to limit the crawl to include only initially injected hosts, without creating complex URLFilters. /description /property property namedb.max.outlinks.per.page/name value1000/value descriptionThe maximum number of outlinks that we'll process for a page. If this value is nonnegative (=0), at most db.max.outlinks.per.page outlinks will be processed for a page; otherwise, all outlinks will be processed. /description /property property namefetcher.threads.fetch/name value100/value descriptionThe number of FetcherThreads the fetcher should use. This is also determines the maximum number of requests that are made at once (each FetcherThread handles one connection). The total number of threads running in distributed mode will be the number of fetcher threads * number of nodes as fetcher has one map task per node. /description /property property namefetcher.queue.depth.multiplier/name value150/value description(EXPERT)The fetcher buffers the incoming URLs into queues based on the [host|domain|IP] see param fetcher.queue.mode). The depth of the queue is the number of threads times the value of this parameter. A large value requires more memory but can improve the performance of the fetch when the order of the URLS in the fetch list is not optimal. /description /property property namefetcher.threads.per.queue/name value10/value descriptionThis number is the maximum number of threads that should be allowed to access a queue at one time. Setting it to a value 1 will cause the Crawl-Delay value from robots.txt to be ignored and the value of fetcher.server.min.delay to be used as a delay between successive requests to the same server instead of fetcher.server.delay. /description /property property namefetcher.server.min.delay/name value0.0/value descriptionThe minimum number of seconds the fetcher will delay between successive requests to the same server. This value is applicable ONLY if fetcher.threads.per.queue is greater than 1 (i.e. the host blocking is turned off). /description /property property namefetcher.max.crawl.delay/name value5/value description If the Crawl-Delay in robots.txt is set to greater than this value (in seconds) then the fetcher will skip this page, generating an error report. If set to -1 the fetcher will never skip such pages and will wait the amount of time retrieved from robots.txt Crawl-Delay, however long that might be. /description /property - Intel Electronics Ltd. This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.
Re: nutch on amazon emr
I suggest running it with the stock bin/crawl script from the command line first, and then trying the jar that you mentioned.

On Jan 1, 2015 12:04 PM, Adil Ishaque Abbasi aiabb...@gmail.com wrote:

I tried to run it through a custom jar step using the script runner jar, i.e. s3://elasticmapreduce/libs/script-runner/script-runner.jar. Regards Adil I. Abbasi

On Thu, Jan 1, 2015 at 8:51 PM, Meraj A. Khan mera...@gmail.com wrote:

Can you give us the command that you use to start the crawl?

On Jan 1, 2015 10:28 AM, Adil Ishaque Abbasi aiabb...@gmail.com wrote:

When I try to run the Nutch crawl script on Amazon EMR, it gives me this error:

/mnt/var/lib/hadoop/steps/s-3VT1QRVSURPSH/./crawl: line 81: hdfs:///nutch/bin/nutch: No such file or directory
Command exiting with ret '0'

Though the nutch script is located at hdfs:///nutch/bin/, it still gives this error. Any idea what it is that I'm doing wrong? Regards Adil
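The error above is the shell trying to execute bin/nutch from an hdfs:// path, which bash cannot do; the crawl and nutch scripts have to live on the local filesystem of the node that runs them. A minimal local invocation of the stock crawl script, with made-up paths (the argument order mirrors the invocation shown in the bin/crawl thread later in this archive; stock 1.7 scripts may also expect a Solr URL, so check the usage message first):

# <seed dir on HDFS> <crawl dir on HDFS> <number of generate/fetch/update rounds>
/opt/nutch/deploy/bin/crawl /urls /crawldir 2

If that works from a shell on the master node, the EMR step can then point at the same local copy of the script.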
Re: nutch on amazon emr
Can you give us the command that you use to start the crawl? On Jan 1, 2015 10:28 AM, Adil Ishaque Abbasi aiabb...@gmail.com wrote: When I try to nutch crawl script on amazon emr, it gives me this error /mnt/var/lib/hadoop/steps/s-3VT1QRVSURPSH/./crawl: line 81: hdfs:///nutch/bin/nutch: No such file or directory Command exiting with ret '0' Though nutch script is located at hdfs:///nutch/bin/,still it gives this erorr. Any idea what is it that I'm doing wrong ? Regards Adil
Question about db.default.fetch.interval.
Hi All, I have a quick question regarding the db.default.fetch.interval parameter , I have currently set it to 15 days , however my crawl cycle itself is going beyond 15 days and upto 30 days , now I was not sure since I have set the db.default.fetch.interval to be only 15 days , is there a possibility that even before a complete crawl is completed , an already fetched page will get re-fetched before an un-fetched page is fetched and there by fetching less number of distinct pages. I guess, I am trying to know if db.default.fetch.interval be set to at-least be greater than one comprehensive crawl cycle time . Thanks.
Re: Nutch configuration - V1 vs V2 differences
I installed it by copying the files to conf directory, never tried without that step to confirm if the copying is really needed. On Nov 12, 2014 6:24 AM, mikejf12 i...@semtech-solutions.co.nz wrote: Hi I installed two version of Nutch on to a Centos 6 Linux Hadoop V1.2.1 cluster. I didnt have any issues in using them but I noticed a difference .. I installed the src version of apache-nutch-1.8-src, the instructions that I followed advised that the hadoop configuration files be copied to the nutch conf directory. I also installed the non source release apache-nutch-2.2.1, which didnt require this. Its been a while since I did this and I wondered whether the step to copy the hadoop config files was necessary for the src release ? cheers -- View this message in context: http://lucene.472066.n3.nabble.com/Nutch-configuration-V1-vs-V2-differences-tp4168893.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: When to delete the segments?
I am only indexing the parsed data in Solr, so there is no way for me to know when to delete a segment in an automated fashion by considering the parsed data alone. However, I just realized that there is a _SUCCESS file created within the segment once it is fetched; I will use that as an indicator to automate the deletion of the segment folders.

On Mon, Nov 3, 2014 at 12:56 AM, remi tassing tassingr...@gmail.com wrote:

If you are able to determine what is done with the parsed data, then you could delete the segment as soon as that job is completed. As I mentioned earlier, if the data is to be pushed to Solr (e.g. with bin/nutch solrindex http://127.0.0.1:8983/solr crawl/crawldb $SEGMENT), then after indexing is done you can get rid of the segment.

On Mon, Nov 3, 2014 at 12:16 PM, Meraj A. Khan mera...@gmail.com wrote:

Thanks. How do I definitively determine whether a segment has been completely parsed, if I were to set up an hourly crontab to delete the segments from HDFS? I have seen that the presence of the crawl_parse directory in the segment directory at least indicates that the parsing has started, but I think that directory would be created as soon as the parsing begins. So as not to delete a segment prematurely, while it is still being fetched, what should I be looking for in my script?

On Sun, Nov 2, 2014 at 7:58 PM, remi tassing tassingr...@gmail.com wrote:

The next fetch time is computed after updatedb is issued with that segment. So as long as you don't need the parsed data anymore, you can delete the segment (e.g. after indexing through Solr...).

On Mon, Nov 3, 2014 at 8:41 AM, Meraj A. Khan mera...@gmail.com wrote:

Hi All, I am deleting the segments as soon as they are fetched and parsed. I have read in previous posts that it is safe to delete a segment only if it is older than db.default.fetch.interval; my understanding is that one does not have to wait for the segment to be older than db.default.fetch.interval, but can delete it as soon as the segment is parsed. Is my understanding correct? I want to delete the segment as soon as possible so as to save as much disk space as possible. Thanks.
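A rough sketch of the cron-driven cleanup described above, using the _SUCCESS marker as the signal. All paths are hypothetical, the marker check should be adapted to whatever your crawl actually writes, and it should only run after the segment has been indexed and updatedb has been run against it:

#!/bin/bash
# Delete HDFS segments whose fetch/parse job has completed, keeping unfinished ones.
CRAWL_DIR=/crawldir/segments
for seg in $(hadoop fs -ls "$CRAWL_DIR" | awk '{print $NF}' | grep "$CRAWL_DIR/"); do
  # _SUCCESS (or crawl_parse) only shows up once the job for this segment finished
  if hadoop fs -test -e "$seg/_SUCCESS"; then
    echo "removing finished segment $seg"
    hadoop fs -rm -r "$seg"
  fi
done

Whether _SUCCESS or crawl_parse is the safer marker depends on the Nutch and Hadoop versions in play, which is exactly the question raised in this thread.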
Re: Reduce phase in Fetcher taking excessive time to finish.
Julien, Do we need to consider any data loss(URLs) in this scenario ? no, why? Thank you for confirming. J. On Thu, Oct 30, 2014 at 6:22 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi Meraj You can control the # of URLs per segment with property namegenerate.max.count/name value-1/value descriptionThe maximum number of urls in a single fetchlist. -1 if unlimited. The urls are counted according to the value of the parameter generator.count.mode. /description /property property namegenerate.count.mode/name valuehost/value descriptionDetermines how the URLs are counted for generator.max.count. Default value is 'host' but can be 'domain'. Note that we do not count per IP in the new version of the Generator. /description /property the urls are grouped into inputs for the map tasks accordingly. Julien On 26 October 2014 19:08, Meraj A. Khan mera...@gmail.com wrote: Julien, On further analysis , I found that it was not a delay at reduce time , but a long running fetch map task , when I have multiple fetch map tasks running on a single segment , I see that one of the map tasks runs for a excessively longer period of time than the other fetch map tasks ,it seems this is happening because of the disproportionate distribution of urls per map task, meaning if I have topN of 10,00,000 and 10 fetch map tasks , it seems its not guaranteed that each fetch map tasks will have 100,000 urls to fetch. Is is possible to set the an upper limit on the max number of URLs per fetch map task, along with the collective topN for the whole Fetch phase ? Thanks, Meraj. On Sat, Oct 18, 2014 at 2:28 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi Meraj, What do the logs for the map tasks tell you about the URLs being fetched? J. On 17 October 2014 19:08, Meraj A. Khan mera...@gmail.com wrote: Julien, Thanks for your suggestion , I looked at the jstack thread dumps , and I could see that the fetcher threads are in a waiting state and actually the map phase is not yet complete looking at the JobClient console. 14/10/15 12:09:48 INFO mapreduce.Job: map 95% reduce 31% 14/10/16 07:11:20 INFO mapreduce.Job: map 96% reduce 31% 14/10/17 01:20:56 INFO mapreduce.Job: map 97% reduce 31% And the following is the kind of statements I see in the jstack thread dump for Hadoop child processes, is it possible that these map tasks are actually waiting on a particular host with some excessive crawl-delay , I already had the fetcher.threads.per.queue to 5 , fetcher.server.delay to 0, fetcher.max.crawl.delay to 10 and http.max.delays to 1000 . Please see the jstack log info for the child processes below. 
Full thread dump Java HotSpot(TM) 64-Bit Server VM (24.51-b03 mixed mode): Attach Listener daemon prio=10 tid=0x7fecf8c58000 nid=0x32e5 waiting on condition [0x] java.lang.Thread.State: RUNNABLE IPC Client (638223659) connection to /170.75.153.162:40980 from job_1413149941617_0059 daemon prio=10 tid=0x01a5c000 nid=0xce8 in Object.wait() [0x7fecdf80e000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x99f8bf48 (a org.apache.hadoop.ipc.Client$Connection) at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:899) - locked 0x99f8bf48 (a org.apache.hadoop.ipc.Client$Connection) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:944) fetcher#5 daemon prio=10 tid=0x7fecf8c49000 nid=0xce7 in Object.wait() [0x7fecdf90f000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x99f62a68 (a org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl) at java.lang.Object.wait(Object.java:503) at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.getHost(ShuffleSchedulerImpl.java:368) - locked 0x99f62a68 (a org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:161) fetcher#4 daemon prio=10 tid=0x7fecf8c47000 nid=0xce6 in Object.wait() [0x7fecdfa1] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x99f62a68 (a org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl) at java.lang.Object.wait(Object.java:503
When to delete the segments?
Hi All, I am deleting the segments as soon as they are fetched and parsed. I have read in previous posts that it is safe to delete a segment only if it is older than db.default.fetch.interval; my understanding is that one does not have to wait for the segment to be older than db.default.fetch.interval, but can delete it as soon as the segment is parsed. Is my understanding correct? I want to delete the segment as soon as possible so as to save as much disk space as possible. Thanks.
Re: When to delete the segments?
Thanks . How do I definitively determine , if a segment has been completely parsed , if I were to set up a hourly crontab to delete the segments from HDFS? I have seen that the presence of the crawl_parse directory in the segments directory at least indicates that the parsing has started , but I think the directory would be created as soon as the parsing begins. So as to not delete the segments prematurely , while it is still being fetched , what should I be looking for in my script ? On Sun, Nov 2, 2014 at 7:58 PM, remi tassing tassingr...@gmail.com wrote: The next fetching time is computed after updatedb is isssued with that segment So as long as you don't need the parsed data anymore then you can delete the segment (e.g. after indexing through Solr...). On Mon, Nov 3, 2014 at 8:41 AM, Meraj A. Khan mera...@gmail.com wrote: Hi All, I am deleting the segments as soon as they are fetched and parsed , I have read in previous posts that it is safe to delete the segments only if it is older than the db.default.fetch.interval , my understanding is that one does have to wait for the segment to be older than db.default.fetch.interval, but can delete it as soon as the segment is parsed. Is my understanding correct ? I want to delete the segment as soon as possible so as to save as much disk space as possible. Thanks.
bin/Crawl script loosing status updates from the MR job.
Hi All, I am running the bin/crawl script that comes with Nutch 1.7 on Hadoop YARN, redirecting its output to a log file as shown below.

/opt/bitconfig/nutch/deploy/bin/crawl /urls crawldirectory 2000 > /tmp/nutch.log 2>&1

The issue I am facing is that, randomly, while running a job the script loses track of the progress updates (like "Map 80% Reduce 67%") and gets stuck there; in the meantime the job completes successfully and the script keeps waiting for further updates, and as a result the loop of generate-fetch-update jobs gets terminated prematurely. This is so random that I am not able to figure out a particular pattern to the issue, and I end up restarting the script every so often. Sometimes this happens in a job as short as the inject phase of Nutch. Just wondering if anyone has faced this issue? Is the fact that I am redirecting the output to a logfile playing a part in this? What are the best practices for running a long-running script like bin/crawl? I am using CentOS 7.x. Thanks.
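One common way to keep a multi-day crawl loop alive regardless of the controlling terminal is to detach it completely. A sketch, reusing the invocation above (nohup and the redirection are standard shell, nothing Nutch-specific):

nohup /opt/bitconfig/nutch/deploy/bin/crawl /urls crawldirectory 2000 > /tmp/nutch.log 2>&1 &
# later, follow progress without attaching to the process
tail -f /tmp/nutch.log

This does not by itself explain the lost "map X% reduce Y%" updates (those come from the Hadoop job client polling the cluster), but it at least removes terminal hangups and SSH disconnects from the list of suspects.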
Re: Reduce phase in Fetcher taking excessive time to finish.
Thanks for the info Julien.For the hypothetical example below topN 200,000 generate.max.count = 10,000 generate.count.mode = host If the number of hosts is 10 and let us assume that each one of those hosts has more than 10,000 unfetched URLs in CrawlDB , since we have set generate.max.count to 10,000 exactly 100,000 URLs would be fetched. Would the remaining URLs be fetched in the next phase cycle? Do we need to consider any data loss(URLs) in this scenario ? On Thu, Oct 30, 2014 at 6:22 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi Meraj You can control the # of URLs per segment with property namegenerate.max.count/name value-1/value descriptionThe maximum number of urls in a single fetchlist. -1 if unlimited. The urls are counted according to the value of the parameter generator.count.mode. /description /property property namegenerate.count.mode/name valuehost/value descriptionDetermines how the URLs are counted for generator.max.count. Default value is 'host' but can be 'domain'. Note that we do not count per IP in the new version of the Generator. /description /property the urls are grouped into inputs for the map tasks accordingly. Julien On 26 October 2014 19:08, Meraj A. Khan mera...@gmail.com wrote: Julien, On further analysis , I found that it was not a delay at reduce time , but a long running fetch map task , when I have multiple fetch map tasks running on a single segment , I see that one of the map tasks runs for a excessively longer period of time than the other fetch map tasks ,it seems this is happening because of the disproportionate distribution of urls per map task, meaning if I have topN of 10,00,000 and 10 fetch map tasks , it seems its not guaranteed that each fetch map tasks will have 100,000 urls to fetch. Is is possible to set the an upper limit on the max number of URLs per fetch map task, along with the collective topN for the whole Fetch phase ? Thanks, Meraj. On Sat, Oct 18, 2014 at 2:28 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi Meraj, What do the logs for the map tasks tell you about the URLs being fetched? J. On 17 October 2014 19:08, Meraj A. Khan mera...@gmail.com wrote: Julien, Thanks for your suggestion , I looked at the jstack thread dumps , and I could see that the fetcher threads are in a waiting state and actually the map phase is not yet complete looking at the JobClient console. 14/10/15 12:09:48 INFO mapreduce.Job: map 95% reduce 31% 14/10/16 07:11:20 INFO mapreduce.Job: map 96% reduce 31% 14/10/17 01:20:56 INFO mapreduce.Job: map 97% reduce 31% And the following is the kind of statements I see in the jstack thread dump for Hadoop child processes, is it possible that these map tasks are actually waiting on a particular host with some excessive crawl-delay , I already had the fetcher.threads.per.queue to 5 , fetcher.server.delay to 0, fetcher.max.crawl.delay to 10 and http.max.delays to 1000 . Please see the jstack log info for the child processes below. 
Full thread dump Java HotSpot(TM) 64-Bit Server VM (24.51-b03 mixed mode): Attach Listener daemon prio=10 tid=0x7fecf8c58000 nid=0x32e5 waiting on condition [0x] java.lang.Thread.State: RUNNABLE IPC Client (638223659) connection to /170.75.153.162:40980 from job_1413149941617_0059 daemon prio=10 tid=0x01a5c000 nid=0xce8 in Object.wait() [0x7fecdf80e000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x99f8bf48 (a org.apache.hadoop.ipc.Client$Connection) at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:899) - locked 0x99f8bf48 (a org.apache.hadoop.ipc.Client$Connection) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:944) fetcher#5 daemon prio=10 tid=0x7fecf8c49000 nid=0xce7 in Object.wait() [0x7fecdf90f000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x99f62a68 (a org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl) at java.lang.Object.wait(Object.java:503) at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.getHost(ShuffleSchedulerImpl.java:368) - locked 0x99f62a68 (a org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:161) fetcher#4 daemon prio=10 tid=0x7fecf8c47000 nid=0xce6 in Object.wait() [0x7fecdfa1] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method
Re: Generate multiple segments in Generate phase and have multiple Fetch map tasks in parallel.
Fred, In my last email on this topic , I mentioned that I am using a single segment and multiple fetch map tasks, and also the changes that I had to make to Nutch 1.7 to make it possible on YARN. Let me know if you cannot find it and I ll resend those again. Meraj. On Fri, Oct 24, 2014 at 4:17 PM, Fred frederic.luddeni+nab...@gmail.com wrote: Hi Mak, Please can you give me details on your changes please? I have the same issue. Thanks you in advance, Regards, -- View this message in context: http://lucene.472066.n3.nabble.com/Generate-multiple-segments-in-Generate-phase-and-have-multiple-Fetch-map-tasks-in-parallel-tp4161005p4165766.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Reduce phase in Fetcher taking excessive time to finish.
Julien, On further analysis , I found that it was not a delay at reduce time , but a long running fetch map task , when I have multiple fetch map tasks running on a single segment , I see that one of the map tasks runs for a excessively longer period of time than the other fetch map tasks ,it seems this is happening because of the disproportionate distribution of urls per map task, meaning if I have topN of 10,00,000 and 10 fetch map tasks , it seems its not guaranteed that each fetch map tasks will have 100,000 urls to fetch. Is is possible to set the an upper limit on the max number of URLs per fetch map task, along with the collective topN for the whole Fetch phase ? Thanks, Meraj. On Sat, Oct 18, 2014 at 2:28 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi Meraj, What do the logs for the map tasks tell you about the URLs being fetched? J. On 17 October 2014 19:08, Meraj A. Khan mera...@gmail.com wrote: Julien, Thanks for your suggestion , I looked at the jstack thread dumps , and I could see that the fetcher threads are in a waiting state and actually the map phase is not yet complete looking at the JobClient console. 14/10/15 12:09:48 INFO mapreduce.Job: map 95% reduce 31% 14/10/16 07:11:20 INFO mapreduce.Job: map 96% reduce 31% 14/10/17 01:20:56 INFO mapreduce.Job: map 97% reduce 31% And the following is the kind of statements I see in the jstack thread dump for Hadoop child processes, is it possible that these map tasks are actually waiting on a particular host with some excessive crawl-delay , I already had the fetcher.threads.per.queue to 5 , fetcher.server.delay to 0, fetcher.max.crawl.delay to 10 and http.max.delays to 1000 . Please see the jstack log info for the child processes below. Full thread dump Java HotSpot(TM) 64-Bit Server VM (24.51-b03 mixed mode): Attach Listener daemon prio=10 tid=0x7fecf8c58000 nid=0x32e5 waiting on condition [0x] java.lang.Thread.State: RUNNABLE IPC Client (638223659) connection to /170.75.153.162:40980 from job_1413149941617_0059 daemon prio=10 tid=0x01a5c000 nid=0xce8 in Object.wait() [0x7fecdf80e000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x99f8bf48 (a org.apache.hadoop.ipc.Client$Connection) at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:899) - locked 0x99f8bf48 (a org.apache.hadoop.ipc.Client$Connection) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:944) fetcher#5 daemon prio=10 tid=0x7fecf8c49000 nid=0xce7 in Object.wait() [0x7fecdf90f000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x99f62a68 (a org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl) at java.lang.Object.wait(Object.java:503) at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.getHost(ShuffleSchedulerImpl.java:368) - locked 0x99f62a68 (a org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:161) fetcher#4 daemon prio=10 tid=0x7fecf8c47000 nid=0xce6 in Object.wait() [0x7fecdfa1] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x99f62a68 (a org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl) at java.lang.Object.wait(Object.java:503) at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.getHost(ShuffleSchedulerImpl.java:368) - locked 0x99f62a68 (a org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl) at 
org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:161) fetcher#3 daemon prio=10 tid=0x7fecf8c45800 nid=0xce5 in Object.wait() [0x7fecdfb11000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x99f62a68 (a org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl) at java.lang.Object.wait(Object.java:503) at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.getHost(ShuffleSchedulerImpl.java:368) - locked 0x99f62a68 (a org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:161) fetcher#2 daemon prio=10 tid=0x7fecf8c31800 nid=0xce4 in Object.wait() [0x7fecdfc12000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x99f62a68 (a org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl) at java.lang.Object.wait(Object.java:503
Re: Reduce phase in Fetcher taking excessive time to finish.
=0xc69 waiting on condition [0x] java.lang.Thread.State: RUNNABLE Signal Dispatcher daemon prio=10 tid=0x7fecf8095000 nid=0xc68 runnable [0x] java.lang.Thread.State: RUNNABLE Finalizer daemon prio=10 tid=0x7fecf807e000 nid=0xc60 in Object.wait() [0x7fecec83c000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x99a1e040 (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135) - locked 0x99a1e040 (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151) at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:189) Reference Handler daemon prio=10 tid=0x7fecf807a000 nid=0xc5f in Object.wait() [0x7fecec93d000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x99aeb3e0 (a java.lang.ref.Reference$Lock) at java.lang.Object.wait(Object.java:503) at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133) - locked 0x99aeb3e0 (a java.lang.ref.Reference$Lock) main prio=10 tid=0x7fecf800f800 nid=0xc4f in Object.wait() [0x7fed00948000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x99f62a68 (a org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl) at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.waitUntilDone(ShuffleSchedulerImpl.java:443) - locked 0x99f62a68 (a org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl) at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:129) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163) VM Thread prio=10 tid=0x7fecf8077800 nid=0xc5e runnable GC task thread#0 (ParallelGC) prio=10 tid=0x7fecf8025800 nid=0xc54 runnable GC task thread#1 (ParallelGC) prio=10 tid=0x7fecf8027000 nid=0xc55 runnable GC task thread#2 (ParallelGC) prio=10 tid=0x7fecf8029000 nid=0xc56 runnable GC task thread#3 (ParallelGC) prio=10 tid=0x7fecf802b000 nid=0xc57 runnable GC task thread#4 (ParallelGC) prio=10 tid=0x7fecf802c800 nid=0xc58 runnable GC task thread#5 (ParallelGC) prio=10 tid=0x7fecf802e800 nid=0xc59 runnable GC task thread#6 (ParallelGC) prio=10 tid=0x7fecf8030800 nid=0xc5a runnable GC task thread#7 (ParallelGC) prio=10 tid=0x7fecf8032800 nid=0xc5b runnable VM Periodic Task Thread prio=10 tid=0x7fecf80af800 nid=0xc6c waiting on condition JNI global references: 255 On Thu, Oct 16, 2014 at 5:20 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi Meraj You could call jstack on the Java process a couple of times to see what it is busy doing, that will be a simple of way of checking that this is indeed the source of the problem. See https://issues.apache.org/jira/browse/NUTCH-1314 for a possible solution J. On 16 October 2014 06:08, Meraj A. Khan mera...@gmail.com wrote: Hi All, I am running into a situation where the reduce phase of the fetch job with parsing enabled at the time of fetch is taking excessively long amount of time , I have seen recommendations to filter the URLs based on length to avoid normalization related delays ,I am not filtering any URLs based on length , could that be an issue ? 
Can anyone share if they faced this issue and what the resolution was, I am running Nutch 1.7 on Hadoop YARN. The issue was previously inconclusively discussed here. http://markmail.org/message/p6dzvvycpfzbaugr#query:+page:1+mid:p6dzvvycpfzbaugr+state:results Thanks. -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble
Reduce phase in Fetcher taking excessive time to finish.
Hi All, I am running into a situation where the reduce phase of the fetch job, with parsing enabled at fetch time, is taking an excessively long time. I have seen recommendations to filter URLs based on length to avoid normalization-related delays; I am not filtering any URLs based on length, could that be the issue? Can anyone share whether they have faced this and what the resolution was? I am running Nutch 1.7 on Hadoop YARN. The issue was previously, and inconclusively, discussed here: http://markmail.org/message/p6dzvvycpfzbaugr#query:+page:1+mid:p6dzvvycpfzbaugr+state:results Thanks.
Re: Generated Segment Too Large
Markus, I have been using Nutch for a while, but I wasn't clear about this issue; thank you for reminding me that this is Nutch 101 :) I will go ahead and use topN as the segment size control mechanism, although I have one question regarding topN: if I have a topN value of 1000 and there are more than topN, let's say 2000, URLs unfetched at that point in time, the remaining 1000 would be addressed in the subsequent fetch cycle, meaning nothing is discarded or left unfetched?

On Tue, Oct 7, 2014 at 3:46 AM, Markus Jelsma markus.jel...@openindex.io wrote:

Hi - you have been using Nutch for some time already, so aren't you already familiar with the generate.max.count configuration directive, possibly combined with the -topN parameter for the Generator job? With generate.max.count the segment size depends on the number of distinct hosts or domains, so it is not really trustworthy; the topN parameter is really strict. Markus

-Original message-
From: Meraj A. Khan mera...@gmail.com
Sent: Tuesday 7th October 2014 5:54
To: user@nutch.apache.org
Subject: Generated Segment Too Large

Hi Folks, I am using Nutch 1.7 on Hadoop YARN. Right now there seems to be no way of controlling the segment size, and since a single segment is being created that is very large for the capacity of my Hadoop cluster (I have available storage of ~3TB), and since Hadoop generates the spill*.out files for this large segment, which gets fetched for days, I am running out of disk space. I figured that if the segment size were controlled, then the spill files for each segment would be deleted after the job for that segment completed, giving me more efficient use of the disk space. I would like to know how I can generate multiple segments of a certain size (or just a fixed number) at each depth iteration. Right now it looks like Generator.java needs to be modified, as it does not consider the number of segments; is that the right approach? If so, can you please give me a few pointers on what logic I should be changing? If this is not the right approach, I would be happy to know whether there is any way to control the number as well as the size of the generated segments using configuration/job submission parameters. Thanks for your help!
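The two directives Markus refers to appear (garbled) elsewhere in this archive; restored to their nutch-site.xml form they look like this, with illustrative values (10000 per host is only an example, not a recommendation):

<property>
  <name>generate.max.count</name>
  <value>10000</value>
  <description>The maximum number of urls in a single fetchlist. -1 if unlimited. The urls are counted according to the value of the parameter generator.count.mode.</description>
</property>
<property>
  <name>generate.count.mode</name>
  <value>host</value>
  <description>Determines how the URLs are counted for generator.max.count. Default value is 'host' but can be 'domain'. Note that we do not count per IP in the new version of the Generator.</description>
</property>

Combined with -topN on the generate command, topN caps the total size of the fetchlist while generate.max.count caps the contribution of any single host (or domain), which is why Markus calls topN the strict one.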
Generated Segment Too Large
Hi Folks, I am using Nutch 1.7 on Haddop YARN , right now there seems to be no way of controlling the segment size and since a single segment is being created which is very large for the capacity of my Hadoop cluster, I have a available storage of ~3TB , but since Hadoop generates the spill*.out files for this large segment which gets fetched for days ,I am running out of disk space. I figured , if the segment size were to be controlled then for each segment the spills files would be deleted after the job for that segment was completed, giving me a efficient use of the disk space. I would like to know how I can generate multiple segments of a certain size (or just fixed number )at each depth iteration . Right now , looks like the Generator.java does needs to be modified as it does not consider the number of segments , is that the right approach ? if so can you please give me a few pointers of what logic I should be changing , if this is not the right approach I would be happy to know if there is any way to control , the number as well as the size of the generated segments using the configuration/job submission parameters. Thanks for your help!
Re: Generate multiple segments in Generate phase and have multiple Fetch map tasks in parallel.
Just wanted to update and let everyone know that this issue with single map task for fetch was occurring because Generator.java had logic around MRV1 property *mapred.job.tracker*, I had to change that logic and as I am running this on YARN and now multiple fetch tasks operate on a single segment. Also I misunderstood that multiple segments would need to be generated to achieve parallelism , it does not seem to be the case , parallelism at fetch time is achieved by having multiple fetch tasks operate on a single segment. Thanks everyone for your help on resolving this issue. On Wed, Sep 24, 2014 at 6:14 PM, Meraj A. Khan mera...@gmail.com wrote: Folks, As mentioned previously , I am running Nutch 1.7 on a Apache Hadoop YARN cluster . In order to scale I would need to Fetch concurrently with multiple map tasks on multiple nodes ,I think that the first step to do so would be to generate multiple segments in the generate phase so that multiple fetch map tasks can operate in parallel and in order to generate multiple segments at Generate time I have made the following changes , but unfortunately I have been unsuccessful in doing so. I have tweaked the following parameters in bin/crawl to do so . added the *maxNumSegments* and *numFetchers* parameters in the call to generate in *bin/crawl *script as can be seen below. *$bin/nutch generate $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments -maxNumSegments $numFetchers -numFetchers $numFetchers -noFilter* (Here $numFetchers has a value of 15) The *generate.max.count* and *generate.count.mode* and *topN* are all default values , meaning I am not providing any values for them. Also the crawldb status before the Generate phase is as shown below , it shows that the number of unfetched URLs is more than *75 million* , so its not that there are not enough urls for Generate to generate multiple segments. * CrawlDB status* * db_fetched=318708* * db_gone=4774* * db_notmodified=2274* * db_redir_perm=2253* * db_redir_temp=2527* * db_unfetched=7524* However I do see this message in the logs consistently during the generate phase. *Generator: jobtracker is 'local', generating exactly one partition.* is this one partition referring to the the single segment that is going to be generated ? If so how do I address this. I feel like I have exhausted all the options but I am unable to have the Generate phase generate more than one segment at a time. Can someone let me know if there is anything else that I should be trying here ? *Thanks and any help is much appreciated!*
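For readers hitting the same "Generator: jobtracker is 'local', generating exactly one partition." message: the logic described above is a small guard in Generator.java that collapses the request to a single fetchlist when it thinks it is running under the local job runner. This is only a paraphrased sketch of that check, not a patch; the surrounding names (numFetchers, job, LOG) are stand-ins and the exact code differs between Nutch versions:

// Sketch of the MRv1-era check in org.apache.nutch.crawl.Generator
int numLists = numFetchers; // value of the -numFetchers argument
if ("local".equals(job.get("mapred.job.tracker")) && numLists != 1) {
  // The local job runner cannot partition into multiple fetchlists.
  // On YARN the legacy mapred.job.tracker property is often unset and
  // defaults to "local", so this guard can misfire on a real cluster.
  LOG.info("Generator: jobtracker is 'local', generating exactly one partition.");
  numLists = 1;
}

That misfire is consistent with the log line quoted in the next message and with the fix described above (adjusting the check when running on YARN).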
Generate multiple segments in Generate phase and have multiple Fetch map tasks in parallel.
Folks, As mentioned previously , I am running Nutch 1.7 on a Apache Hadoop YARN cluster . In order to scale I would need to Fetch concurrently with multiple map tasks on multiple nodes ,I think that the first step to do so would be to generate multiple segments in the generate phase so that multiple fetch map tasks can operate in parallel and in order to generate multiple segments at Generate time I have made the following changes , but unfortunately I have been unsuccessful in doing so. I have tweaked the following parameters in bin/crawl to do so . added the *maxNumSegments* and *numFetchers* parameters in the call to generate in *bin/crawl *script as can be seen below. *$bin/nutch generate $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments -maxNumSegments $numFetchers -numFetchers $numFetchers -noFilter* (Here $numFetchers has a value of 15) The *generate.max.count* and *generate.count.mode* and *topN* are all default values , meaning I am not providing any values for them. Also the crawldb status before the Generate phase is as shown below , it shows that the number of unfetched URLs is more than *75 million* , so its not that there are not enough urls for Generate to generate multiple segments. * CrawlDB status* * db_fetched=318708* * db_gone=4774* * db_notmodified=2274* * db_redir_perm=2253* * db_redir_temp=2527* * db_unfetched=7524* However I do see this message in the logs consistently during the generate phase. *Generator: jobtracker is 'local', generating exactly one partition.* is this one partition referring to the the single segment that is going to be generated ? If so how do I address this. I feel like I have exhausted all the options but I am unable to have the Generate phase generate more than one segment at a time. Can someone let me know if there is anything else that I should be trying here ? *Thanks and any help is much appreciated!*
Re: get generated segments from step / fetch all empty segments
Hi Edoardo, How do you generate the multiple segments during the generate phase? On Sep 22, 2014 6:01 AM, Edoardo Causarano edoardo.causar...@gmail.com wrote: Hi all, I’m building an Oozie workflow to schedule the generate, fetch, etc. workflow. Right now I'm trying to feed the list of generated segments into the following fetch stage. The “crawl” script assumes that the most recently added segment is unfetched and does some HDFS shell scripting to determine its name and stuff it into a shell variable, but I’d like to avoid this and somehow feed the list of generated segments directly into the following step. I have the feeling that I could use the Oozie “capture data from action” option, but I think that will require fiddling with the Generator class source; that’s OK, but I’m a bit wary of adding custom code that may not be part of the core distribution. Has anyone already done something similar, preferably without touching the source? (e.g. http://qnalist.com/questions/2330221/nutch-oozie-and-elasticsearch but it now 404s on GitHub) Best, Edoardo -- Edoardo Causarano Sent with Airmail
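For context, the HDFS shell scripting that the crawl script relies on boils down to listing the timestamped segment directories and keeping the newest one; a rough sketch (not the script's exact code; $CRAWL_PATH is a placeholder):
  # segment directories are named by timestamp, so the lexicographically
  # largest entry is the most recently generated one
  SEGMENT=$(hadoop fs -ls "$CRAWL_PATH/segments" \
    | grep "$CRAWL_PATH/segments/" | awk '{print $NF}' | sort | tail -n 1)
  echo "latest segment: $SEGMENT"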
RE: get generated segments from step / fetch all empty segments
Markus, I have used maxNumSegments but no luck; is it driven by the size of the segment instead? On Sep 22, 2014 9:28 AM, Markus Jelsma markus.jel...@openindex.io wrote: You can use maxNumSegments to generate more than one segment. And instead of passing a list of segment names around, why not just loop over the entire directory, and move finished segments to another. -Original message- From:Edoardo Causarano edoardo.causar...@gmail.com Sent: Monday 22nd September 2014 15:25 To: user@nutch.apache.org Subject: Re: get generated segments from step / fetch all empty segments Hi Meraj, at the moment I’m not, but in the Generator job class the method “generate” does return a list of Paths, therefore the possibility is there (somehow). For now I’m concentrating on passing at least one segment name from one step to the other; then I’ll see if and how I can get more. Best, Edoardo On 22 September 2014 at 14:50:03, Meraj A. Khan (mera...@gmail.com) wrote: Hi Edoardo, How do you generate the multiple segments during the generate phase? On Sep 22, 2014 6:01 AM, Edoardo Causarano edoardo.causar...@gmail.com wrote: Hi all, I’m building an Oozie workflow to schedule the generate, fetch, etc. workflow. Right now I'm trying to feed the list of generated segments into the following fetch stage. The “crawl” script assumes that the most recently added segment is unfetched and does some HDFS shell scripting to determine its name and stuff it into a shell variable, but I’d like to avoid this and somehow feed the list of generated segments directly into the following step. I have the feeling that I could use the Oozie “capture data from action” option, but I think that will require fiddling with the Generator class source; that’s OK, but I’m a bit wary of adding custom code that may not be part of the core distribution. Has anyone already done something similar, preferably without touching the source? (e.g. http://qnalist.com/questions/2330221/nutch-oozie-and-elasticsearch but it now 404s on GitHub) Best, Edoardo -- Edoardo Causarano Sent with Airmail -- Edoardo Causarano Sent with Airmail
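Markus's suggestion to loop over the whole segments directory and move finished segments aside could look roughly like the sketch below (not from the thread; directory names and variables are placeholders):
  # fetch every segment found under segments/, then park it under segments_done/
  hadoop fs -mkdir -p "$CRAWL_PATH/segments_done"
  for SEGMENT in $(hadoop fs -ls "$CRAWL_PATH/segments" \
      | grep "$CRAWL_PATH/segments/" | awk '{print $NF}'); do
    bin/nutch fetch "$SEGMENT" -threads "$numThreads"
    hadoop fs -mv "$SEGMENT" "$CRAWL_PATH/segments_done/"
  done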
RE: get generated segments from step / fetch all empty segments
Thanks Markus, is that enough driven by the HDFS block size? Edoardo, sorry for hijacking your thread. :( On Sep 22, 2014 9:35 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi - It will only generate more segments when there are enough URL's to generate combined with either topN or generate.count.mode and generate.max.count. -Original message- From:Meraj A. Khan mera...@gmail.com Sent: Monday 22nd September 2014 15:33 To: user@nutch.apache.org Subject: RE: get generated segments from step / fetch all empty segments Markus, I have used the maxnum segments but no luck, is it driven by the size of the segment instead ? On Sep 22, 2014 9:28 AM, Markus Jelsma markus.jel...@openindex.io wrote: You can use maxNumSegments to generate more than one segment. And instead of passing a list of segment names around, why not just loop over the entire directory, and move finished segments to another. -Original message- From:Edoardo Causarano edoardo.causar...@gmail.com Sent: Monday 22nd September 2014 15:25 To: user@nutch.apache.org Subject: Re: get generated segments from step / fetch all empty segments Hi Meraj, at the moment I’m not, but in the Generator job class the method “generate” does return a list of Paths therefore the possibility is there (somehow.) For now I’m concentrating on passing at least 1 segment name from one step to the other, then I’ll see if and how I can get more. Best, Edoardo On 22 september 2014 at 14:50:03, Meraj A. Khan (mera...@gmail.com) wrote: Hi Edoardo, How do you generate the multiple segments at the time of generate phase? On Sep 22, 2014 6:01 AM, Edoardo Causarano edoardo.causar...@gmail.com wrote: Hi all, I’m building an Oozie workflow to schedule the generate, fetch, etc… workflow. Right now I'm trying to feed the list of generated segments into the following fetch stage. The “crawl” script assumes that the most recently added segment is un-fetched and does some hdfs shell scripting to determine its name and stuff this into a shell variable, but I’d like to avoid this and somehow feed the list of generated segments directly into the following step. I have the feeling that I could use the ooze “capture data from action” option but I think that will require fiddling with the Generator class source; that’s ok but I’m a bit weary of adding custom code that may not be part of the core distribution. Has anyone already done something similar, preferably without touching the source? (e.g. http://qnalist.com/questions/2330221/nutch-oozie-and-elasticsearch but it now 404s on GitHub) Best, Edoardo -- Edoardo Causarano Sent with Airmail -- Edoardo Causarano Sent with Airmail
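To illustrate the combination Markus describes (maxNumSegments together with topN, or generate.count.mode plus generate.max.count), a generate call might look like the sketch below; all numbers and paths are illustrative only:
  # cap each segment at 50k URLs per host so that, with enough eligible URLs,
  # the generator can spread them over up to 10 segments
  bin/nutch generate -D generate.count.mode=host -D generate.max.count=50000 \
    "$CRAWL_PATH/crawldb" "$CRAWL_PATH/segments" \
    -topN 1000000 -numFetchers 10 -maxNumSegments 10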
Re: Running multiple fetch map tasks on a Hadoop Cluster.
Julien, How would you achieve parallelism then on a Hadoop cluster? Am I missing something here? My understanding was that we could scale the crawl by allowing fetch to happen in multiple map tasks on multiple nodes in a Hadoop cluster; otherwise I am stuck sequentially crawling a large set of URLs spread across multiple domains. If that is indeed the way to scale the crawl, then we would need to generate multiple segments at generate time so that these could be fetched in parallel. So I guess I really need help with: 1. Making the generate phase generate multiple segments. 2. Being able to fetch these segments in parallel. Can you please let me know if my approach to scaling the crawl sounds right to you? Thanks, I much appreciate all the help I have gotten so far. On Fri, Sep 19, 2014 at 10:40 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: The fetching operates segment by segment and won't fetch more than one at the same time. You can get the generation step to build multiple segments in one go but you'd need to modify the script so that the fetching step is called as many times as you have segments + you'd probably need to add some logic for detecting that they've all finished before you move on to the update step. Out of curiosity : why do you want to fetch multiple segments at the same time? On 19 September 2014 06:00, Meraj A. Khan mera...@gmail.com wrote: Hello Folks, I am unable to run multiple fetch map tasks for Nutch 1.7 on Hadoop YARN. Based on Julien's suggestion I am using the bin/crawl script and made the following tweaks to trigger a fetch with multiple map tasks, however I have been unable to do so. 1. Added maxNumSegments and numFetchers parameters to the generate phase.
$bin/nutch generate $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments -maxNumSegments $numFetchers -numFetchers $numFetchers -noFilter
2. Removed the topN parameter and removed the noParsing parameter, because I want the parsing to happen at the time of fetch.
$bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch $CRAWL_PATH/segments/$SEGMENT -threads $numThreads #-noParsing#
The generate phase is not generating more than one segment, and as a result the fetch phase is not creating multiple map tasks. Also, I believe the way the script is written it does not allow the fetch to fetch multiple segments in parallel even if generate were to produce multiple segments. Can someone please let me know how they got the script to run on a distributed Hadoop cluster? Or is there a different version of the script that should be used? Thanks. -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble
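Julien's suggestion (call the fetch step once per generated segment and only move on to the update step when all of them have finished) could be sketched in the crawl script roughly as follows; this is an assumption-laden sketch, with $CRAWL_PATH and $numThreads taken from the script:
  # one fetch job per segment, submitted concurrently; wait for all before updatedb
  for SEGMENT in $(hadoop fs -ls "$CRAWL_PATH/segments" \
      | grep "$CRAWL_PATH/segments/" | awk '{print $NF}'); do
    bin/nutch fetch "$SEGMENT" -threads "$numThreads" &
  done
  wait  # block until every fetch job has finished
  bin/nutch updatedb "$CRAWL_PATH/crawldb" -dir "$CRAWL_PATH/segments"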
Re: Running multiple fetch map tasks on a Hadoop Cluster.
Jake, I am not sure how to make that happen, every time I run the nutch 1.7 job on YARN , I see a single segment being generated a nd a single map task bein launched,underutilizing the capacity of the cluster and slowing the crawl. Are you suggesting I should be seeing multiple fetch map tasks for a single segment, if so I am not. Thanks. On Sep 19, 2014 5:13 PM, Jake Dodd j...@ontopic.io wrote: Hi Meraj, Nutch and Hadoop abstract all of that for you, so you don’t need to worry about it. When you execute the fetch command for a segment, it will be parallelized across the nodes in your cluster. Cheers Jake On Sep 19, 2014, at 1:52 PM, Meraj A. Khan mera...@gmail.com wrote: Julien, How would you achieve parallelism then on a Hadoop cluster , am I missing something here? My understanding was that we could scale the crawl by allowing fetch to happen in multiple map tasks in multiple nodes in a Hadoop cluster , otherwise I am stuck in sequentially crawling a large set of urls spread across mutiple domains. If that is indeed the way to scale the crawl , then we would need to generate multiple segments at the generate time so that these could be fetched in paralle. So I guess I really need help in . 1. Making the generate phase generate multiple segments 2. Being able to fetch these segments in parallel. Can you please let me know if my approach to scale the crawl sounds right to you ? Thanks and much appreciated, all the help I have gotten so far On Fri, Sep 19, 2014 at 10:40 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: The fetching operates segment by segment and won't fetch more than one at the same time. You can get the generation step to build multiple segments in one go but you'd need to modify the script so that the fetching step is called as many times as you have segments + you'd probably need to add some logic for detecting that they've all finished before you move on to the update step. Out of curiosity : why do you want to fetch multiple segments at the same time? On 19 September 2014 06:00, Meraj A. Khan mera...@gmail.com wrote: Hello Folks, I am unable to run multiple fetch Map taks for Nutch 1.7 on Hadoop YARN. Based on Julien's suggestion I am using the bin/crawl script and did the following tweaks to trigger a fetch with multiple map tasks , however I am unable to do so. 1. Added maxNumSegments and numFetchers parameters to the generate phase. $bin/nutch generate $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments -maxNumSegments $numFetchers -numFetchers $numFetchers -noFilter 2. Removed the topN paramter and removed the noParsing parameter because I want the parsing to happen at the time of fetch. $bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch $CRAWL_PATH/segments/$SEGMENT -threads $numThreads #-noParsing# The generate phase is not generating more than one segment. And as a result the fetch phase is not creating multiple map tasks, also I belive the way the script is written it does not allow the fecth to fecth multiple segements in parallel even if the generate were to generate multiple segments. Can someone please let me know , how they go the script to run in a distributed Hadoop cluster ? Or if there is a different version of script that should be used? Thanks. -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble
Re: Fetch Job Started Failing on Hadoop Cluster
Markus, Thanks, the issue I was setting the PATH variable in the bin/crawl script and once I removed it and set it outside of the bin/crawl script , it started working fine now. On Tue, Sep 16, 2014 at 6:39 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi - you made Nutch believe that hdfs://server1.mydomain.com:9000/user/df/crawldirectory/segments/ is a segment, but it is not. So either no segment was created or written to the wrong location. I don't know what kind of script you are using but you should check the return code of the generator, if gives a -1 for no segment created. Markus -Original message- From:Meraj A. Khan mera...@gmail.com mailto:mera...@gmail.com Sent: Monday 15th September 2014 7:02 To: user@nutch.apache.org mailto:user@nutch.apache.org Subject: Fetch Job Started Failing on Hadoop Cluster Hello Folks, My Nutch crawl which was running fine , started failing in the first Fetch Job/Application, I am unable to figure out whats going on here, I have attached the last snippet of the log below , can some please let me know whats going on here ? What I noticed is that even though the generate phase created a segment 20140915004940 , the fetch phase is only looking up to the segments directory for the segments. Thanks. 14/09/15 00:50:07 INFO crawl.Generator: Generator: finished at 2014-09-15 00:50:07, elapsed: 00:00:59 ls: cannot access crawldirectory/segments/: No such file or directory Operating on segment : Fetching : 14/09/15 00:50:09 INFO fetcher.Fetcher: Fetcher: starting at 2014-09-15 00:50:09 14/09/15 00:50:09 INFO fetcher.Fetcher: Fetcher: segment: crawldirectory/segments 14/09/15 00:50:09 INFO fetcher.Fetcher: Fetcher Timelimit set for : 1410767409664 Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /opt/hadoop-2.3.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now. It's highly recommended that you fix the library with 'execstack -c libfile', or link it with '-z noexecstack'. 14/09/15 00:50:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 14/09/15 00:50:10 INFO client.RMProxy: Connecting to ResourceManager at server1.mydomain.com/170.75.152.162:8040 14/09/15 00:50:10 INFO client.RMProxy: Connecting to ResourceManager at server1.mydomain.com/170.75.152.162:8040 14/09/15 00:50:12 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/df/.staging/job_1410742329411_0010 14/09/15 00:50:12 WARN security.UserGroupInformation: PriviledgedActionException as:df (auth:SIMPLE) cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs:// server1.mydomain.com:9000/user/df/crawldirectory/segments/crawl_generate 14/09/15 00:50:12 WARN security.UserGroupInformation: PriviledgedActionException as:df (auth:SIMPLE) cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs:// server1.mydomain.com:9000/user/df/crawldirectory/segments/crawl_generate 14/09/15 00:50:12 ERROR fetcher.Fetcher: Fetcher: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs:// server1.mydomain.com:9000/user/df/crawldirectory/segments/crawl_generate at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251) at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45) at org.apache.nutch.fetcher.Fetcher$InputFormat.getSplits(Fetcher.java:108) at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:520) at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:512) at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394) at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285) at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282) at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562) at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833) at
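Markus's advice above, to check the generator's return code before moving on to the fetch step, could be scripted along these lines (a sketch only; paths are placeholders, and the generator's "no segment created" result shows up as a non-zero exit status):
  bin/nutch generate "$CRAWL_PATH/crawldb" "$CRAWL_PATH/segments"
  RC=$?
  if [ $RC -ne 0 ]; then
    echo "Generator created no segment (exit code $RC); skipping the fetch step" >&2
    exit 1
  fi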
Re: Nutch 1.7 fetch happening in a single map task.
AFAIK, the script does not go by the mode you set , but the presence of the *nutch*.job file in the a directory a level above script it self i. ../*.job. Can you please check if you have the Hadoop job file at the appropriate location? On Mon, Sep 8, 2014 at 9:22 AM, Simon Z simonz.nu...@gmail.com wrote: Thank you very Meraj for your reply, I also thought it's a typo. I had set the numFetchers via numSlaves, and the echo of generator showed that numFetcher is 8 (numTasks=`expr $numSlaves \* 2` , that is 4 by 2), but the output of generator showed that the run mode is local and generate exact one mapper, although I had changed mode=distributed, any idea about this please? Many regards, Simon On Mon, Sep 8, 2014 at 7:18 AM, Meraj A. Khan mera...@gmail.com wrote: I think that is a typo , and it is actually CrawlDirectory. For the single map task issue although I have not tried it yet,but we can control the number of fetchers by numFetchers parameter when doing the generate via the bin/generate. On Sep 7, 2014 9:23 AM, Simon Z simonz.nu...@gmail.com wrote: Hi Julien, What do you mean by crawlID please? I am using nutch 1.8 and follow the instruction in the tutorial as mentioned before, and seems have a similar situation, that is, fetch runs on only one map task. I am running on a cluster of four nodes on hadoop 2.4.1. Notice that the map task can be assigned to any node, but only one map each round. I have set numSlaves=4 mode=distributed The seed url list includes five different websites from different host. Is there any settings I missed out? Thanks in advance. Regards, Simon On Fri, Aug 29, 2014 at 10:39 PM, Julien Nioche lists.digitalpeb...@gmail.com wrote: No, just do 'bin/crawl seedDir crawlID solrURL numberOfRounds' from the master node. It internally calls the nutch script for the individual commands, which takes care of sending the job jar to your hadoop cluster, see https://github.com/apache/nutch/blob/trunk/src/bin/nutch#L271 On 29 August 2014 15:24, S.L simpleliving...@gmail.com wrote: Sorry Julien , I overlooked the directory names. My understanding is that the Hadoop Job is submitted to a cluster by using the following command on the RM node bin/hadoop .job file params Are you suggesting I submit the script instead of the Nutch .job jar like below? bin/hadoop bin/crawl seedDir crawlID solrURL numberOfRounds On Fri, Aug 29, 2014 at 10:01 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: As the name runtime/deploy suggest - it is used exactly for that purpose ;-) Just make sure HADOOP_HOME/bin is added to the path and run the script, that's all. Look at the bottom of the nutch script for details. Julien PS: there will be a Nutch tutorial at the forthcoming ApacheCon EU ( http://sched.co/1pbE15n) were we'll cover things like these On 29 August 2014 14:30, S.L simpleliving...@gmail.com wrote: Thanks, can this be used on a hadoop cluster? Sent from my HTC - Reply message - From: Julien Nioche lists.digitalpeb...@gmail.com To: user@nutch.apache.org user@nutch.apache.org Subject: Nutch 1.7 fetch happening in a single map task. Date: Fri, Aug 29, 2014 9:00 AM See http://wiki.apache.org/nutch/NutchTutorial#A3.3._Using_the_crawl_script just go to runtime/deploy/bin and run the script from there. Julien On 29 August 2014 13:38, Meraj A. Khan mera...@gmail.com wrote: Hi Julien, I have 15 domains and they are all being fetched in a single map task which does not fetch all the urls no matter what depth or topN i give. 
I am submitting the Nutch job jar which seems to be using the Crawl.java class, how do I use the Crawl script on a Hadoop cluster, are there any pointers you can share? Thanks. On Aug 29, 2014 4:40 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi Meraj, The generator will place all the URLs in a single segment if all they belong to the same host for politeness reason. Otherwise it will use whichever value is passed with the -numFetchers parameter in the generation step. Why don't you use the crawl script in /bin instead of tinkering with the (now deprecated) Crawl class? It comes with a good default configuration and should make your life easier. Julien
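To confirm the deploy-mode setup described above (the nutch script looks for a job file one level above itself, i.e. ../*.job), one way is to run the crawl from runtime/deploy; a sketch in which the directories, Solr URL and round count are placeholders and the job file name may differ by version:
  cd "$NUTCH_HOME/runtime/deploy"
  ls apache-nutch-*.job                 # the job file bin/nutch expects at ../*.job
  export PATH="$HADOOP_HOME/bin:$PATH"  # hadoop must be on the PATH as well
  bin/crawl urls crawl http://localhost:8983/solr/ 2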
Re: Nutch 1.7 fetch happening in a single map task.
I think that is a typo , and it is actually CrawlDirectory. For the single map task issue although I have not tried it yet,but we can control the number of fetchers by numFetchers parameter when doing the generate via the bin/generate. On Sep 7, 2014 9:23 AM, Simon Z simonz.nu...@gmail.com wrote: Hi Julien, What do you mean by crawlID please? I am using nutch 1.8 and follow the instruction in the tutorial as mentioned before, and seems have a similar situation, that is, fetch runs on only one map task. I am running on a cluster of four nodes on hadoop 2.4.1. Notice that the map task can be assigned to any node, but only one map each round. I have set numSlaves=4 mode=distributed The seed url list includes five different websites from different host. Is there any settings I missed out? Thanks in advance. Regards, Simon On Fri, Aug 29, 2014 at 10:39 PM, Julien Nioche lists.digitalpeb...@gmail.com wrote: No, just do 'bin/crawl seedDir crawlID solrURL numberOfRounds' from the master node. It internally calls the nutch script for the individual commands, which takes care of sending the job jar to your hadoop cluster, see https://github.com/apache/nutch/blob/trunk/src/bin/nutch#L271 On 29 August 2014 15:24, S.L simpleliving...@gmail.com wrote: Sorry Julien , I overlooked the directory names. My understanding is that the Hadoop Job is submitted to a cluster by using the following command on the RM node bin/hadoop .job file params Are you suggesting I submit the script instead of the Nutch .job jar like below? bin/hadoop bin/crawl seedDir crawlID solrURL numberOfRounds On Fri, Aug 29, 2014 at 10:01 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: As the name runtime/deploy suggest - it is used exactly for that purpose ;-) Just make sure HADOOP_HOME/bin is added to the path and run the script, that's all. Look at the bottom of the nutch script for details. Julien PS: there will be a Nutch tutorial at the forthcoming ApacheCon EU ( http://sched.co/1pbE15n) were we'll cover things like these On 29 August 2014 14:30, S.L simpleliving...@gmail.com wrote: Thanks, can this be used on a hadoop cluster? Sent from my HTC - Reply message - From: Julien Nioche lists.digitalpeb...@gmail.com To: user@nutch.apache.org user@nutch.apache.org Subject: Nutch 1.7 fetch happening in a single map task. Date: Fri, Aug 29, 2014 9:00 AM See http://wiki.apache.org/nutch/NutchTutorial#A3.3._Using_the_crawl_script just go to runtime/deploy/bin and run the script from there. Julien On 29 August 2014 13:38, Meraj A. Khan mera...@gmail.com wrote: Hi Julien, I have 15 domains and they are all being fetched in a single map task which does not fetch all the urls no matter what depth or topN i give. I am submitting the Nutch job jar which seems to be using the Crawl.java class, how do I use the Crawl script on a Hadoop cluster, are there any pointers you can share? Thanks. On Aug 29, 2014 4:40 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi Meraj, The generator will place all the URLs in a single segment if all they belong to the same host for politeness reason. Otherwise it will use whichever value is passed with the -numFetchers parameter in the generation step. Why don't you use the crawl script in /bin instead of tinkering with the (now deprecated) Crawl class? It comes with a good default configuration and should make your life easier. Julien On 28 August 2014 06:47, Meraj A. 
Khan mera...@gmail.com wrote: Hi All, I am running Nutch 1.7 on Hadoop 2.3.0 cluster and and I noticed that there is only a single reducer in the generate partition job. I am running in a situation where the subsequent fetch is only running in a single map task (I believe as a consequence of a single reducer in the earlier phase). How can I force Nutch to do fetch in multiple map tasks , is there a setting to force more than one reducers in the generate-partition job to have more map tasks ?. Please also note that I have commented out the code in Crawl.java to not do the LInkInversion phase as , I dont need the scoring of the URLS that Nutch crawls, every URL is equally important to me. Thanks. -- Open Source Solutions for Text Engineering http
Re: Nutch 1.7 fetch happening in a single map task.
Julien, Thank you for the decisive advice. Using the crawl script seems to have solved the problem of abrupt termination of the crawl; the bin/crawl script respects the depth and topN parameters and iterates accordingly. However, I have an issue with the number of map tasks being used for the fetch phase: it is always 1. I see that the script sets the numFetchers parameter at generate time equal to the number of slaves, which is 3 in my case, yet only a single map task is being used, under-utilizing my Hadoop cluster and slowing down the crawl. I also see that in the crawldb update phase there are millions of 'db_unfetched' URLs, yet the generate phase only creates a single segment with about 20-30k URLs, and as a result only a single map task is used for the fetch phase. I guess I need to make the generate phase produce more than one segment; how do I do that using the bin/crawl script? Please note that this is for Nutch 1.7 on Hadoop 2.3.0. Thanks. On Fri, Aug 29, 2014 at 10:39 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: No, just do 'bin/crawl seedDir crawlID solrURL numberOfRounds' from the master node. It internally calls the nutch script for the individual commands, which takes care of sending the job jar to your hadoop cluster, see https://github.com/apache/nutch/blob/trunk/src/bin/nutch#L271 On 29 August 2014 15:24, S.L simpleliving...@gmail.com wrote: Sorry Julien , I overlooked the directory names. My understanding is that the Hadoop Job is submitted to a cluster by using the following command on the RM node bin/hadoop .job file params Are you suggesting I submit the script instead of the Nutch .job jar like below? bin/hadoop bin/crawl seedDir crawlID solrURL numberOfRounds On Fri, Aug 29, 2014 at 10:01 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: As the name runtime/deploy suggest - it is used exactly for that purpose ;-) Just make sure HADOOP_HOME/bin is added to the path and run the script, that's all. Look at the bottom of the nutch script for details. Julien PS: there will be a Nutch tutorial at the forthcoming ApacheCon EU ( http://sched.co/1pbE15n) were we'll cover things like these On 29 August 2014 14:30, S.L simpleliving...@gmail.com wrote: Thanks, can this be used on a hadoop cluster? Sent from my HTC - Reply message - From: Julien Nioche lists.digitalpeb...@gmail.com To: user@nutch.apache.org user@nutch.apache.org Subject: Nutch 1.7 fetch happening in a single map task. Date: Fri, Aug 29, 2014 9:00 AM See http://wiki.apache.org/nutch/NutchTutorial#A3.3._Using_the_crawl_script just go to runtime/deploy/bin and run the script from there. Julien On 29 August 2014 13:38, Meraj A. Khan mera...@gmail.com wrote: Hi Julien, I have 15 domains and they are all being fetched in a single map task which does not fetch all the urls no matter what depth or topN i give. I am submitting the Nutch job jar which seems to be using the Crawl.java class, how do I use the Crawl script on a Hadoop cluster, are there any pointers you can share? Thanks. On Aug 29, 2014 4:40 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi Meraj, The generator will place all the URLs in a single segment if all they belong to the same host for politeness reason. Otherwise it will use whichever value is passed with the -numFetchers parameter in the generation step. Why don't you use the crawl script in /bin instead of tinkering with the (now deprecated) Crawl class? It comes with a good default configuration and should make your life easier.
Julien On 28 August 2014 06:47, Meraj A. Khan mera...@gmail.com wrote: Hi All, I am running Nutch 1.7 on Hadoop 2.3.0 cluster and and I noticed that there is only a single reducer in the generate partition job. I am running in a situation where the subsequent fetch is only running in a single map task (I believe as a consequence of a single reducer in the earlier phase). How can I force Nutch to do fetch in multiple map tasks , is there a setting to force more than one reducers in the generate-partition job to have more map tasks ?. Please also note that I have commented out the code in Crawl.java to not do the LInkInversion phase as , I dont need the scoring of the URLS that Nutch crawls, every URL is equally important to me. Thanks. -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com
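To verify how many segments the generate step actually produced and roughly how many URLs each one holds, the segment reader can list them; a quick sketch with a placeholder path:
  # prints one row per segment with its generated/fetched/parsed counts
  bin/nutch readseg -list -dir "$CRAWL_PATH/segments"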
Re: Nutch 1.7 fetch happening in a single map task.
Hi Julien, I have 15 domains and they are all being fetched in a single map task which does not fetch all the urls no matter what depth or topN i give. I am submitting the Nutch job jar which seems to be using the Crawl.java class, how do I use the Crawl script on a Hadoop cluster, are there any pointers you can share? Thanks. On Aug 29, 2014 4:40 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi Meraj, The generator will place all the URLs in a single segment if all they belong to the same host for politeness reason. Otherwise it will use whichever value is passed with the -numFetchers parameter in the generation step. Why don't you use the crawl script in /bin instead of tinkering with the (now deprecated) Crawl class? It comes with a good default configuration and should make your life easier. Julien On 28 August 2014 06:47, Meraj A. Khan mera...@gmail.com wrote: Hi All, I am running Nutch 1.7 on Hadoop 2.3.0 cluster and and I noticed that there is only a single reducer in the generate partition job. I am running in a situation where the subsequent fetch is only running in a single map task (I believe as a consequence of a single reducer in the earlier phase). How can I force Nutch to do fetch in multiple map tasks , is there a setting to force more than one reducers in the generate-partition job to have more map tasks ?. Please also note that I have commented out the code in Crawl.java to not do the LInkInversion phase as , I dont need the scoring of the URLS that Nutch crawls, every URL is equally important to me. Thanks. -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble
Nutch 1.7 fetch happening in a single map task.
Hi All, I am running Nutch 1.7 on a Hadoop 2.3.0 cluster and I noticed that there is only a single reducer in the generate-partition job. I am running into a situation where the subsequent fetch is only running in a single map task (I believe as a consequence of the single reducer in the earlier phase). How can I force Nutch to do the fetch in multiple map tasks? Is there a setting to force more than one reducer in the generate-partition job so that there are more map tasks? Please also note that I have commented out the code in Crawl.java so that it does not do the LinkInversion phase, as I don't need the scoring of the URLs that Nutch crawls; every URL is equally important to me. Thanks.
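As Julien points out in his reply elsewhere in this thread, the number of fetch map tasks follows the number of fetch lists written by the generate step, which is controlled by -numFetchers; a minimal sketch (paths and the value 4 are illustrative):
  # write 4 fetch lists so the subsequent fetch job runs with 4 map tasks
  bin/nutch generate crawl/crawldb crawl/segments -topN 100000 -numFetchers 4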
Nutch 1.7 on Hadoop Yarn 2.3.0 performing only 3 rounds of fetching.
Hi All, After spending some time on this I was able to narrow down the problem: when I submit the Nutch 1.7 job to a Hadoop YARN cluster, the Hadoop UI, which lists the tasks being executed, shows that only 3 rounds of fetching happen, even though I have given a depth of 100 and my seed list has 10 URLs. Any idea why this is happening? Please note that when I run the same Nutch configuration in local mode, i.e. in Eclipse, it does the appropriate number of fetches and also fetches all the URLs from all the domains. Thanks in advance!
Re: Crawl-Delay in robots.txt and fetcher.threads.per.queue config property.
Perfect, thank you Julien! On Thu, Jun 26, 2014 at 10:21 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: If I set the fetcher.threads.per.queue property to more than 1, I believe the behavior would be to have that many threads per host from Nutch; in that case, would Nutch still respect the Crawl-Delay directive in robots.txt and not crawl at a faster pace than what is specified in robots.txt? In short, what I am trying to ask is whether setting fetcher.threads.per.queue to 1 is required for being as polite as Crawl-Delay in robots.txt expects. Using more than 1 thread per queue will ignore any crawl-delay obtained from robots.txt (see https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java#L317 ) and use the fetcher.server.min.delay configuration, which has a default value of 0. So yes, setting fetcher.threads.per.queue to 1 is required for being as polite as Crawl-Delay in robots.txt expects. HTH Julien -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble
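Expressed as configuration, the polite setup Julien confirms above is a single property in conf/nutch-site.xml; a minimal snippet (the enclosing configuration element is omitted, and the description simply restates the answer above):
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>1</value>
    <description>One fetcher thread per host queue. With the value 1 the
    Crawl-Delay from robots.txt is honoured; values above 1 ignore it and
    fall back to fetcher.server.min.delay (default 0).</description>
  </property>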
Please share your experience of using Nutch in production
Hello Folks, I have noticed that Nutch resources and mailing lists are mostly geared towards the usage of Nutch in research-oriented projects. I would like to hear from those of you who are using Nutch in production for large-scale crawling (vertical or non-vertical) about what challenges to expect and how to overcome them. I will list a few challenges that I faced below and, if you have faced them as well, I would like to hear how you overcame them. 1. If I were to build a vertical search engine for websites in a particular domain and follow the crawl-delay directive for politeness in robots.txt, there is still a possibility that the webmaster could block my IP address and I start getting HTTP 403 forbidden/access denied responses. How can I overcome these kinds of issues, other than providing full contact info in nutch-site.xml so the webmaster can get in touch with me before blocking me? 2. The fact that you will be considered just another Nutch variant by the webmaster puts you at a great disadvantage, where you could be blocked from accessing the website at the whims of the webmaster. 3. Can anyone share how they overcame this issue when they were starting out? Did you establish a relationship with each website owner/webmaster to allow unhindered access? 4. Any other tips and suggestions would also be greatly appreciated. Thanks.
Re: Please share your experience of using Nutch in production
Gora, Thanks for sharing your admin perspective , rest assured I am not trying to circumvent any politeness requirements in any way , as I mentioned earlier , I am with in the crawl-delay limits that are being set by the web masters if any , however , you have confirmed my hunch that I might have to reach out to individual webmasters to try and convince them to not block my IP address . Even if I have as small a number as 100 web sites to crawl , it would be a huge challenge for us to communicate with each and every webmaster , how would one go about doing that ? Also is there a standard way the web masters list their contact info so as to sell them the pitch to or persuade them to allows us to crawl their websites at a reasonable frequency? By being at a disadvantage , I meant at a disadvantage compared to major players like Google, Bing and Yahoo bots , whom the webmasters probably would not block access, and by Nutch variant , I meant an instance of a customized crawler based on Nutch. Thanks. On Sun, Jun 22, 2014 at 1:33 PM, Gora Mohanty g...@mimirtech.com wrote: On 22 June 2014 22:07, Meraj A. Khan mera...@gmail.com wrote: Hello Folks, I have noticed that Nutch resources and mailing lists are mostly geared towards the usage of Nutch in research oriented projects , I would like to know from those of you who are using Nutch in production for large scale crawling (vertical or non-vertical) about what challenges to expect and how to overcome them. I will list a few challenges that I faced below and would like to hear from if you faced these challenges you on how you overcame these. 1. If I were to go for a vertical search engine for websites in a particular domain and follow the crawl-delay directive for politeness in the robots.txt , there is a possibility that the web master could still block my IP address and I start getting HTTP 403 forbidden/access denied messages. How can I overcome these kind of issues , other than providing full contact info in the nutch-site.xml for the web master to get in touch with me, before blocking me ?. Er, providing full access info. is just basic politeness, and IMHO should become a requirement for Nutch. If you are going to hit some sites particularly hard, with good reasons, try contacting the website administrators and explaining to them why you need such access. We both administer, and crawl sites, and as an administrator I am quite willing to accept reasonable requests. After all, it is also our goal to promote our websites, and already most traffic on the web is through search engines. 2. The fact that you will be considered as just another Nutch variant by web master puts you at a great level of dis-advantage , where you could be blocked from accessing the web site at the whims of the web master. Not sure what you mean by just another Nutch variant, nor why you think that it puts you at a disadvantage. Disadvantage compared to whom? Also, whims of the web master? Really? After all, it is their resources that you are using, and they are perfectly within their rights to ban you if they feel, for whatever reason, that you are abusing such resources. 3. Can anyone share info as to how they overcame this issue when they were starting out , did you establish a relationship with each website owner/master to allows unhindered access ? 4. Any other tips and suggestions would also be greatly appreciated. Sorry if I am misreading the above, but what you are asking for smells like trying to circumvent reasonable requirements. 
Yes, do try talking to website administrators. You might find them to be surprisingly accommodating if you are reasonable in return. Regards, Gora
Re: Relationship between fetcher.threads.fetch and fetcher.threads.per.host
Sebastian, Thanks for the clear explanation. I have a couple of similar questions. 1. If I set the fetcher.threads.per.host (or the renamed fetcher.threads.per.queue) property to more than the default of 1, would my crawler still be within the crawl-delay limits for each host as specified in its robots.txt? 2. It looks like the max value we set in fetcher.threads.per.host only comes into play when the total number of threads for the map task is less than the value we specify in the fetcher.threads.fetch property? Thanks. On Sun, Jun 22, 2014 at 2:13 PM, Sebastian Nagel wastl.na...@googlemail.com wrote: Hi, 1. fetcher.threads.per.host: 10*3 = 30 Correct. But if there are 1000 hosts you would hardly set it to 3000, see question 2. Keep in mind that the property has been renamed to fetcher.threads.per.queue as of Nutch 1.4! A queue can be defined by host or IP, see fetcher.queue.mode. 2. fetcher.threads.fetch If there are many hosts you would set fetcher.threads.per.host to 1 (the default), and use fetcher.threads.fetch to limit the load on your system (esp. to limit the network load). 3. in distributed mode All URLs from the same host are placed in the same partition. This ensures that host-level blocking can be done in one single JVM. Sebastian On 06/22/2014 05:51 PM, S.L wrote: Hi All, I would like to know the relationship between the two config properties fetcher.threads.fetch and fetcher.threads.per.host. 1. If, let's say, I am crawling 10 hosts in my seed file and set the fetcher.threads.per.host property to 3, should I set the fetcher.threads.fetch property to 10*3, i.e. 30? 2. I can understand the fetcher.threads.per.host property, as it is self-explanatory: it is the number of concurrent connections to a particular host. However, I am not able to clearly follow what fetcher.threads.fetch does. 3. Also, I would like to know how the fetcher.threads.per.host property comes into play in distributed mode. Thanks in advance.
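To make the arithmetic above concrete, here is an illustrative fetch invocation using the numbers from the example (10 hosts, 3 threads per host); the segment name is a placeholder and all values are for illustration only:
  # assumes fetcher.queue.mode=byHost and fetcher.threads.per.queue=3 in nutch-site.xml
  # 10 hosts x 3 threads per host = at most 30 threads doing useful work, so a
  # -threads value (fetcher.threads.fetch) much larger than 30 would mostly sit idle
  bin/nutch fetch crawl/segments/20140622123456 -threads 30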