Nutch selenium
Hi,

We are trying to run Nutch with Selenium and are getting the error "GDK_BACKEND does not match available displays". We have tried a lot to resolve this; can anyone help? I get this error only when I run Nutch in a Hadoop cluster. It works perfectly in standalone mode.

Error:
org.openqa.selenium.firefox.NotConnectedException: Unable to connect to host localhost on port 7057 after 45000 ms. Firefox console output: Error: GDK_BACKEND does not match available displays
at org.openqa.selenium.firefox.internal.NewProfileExtensionConnection.start(NewProfileExtensionConnection.java:113)
at org.openqa.selenium.firefox.FirefoxDriver.startClient(FirefoxDriver.java:271)
at org.openqa.selenium.remote.RemoteWebDriver.(RemoteWebDriver.java:119)
at org.openqa.selenium.firefox.FirefoxDriver.(FirefoxDriver.java:216)
at org.openqa.selenium.firefox.FirefoxDriver.(FirefoxDriver.java:211)

Thanks & Regards
Deepa Devi Jayaveer

=-=-= Notice: The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited. If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments. Thank you
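The error above usually means Firefox is being launched on Hadoop worker nodes that have no X display, while the standalone machine has one. A common workaround (a sketch, assuming Xvfb is installed on every worker; the display number :99 is arbitrary) is to start a virtual framebuffer and point DISPLAY at it before the tasktracker launches Firefox:

```sh
# On every Hadoop worker node (e.g. from an init script, before starting the tasktracker):
Xvfb :99 -screen 0 1024x768x24 &   # virtual framebuffer, no real display needed
export DISPLAY=:99                 # make Firefox/Selenium render into it
```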
NoRouteToHostException in 2 node cluster
Hi,

When we try to run Nutch on a 2-node cluster, I am getting a NoRouteToHostException. Can you please help us resolve this?

Thanks & Regards
Deepa Devi Jayaveer
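A NoRouteToHostException at the TCP level almost always points at a firewall (often iptables) or /etc/hosts misconfiguration between the two nodes rather than at Nutch itself. A quick way to narrow it down is to probe the Hadoop daemon ports from each node; the sketch below uses only the JDK (the host names and ports in main are placeholders for your NameNode/JobTracker):

```java
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortProbe {
    // True if a TCP connection to host:port succeeds within timeoutMs.
    // A "no route to host" failure here confirms a network-level problem.
    public static boolean isReachable(String host, int port, int timeoutMs) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical endpoints: replace with your NameNode / JobTracker host:port.
        System.out.println(isReachable("master", 9000, 2000));
        System.out.println(isReachable("slave1", 50060, 2000));
    }
}
```

Run it from each node against the other; if the probe fails in one direction only, check that node's firewall rules and /etc/hosts entries.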
Re: Nutch 2.4 -Hadoop2 -mysql compatibility
Hi,

Can you please help on this? Does the latest version of Gora not support RDBMS? I am trying to run Nutch 2.4 in a distributed environment with MySQL as the database. I am facing an issue where the webpage schema is not getting created in the database. It works fine with HBase. Can you please let me know about the compatibility of Nutch 2.4 with MySQL?

From: Deepa Jayaveer/CHN/TCS
To: user@nutch.apache.org
Date: 25-02-2016 16:01
Subject: Nutch 2.4 -Hadoop2 -mysql compatibility
Nutch 2.4 -Hadoop2 -mysql compatibility
Hi,

I am trying to run Nutch 2.4 in a distributed environment with MySQL as the database. I am facing an issue where the webpage schema is not getting created in the database. It works fine with HBase. Can you please let me know about the compatibility of Nutch 2.4 with MySQL?

Thanks & Regards
Deepa Devi Jayaveer
nutch hbase error
Hi,

I tried to integrate Nutch with HBase. I have used the following versions:

Nutch - 2.3
Hadoop - 1.0.1
HBase - 0.94.14
Zookeeper - 3.4.5

Please let me know whether the versions used are correct or I have to upgrade or downgrade. Can anybody help to fix the error below? When I try to run, I am getting the following exception.

*** Log ***
2015-06-25 17:03:07,169 ERROR crawl.GeneratorJob - GeneratorJob: org.apache.gora.util.GoraException: java.lang.RuntimeException: org.apache.hadoop.hbase.MasterNotRunningException: Retried 14 times
2015-06-25 17:19:51,302 ERROR crawl.InjectorJob - InjectorJob: org.apache.gora.util.GoraException: java.lang.RuntimeException: org.apache.hadoop.hbase.MasterNotRunningException: Retried 14 times
Caused by: org.apache.hadoop.hbase.MasterNotRunningException: Retried 14 times

The Zookeeper log shows the IP address of the system on which I am running Nutch with HBase. My IP is ***

2015-06-25 19:30:20,976 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /***:***
2015-06-25 19:30:20,976 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /***:***
2015-06-25 19:30:20,993 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session --- with negotiated timeout 18 for client /***:***
Re: http 501 error
Thanks a lot for your response. Can Nutch handle POST requests?

Thanks
Deepa

From: Gora Mohanty
To: user@nutch.apache.org
Date: 11-06-2015 15:23
Subject: Re: http 501 error

Hi,

An HTTP 501 error is a "method not implemented" error, as you could have searched and found out. What that means is that the server you are trying to crawl does not implement GET for that URL.

Regards,
Gora

On 11 June 2015 at 14:37, Deepa Jayaveer wrote:
> Hi All,
>
> When I try to crawl the website, I am getting the HTTP response code 501.
> While debugging, I found out that the error occurred when the following
> code executed in HttpResponse.java:
>
> GetMethod get = new GetMethod(url.toString());
> int code = httpClient.executeMethod(get);
>
> code returns 501. Do I need to change anything in the HttpClient program?
> Can you please help to fix this?
>
> Thanks
> Deepa
http 501 error
Hi All,

When I try to crawl the website, I am getting the HTTP response code 501. While debugging, I found out that the error occurred when the following code executed in HttpResponse.java:

GetMethod get = new GetMethod(url.toString());
int code = httpClient.executeMethod(get);

code returns 501. Do I need to change anything in the HttpClient program? Can you please help to fix this?

Thanks
Deepa
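For reference, stock Nutch fetches with GET (the GetMethod call above), and a 501 from the server means it does not implement GET for that URL; crawling a POST-only endpoint would require customizing the protocol plugin. A minimal stdlib sketch for checking whether the same URL accepts POST (the URL and body passed in are placeholders):

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class MethodProbe {
    // Sends a POST with the given body and returns the HTTP status code.
    // A server that answers GET with 501 may still accept POST for the same URL.
    public static int post(String url, byte[] body) throws Exception {
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);                 // enables sending a request body
        try (OutputStream out = con.getOutputStream()) {
            out.write(body);
        }
        return con.getResponseCode();
    }
}
```

If POST works where GET returns 501, the fix belongs in the protocol plugin (or in reconsidering whether that URL is crawlable at all), not in the HttpClient call shown above.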
Re: [MASSMAIL]dynamic content from the web pages
Thanks for your mail. Yes, the different prices are loaded based on the size, and they are dynamically loaded using JavaScript. I am using Nutch 2.1. Will nutch-selenium resolve the issue?

Thanks and Regards
Deepa

From: Jorge Luis Betancourt González
To: user@nutch.apache.org
Date: 08-06-2015 18:32
Subject: Re: [MASSMAIL]dynamic content from the web pages

I think you need to specify a little more detail: how are the different prices loaded in the site? Are they dynamically loaded using JavaScript when you select the size from a select? If this is the case, one way to go is using the nutch-selenium plugin. What Nutch version are you using?

Regards,

- Original Message -
From: "Deepa Jayaveer"
To: user@nutch.apache.org
Sent: Monday, June 8, 2015 5:17:27 AM
Subject: [MASSMAIL]dynamic content from the web pages

Hi All,

How do I retrieve dynamic content from web pages? Say I want to retrieve the prices of shoes in different sizes from a shopping web site. As the web page won't differ for the various shoe sizes, I have no clue how to retrieve them. Any help?

Thanks
Deepa
dynamic content from the web pages
Hi All,

How do I retrieve dynamic content from web pages? Say I want to retrieve the prices of shoes in different sizes from a shopping web site. As the web page won't differ for the various shoe sizes, I have no clue how to retrieve them. Any help?

Thanks
Deepa
reg Error HTTP 307
Hi All,

When we try to crawl a web page, we are getting the error response below. But I tried the same page with JSoup and it works fine. Can you please let us know the reason?

HTTP/1.1 307 Authentication Required
Date: Thu, 19 Mar 2015 04:22:46 GMT
Proxy-Connection: close
Via: 1.1 localhost.localdomain
Cache-Control: no-store
Content-Type: text/html
Content-Language: en
Location: xxx
Connection: close
Content-Length: 243

Thanks and Regards
Deepa
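A 307 with "Authentication Required" and a Via header like this is typically injected by an intermediate proxy rather than sent by the site itself (JSoup may simply be bypassing the proxy, which would explain the difference). One way to see the raw status and redirect target without any redirect-following is a probe like this sketch (JDK only; the URL is a placeholder):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class RedirectProbe {
    // Fetches a URL without following redirects, so a proxy-injected 307
    // and its Location target are visible instead of being swallowed.
    public static String[] probe(String url) throws Exception {
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        con.setInstanceFollowRedirects(false);
        int code = con.getResponseCode();
        String location = con.getHeaderField("Location");
        return new String[] { String.valueOf(code), location };
    }
}
```

If the Location header points at a proxy login page, the fix is proxy authentication configuration on the crawler side, not anything on the target site.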
reg crawled pages with status=2
Hi,

Our requirement is that Nutch should not recrawl pages that have already been crawled, i.e., crawling should not happen for web pages with status '2' in the webpage table. It should not recrawl them and should not add their outlinks either. Can you please let me know whether this is possible by changing some configuration parameters in nutch-site.xml?

Thanks and Regards
Deepa
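Assuming Nutch 2.x (webpage table), there is no single "never recrawl" switch, but two standard properties in conf/nutch-site.xml come close; a sketch (verify the names against your nutch-default.xml):

```xml
<!-- Effectively never consider a fetched page due for refetch (interval in seconds). -->
<property>
  <name>db.fetch.interval.default</name>
  <value>2147483647</value>
</property>

<!-- Do not add newly discovered outlinks to the webpage table. -->
<property>
  <name>db.update.additions.allowed</name>
  <value>false</value>
</property>
```

Note that db.update.additions.allowed suppresses all new URLs, including from seeds injected later in the same table, so it is usually toggled only after the desired crawl frontier exists.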
setting up depth and topN dynamically
Hi,

I need to crawl around 3 URLs per day, for which I need to set depth and topN dynamically. Is there any configuration where I can set up depth and topN dynamically for different URLs?

Thanks and Regards
Deepa Devi Jayaveer
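depth and topN are per-invocation arguments rather than per-URL configuration, so one approach is a wrapper that runs a separate crawl per seed list; a sketch (seed directories, depth and topN values are all hypothetical, and the exact crawl command differs between Nutch releases):

```sh
# Hypothetical per-site crawls; tune -depth/-topN for each URL set.
bin/nutch crawl seeds/site1 -depth 3 -topN 100
bin/nutch crawl seeds/site2 -depth 5 -topN 1000
bin/nutch crawl seeds/site3 -depth 2 -topN 50
```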
reg pagination
Hi,

I am using Nutch 2.1 with MySQL. The requirement is to crawl all paginated web pages. Say, for example, I give the seed URL as the first page (page no. 1) of some website (http://x.com?num=1) and, by giving an appropriate regular expression through the URL filter, make Nutch crawl the pages with the pattern "num". Nutch is able to crawl the given URLs:

http://x.com?num=2
http://x.com?num=3
...

Nutch crawls successfully if the pagination URL is given in the anchor tag (a href). I am facing an issue when the web pages use some JavaScript function for pagination, calling a function like onPaginationSubmit(); Nutch is not able to crawl those pages. Can anyone help with a solution for how to crawl those paginated pages?

Thanks and Regards
Deepa Devi
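For the href case, the accept rule in conf/regex-urlfilter.txt would look something like this sketch (the pattern is illustrative for the example URL; place it above the final catch-all rules). For the onPaginationSubmit() case, no pagination URL ever appears in the HTML, so no URL filter can help; a JavaScript-executing fetcher such as the nutch-selenium plugin is the usual route.

```text
# Accept paginated URLs like http://x.com?num=2, http://x.com?num=3, ...
+^http://x\.com\?num=\d+$
```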
RE: reg custom plugin Runtime exception
Yeah, correct... I built against the 1.x version. Thanks. Are there any plugins available in 2.x to do HTML parse filtering?

Thanks and Regards
Deepa Devi Jayaveer
Mobile No: 9940662806
Tata Consultancy Services
Mailto: deepa.jayav...@tcs.com
Website: http://www.tcs.com
Experience certainty. IT Services Business Solutions Consulting

From: Markus Jelsma
To: user@nutch.apache.org
Date: 02/19/2014 08:11 PM
Subject: RE: reg custom plugin Runtime exception

Looks like you're using 2.x; I think it is called ParseFilter there. How did you build it anyway, against 1.x perhaps?

-Original message-
> From: Deepa Jayaveer
> Sent: Wednesday 19th February 2014 15:03
> To: user@nutch.apache.org
> Subject: reg custom plugin Runtime exception
>
> I created a custom plugin (filter-xpath) jar using Maven and added the jar
> into the /runtime/local folder.
>
> When I try to crawl, I am getting a RuntimeException that the extension
> point does not exist:
>
> java.lang.RuntimeException: Plugin (filter-xpath), extension point:
> org.apache.nutch.parse.HtmlParseFilter does not exist.
> at org.apache.nutch.plugin.PluginRepository.(PluginRepository.java:84)
> at org.apache.nutch.plugin.PluginRepository.get(PluginRepository.java:99)
> at org.apache.nutch.net.URLNormalizers.(URLNormalizers.java:117)
> at org.apache.nutch.crawl.InjectorJob$UrlMapper.setup(InjectorJob.java:97)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>
> Not sure where it is going wrong. Can anyone help to resolve this?
>
> Thanks
> Deepa
reg custom plugin Runtime exception
I created a custom plugin (filter-xpath) jar using Maven and added the jar into the /runtime/local folder.

When I try to crawl, I am getting a RuntimeException that the extension point does not exist:

java.lang.RuntimeException: Plugin (filter-xpath), extension point: org.apache.nutch.parse.HtmlParseFilter does not exist.
at org.apache.nutch.plugin.PluginRepository.(PluginRepository.java:84)
at org.apache.nutch.plugin.PluginRepository.get(PluginRepository.java:99)
at org.apache.nutch.net.URLNormalizers.(URLNormalizers.java:117)
at org.apache.nutch.crawl.InjectorJob$UrlMapper.setup(InjectorJob.java:97)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)

Not sure where it is going wrong. Can anyone help to resolve this?

Thanks
Deepa
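As Markus points out in the reply above, in Nutch 2.x the extension point is org.apache.nutch.parse.ParseFilter, not org.apache.nutch.parse.HtmlParseFilter (which is 1.x). A plugin.xml sketch for the 2.x extension point (the plugin id, jar name and implementation class here are hypothetical; the plugin must also be listed in plugin.includes in nutch-site.xml):

```xml
<plugin id="filter-xpath" name="XPath Parse Filter" version="1.0.0">
   <runtime>
      <library name="filter-xpath.jar">
         <export name="*"/>
      </library>
   </runtime>
   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>
   <extension id="org.apache.nutch.parse.xpath"
              name="XPath Parse Filter"
              point="org.apache.nutch.parse.ParseFilter">
      <implementation id="XPathParseFilter"
                      class="com.example.nutch.XPathParseFilter"/>
   </extension>
</plugin>
```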
RE: sizing guide
Hi,

How do I make smaller mapper/reducer units? Is it by putting fewer URLs in seed.txt?

Thanks and Regards
Deepa Devi Jayaveer

From: Markus Jelsma
To: user@nutch.apache.org
Date: 02/13/2014 02:54 PM
Subject: RE: sizing guide

Hi,

A 10GB heap is a complete waste of memory and resources; a 500MB heap is in most cases enough. It is better to have more small mappers/reducers than a few large units. Also, 64GB of RAM per datanode/tasktracker is too much (Nutch is not a long-running process and does not benefit from a large heap or a lot of OS disk cache), unless you also have 64 CPU cores available. A rule of thumb of mine is to allocate one CPU core and 500-1000MB RAM per slot.

Cheers

-Original message-
> From: Deepa Jayaveer
> Sent: Thursday 13th February 2014 8:09
> To: user@nutch.apache.org
> Subject: Re: sizing guide
>
> Thanks for your reply.
> I started off a PoC with Nutch-MySQL, planning to move to Nutch 2.1 with
> HBase once I get a fair idea about Nutch.
> For our use case, I need to crawl large documents for around 100 web sites
> weekly, and our functionality demands crawling on a daily or even hourly
> basis to extract specific information from around 20 different hosts. Say,
> we need to extract product details from a retailer's site; in that case we
> need to recrawl the pages to get the latest information.
>
> As you mentioned, I can do a batch delete of the crawled HTML data once
> I extract the information from it. I expect the crawled data to be roughly
> around 1 TB (which could be deleted on a scheduled basis).
>
> Will this sizing be fine for a Nutch installation in production?
> 4-node Hadoop cluster with 2 TB storage each
> 64 GB RAM each
> 10 GB heap
>
> Apart from that, I need to do HBase data sizing to store the product
> details (which would be around 400 GB of data).
> Can I use the same HBase cluster to store the extracted data where Nutch
> is running?
>
> Can you please let me know your suggestions or recommendations.
>
> Thanks and Regards
> Deepa Devi Jayaveer
>
> From: Tejas Patil
> To: "user@nutch.apache.org"
> Date: 02/13/2014 05:58 AM
> Subject: Re: sizing guide
>
> If you are looking for the specific Nutch 2.1 + MySQL combination, I think
> there won't be any on the project wiki.
>
> There is no perfect answer for this as it depends on these factors (the
> list may go on):
> - Nature of the data that you are crawling: small HTML files or large documents.
> - Is it a continuous crawl or a few levels?
> - Are you re-crawling URLs?
> - How big is the crawl space?
> - Is it an intranet crawl? How frequently are the pages changed?
>
> Nutch 1.x would be a perfect fit for prod-level crawls. If you still want
> to use Nutch 2.x, it would be better to switch to some other datastore
> (e.g. HBase).
>
> Below are my experiences with two use cases wherein Nutch was used in
> prod with Nutch 1.x:
>
> (A) Targeted crawl of a single host
> In this case I wanted to get the data crawled quickly and didn't bother
> about the updates that would happen to the pages. I started off with a
> five-node Hadoop cluster but later did the math that it wouldn't get my
> work done in a few days (remember that you need to have a delay between
> successive requests which the server agrees on, else your crawler is
> banned). Later I bumped the cluster to 15 nodes. The pages were HTML
> files roughly 200k in size. The crawled data needed roughly 200GB and I
> had storage of about 500GB.
>
> (B) Open crawl of several hosts
> The configs and memory settings were driven by the prod hardware. I had a
> 4-node Hadoop cluster with 64 GB RAM each. A 4 GB heap was configured for
> every Hadoop job, with the exception of the generate job which needed
> more heap (8-10 GB). There was no need to store the crawled data and
> every batch was deleted as soon as it was processed. That said, the disk
> had a capacity of 2 TB.
>
> Thanks,
> Tejas
>
> On Wed, Feb 12, 2014 at 1:01 AM, Deepa Jayaveer wrote:
>
> > Hi,
> > I am using Nutch 2.1 with MySQL. Is there a sizing guide available for
> > Nutch 2.1?
> > Is there any recommendation
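Regarding "smaller mapper/reducer units": that is a Hadoop setting, not a matter of fewer seed URLs — more task slots, each with a modest heap. A mapred-site.xml sketch along the lines of the one-core/500-1000MB-per-slot rule of thumb above (example values for an 8-core node; Hadoop 1.x property names):

```xml
<!-- One slot per core (example: 8-core node). -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>8</value>
</property>

<!-- Modest heap per task instead of one huge 10 GB heap. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1000m</value>
</property>
```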
Re: sizing guide
Thanks a lot for your reply.

Thanks and Regards
Deepa Devi Jayaveer
Tata Consultancy Services
Mailto: deepa.jayav...@tcs.com
Website: http://www.tcs.com
Experience certainty. IT Services Business Solutions Consulting

From: Tejas Patil
To: "user@nutch.apache.org"
Date: 02/13/2014 02:29 PM
Subject: Re: sizing guide

On Wed, Feb 12, 2014 at 11:08 PM, Deepa Jayaveer wrote:
> Thanks for your reply.
> I started off a PoC with Nutch-MySQL, planning to move to Nutch 2.1 with
> HBase once I get a fair idea about Nutch.
> For our use case, I need to crawl large documents for around 100 web sites
> weekly, and our functionality demands crawling on a daily or even hourly
> basis to extract specific information from around 20 different hosts. Say,
> we need to extract product details from a retailer's site; in that case we
> need to recrawl the pages to get the latest information.
>
> As you mentioned, I can do a batch delete of the crawled HTML data once
> I extract the information from it. I expect the crawled data to be roughly
> around 1 TB (which could be deleted on a scheduled basis).

If you process the data as soon as it is available, then you might not need to have 1 TB, unless Nutch gets that much data in a single fetch cycle.

> Will this sizing be fine for a Nutch installation in production?
> 4-node Hadoop cluster with 2 TB storage each
> 64 GB RAM each
> 10 GB heap

Looks fine. You need to monitor the crawl for the first week or two so as to know if you need to change this setup.

> Apart from that, I need to do HBase data sizing to store the product
> details (which would be around 400 GB of data).
> Can I use the same HBase cluster to store the extracted data where Nutch
> is running?

Yes you can. HBase is a black box to me and it would have a bunch of its own configs which you could tune.

> Can you please let me know your suggestions or recommendations.
Re: sizing guide
Thanks for your reply.

I started off a PoC with Nutch-MySQL, planning to move to Nutch 2.1 with HBase once I get a fair idea about Nutch. For our use case, I need to crawl large documents for around 100 web sites weekly, and our functionality demands crawling on a daily or even hourly basis to extract specific information from around 20 different hosts. Say, we need to extract product details from a retailer's site; in that case we need to recrawl the pages to get the latest information.

As you mentioned, I can do a batch delete of the crawled HTML data once I extract the information from it. I expect the crawled data to be roughly around 1 TB (which could be deleted on a scheduled basis).

Will this sizing be fine for a Nutch installation in production?
4-node Hadoop cluster with 2 TB storage each
64 GB RAM each
10 GB heap

Apart from that, I need to do HBase data sizing to store the product details (which would be around 400 GB of data).
Can I use the same HBase cluster to store the extracted data where Nutch is running?

Can you please let me know your suggestions or recommendations.

Thanks and Regards
Deepa Devi Jayaveer
Mobile No: 9940662806
Tata Consultancy Services
Mailto: deepa.jayav...@tcs.com
Website: http://www.tcs.com
Experience certainty. IT Services Business Solutions Consulting

From: Tejas Patil
To: "user@nutch.apache.org"
Date: 02/13/2014 05:58 AM
Subject: Re: sizing guide
sizing guide
Hi,

I am using Nutch 2.1 with MySQL. Is there a sizing guide available for Nutch 2.1? Are there any recommendations that could be given on sizing memory, CPU and disk space for crawling?

Thanks and Regards
Deepa Devi Jayaveer
Mobile No: 9940662806
Tata Consultancy Services
Mailto: deepa.jayav...@tcs.com
Website: http://www.tcs.com
Experience certainty. IT Services Business Solutions Consulting
Getting this response code 407 while crawling
Hi,

We are getting response code 407 when we try to reach websites through our company proxy. I believe it is picking up the user id correctly, as my account gets locked after a few retries by the crawler; I guess the password is not being set correctly. Do we need to encrypt it and add it in httpclient-auth.xml?

Attaching the log:

2014-01-31 16:05:16,852 INFO httpclient.HttpResponse - url http://www.google.com
2014-01-31 16:05:16,921 DEBUG auth.AuthChallengeProcessor - Supported authentication schemes in the order of preference: [ntlm, digest, basic]
2014-01-31 16:05:16,923 INFO auth.AuthChallengeProcessor - ntlm authentication scheme selected
2014-01-31 16:05:16,923 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
2014-01-31 16:05:16,924 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2014-01-31 16:05:16,948 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
2014-01-31 16:05:16,949 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2014-01-31 16:05:17,193 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
2014-01-31 16:05:17,194 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2014-01-31 16:05:17,195 INFO httpclient.HttpMethodDirector - Failure authenticating with NTLM @172.20.181.138:8080
2014-01-31 16:05:17,195 INFO httpclient.HttpResponse - code check 407

Can you please help us resolve this issue?

Thanks and Regards
Deepa Devi Jayaveer
Tata Consultancy Services
Mailto: deepa.jayav...@tcs.com
Website: http://www.tcs.com
Experience certainty. IT Services Business Solutions Consulting
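For an NTLM proxy, protocol-httpclient reads the proxy credentials from plain-text properties (no encryption involved); a nutch-site.xml sketch, with every value a placeholder for your environment (check your nutch-default.xml for the exact property names — http.proxy.realm is typically used as the NTLM domain):

```xml
<property>
  <name>http.proxy.host</name>
  <value>172.20.181.138</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>8080</value>
</property>
<property>
  <name>http.proxy.username</name>
  <value>your-user-id</value>
</property>
<property>
  <name>http.proxy.password</name>
  <value>your-password</value>
</property>
<property>
  <name>http.proxy.realm</name>
  <value>YOUR-NTLM-DOMAIN</value>
</property>
```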