Nutch selenium

2016-06-03 Thread Deepa Jayaveer
Hi,
We are trying to run Nutch with Selenium and are getting the error
"GDK_BACKEND does not match available displays". We have tried a lot to
resolve this; can anyone help? I am getting this error only when I run
Nutch in a Hadoop cluster. It works perfectly in standalone mode.

Error
org.openqa.selenium.firefox.NotConnectedException: Unable to connect to host localhost on port 7057 after 45000 ms. Firefox console output:
Error: GDK_BACKEND does not match available displays

at org.openqa.selenium.firefox.internal.NewProfileExtensionConnection.start(NewProfileExtensionConnection.java:113)
at org.openqa.selenium.firefox.FirefoxDriver.startClient(FirefoxDriver.java:271)
at org.openqa.selenium.remote.RemoteWebDriver.<init>(RemoteWebDriver.java:119)
at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:216)
at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:211)

Thanks & Regards
Deepa Devi Jayaveer
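
A note on the error: a GDK_BACKEND failure means Firefox could not find a
usable X display on the Hadoop worker nodes, while in standalone mode the
desktop session provides one. A commonly suggested workaround (a sketch,
assuming the selenium plugin drives a real Firefox on each node; the
display number is arbitrary) is to run a virtual framebuffer on every node
and export DISPLAY so the task JVMs inherit it:

# on each Hadoop node, e.g. from hadoop-env.sh or an init script
Xvfb :1 -screen 0 1024x768x24 &
export DISPLAY=:1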




NoRouteToHostException in 2 node cluster

2016-03-01 Thread Deepa Jayaveer
Hi,
When we try to run Nutch on a 2-node cluster, I am getting a
NoRouteToHostException. Can you please help resolve this?


Thanks & Regards
Deepa Devi Jayaveer
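
A NoRouteToHostException between cluster nodes almost always means a
firewall (or a wrong /etc/hosts entry) is blocking one node from reaching
a Hadoop port on the other, rather than a Nutch problem as such. A quick
sketch of checks to run from the failing node (host and port are
placeholders):

# can this node reach the master's RPC port at all?
telnet master-host 9000
# a host-level firewall is the usual culprit on RHEL/CentOS-era clusters
sudo service iptables status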




Re: Nutch 2.4 - Hadoop2 - MySQL compatibility

2016-02-29 Thread Deepa Jayaveer
Hi,
Can you please help with this? Does the latest version of Gora no longer
support RDBMS?

I am trying to run Nutch 2.4 in a distributed environment with MySQL as
the database. I am facing an issue where the webpage schema is not getting
created in the database. It works fine with HBase. Can you please let me
know whether Nutch 2.4 is compatible with MySQL?





From:   Deepa Jayaveer/CHN/TCS
To: user@nutch.apache.org
Date:   25-02-2016 16:01
Subject: Nutch 2.4 - Hadoop2 - MySQL compatibility


Hi,
I am trying to run Nutch 2.4 in a distributed environment with MySQL as
the database. I am facing an issue where the webpage schema is not getting
created in the database. It works fine with HBase. Can you please let me
know whether Nutch 2.4 is compatible with MySQL?


Thanks & Regards
Deepa Devi Jayaveer
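
On the compatibility question: the gora-sql module shipped last with the
Gora 0.2.x line and was dropped from later Gora releases, so Nutch
versions after 2.1 (which bundle newer Gora) cannot create the webpage
schema in MySQL; HBase is the supported path. For reference, a
gora.properties sketch from the older Nutch 2.x + MySQL setup (connection
values are placeholders):

gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=nutch
gora.sqlstore.jdbc.password=*****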





Nutch 2.4 - Hadoop2 - MySQL compatibility

2016-02-25 Thread Deepa Jayaveer
Hi,
I am trying to run Nutch 2.4 in a distributed environment with MySQL as
the database. I am facing an issue where the webpage schema is not getting
created in the database. It works fine with HBase. Can you please let me
know whether Nutch 2.4 is compatible with MySQL?


Thanks & Regards
Deepa Devi Jayaveer




nutch hbase error

2015-06-25 Thread Deepa Jayaveer
Hi,
I tried to integrate Nutch with HBase, using these versions:
Nutch - 2.3
Hadoop - 1.0.1
HBase - 0.94.14
ZooKeeper - 3.4.5
Please let me know whether these versions are correct or whether I have to
upgrade or downgrade. Can anybody help fix the error below? When I try to
run, I get the following exception:
**Log***
2015-06-25 17:03:07,169 ERROR crawl.GeneratorJob - GeneratorJob: org.apache.gora.util.GoraException: java.lang.RuntimeException: org.apache.hadoop.hbase.MasterNotRunningException: Retried 14 times
2015-06-25 17:19:51,302 ERROR crawl.InjectorJob - InjectorJob: org.apache.gora.util.GoraException: java.lang.RuntimeException: org.apache.hadoop.hbase.MasterNotRunningException: Retried 14 times
Caused by: org.apache.hadoop.hbase.MasterNotRunningException: Retried 14 times
The ZooKeeper log shows the IP address of the system on which I am running Nutch with HBase. My IP is ***
2015-06-25 19:30:20,976 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /***:***
2015-06-25 19:30:20,976 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /***:***
2015-06-25 19:30:20,993 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session --- with negotiated timeout 18 for client /***:***
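
"MasterNotRunningException: Retried N times" usually means the client
either cannot locate the ZooKeeper quorum or speaks a different HBase wire
version than the server (the Nutch 2.x tutorial pairs Nutch 2.3 with HBase
0.94.14, so the versions above look plausible). One thing to verify is
that an hbase-site.xml naming the quorum is on the Nutch classpath; a
minimal sketch (hostname is a placeholder):

<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk-host.example.com</value>
  </property>
</configuration>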




Re: http 501 error

2015-06-11 Thread Deepa Jayaveer
Thanks a lot for your response.
Can Nutch handle POST requests?

Thanks
Deepa
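
On the POST question: out of the box Nutch fetches with GET (plus HEAD in
places), and there is no configuration switch that turns a fetch into a
POST. The underlying commons-httpclient 3.x library does support it,
though, so a custom patch to HttpResponse.java could swap in PostMethod
from the same package as the GetMethod quoted below; a sketch with a
hypothetical form parameter:

PostMethod post = new PostMethod(url.toString());
post.addParameter("param", "value");  // whatever the target form expects
int code = httpClient.executeMethod(post);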







From:   Gora Mohanty 
To: user@nutch.apache.org
Date:   11-06-2015 15:23
Subject: Re: http 501 error



Hi,

An HTTP 501 is a "method not implemented" error, as you could have
searched and found out. It means that the server you are trying to crawl
does not implement GET for that URL.

Regards,
Gora


On 11 June 2015 at 14:37, Deepa Jayaveer  wrote:

> Hi All,
>
> When I try to crawl the website, I am getting HTTP response code 501.
> While debugging, I found that the error occurs when the following code
> in HttpResponse.java executes:
>
> GetMethod get = new GetMethod(url.toString());
> int code = httpClient.executeMethod(get);
>
> The code returns 501. Do I need to change anything in the HttpClient
> code? Can you please help fix this?
>
> Thanks
> Deepa



http 501 error

2015-06-11 Thread Deepa Jayaveer
Hi All,

When I try to crawl the website, I am getting HTTP response code 501.
While debugging, I found that the error occurs when the following code in
HttpResponse.java executes:

GetMethod get = new GetMethod(url.toString());
int code = httpClient.executeMethod(get);

The code returns 501. Do I need to change anything in the HttpClient code?
Can you please help fix this?

Thanks
Deepa




Re: [MASSMAIL]dynamic content from the web pages

2015-06-08 Thread Deepa Jayaveer
Thanks for your mail. Yes, the different prices are loaded based on the
size, and they are loaded dynamically using JavaScript.
I am using Nutch 2.1. Will nutch-selenium resolve the issue?

Thanks and Regards
Deepa





From:   Jorge Luis Betancourt González 
To: user@nutch.apache.org
Date:   08-06-2015 18:32
Subject: Re: [MASSMAIL]dynamic content from the web pages



I think you need to give a little more detail on how the different prices
are loaded on the site. Are they loaded dynamically using JavaScript when
you select the size from a select element? If that is the case, one way to
go is the nutch-selenium plugin. What Nutch version are you using?

Regards,

- Original Message -
From: "Deepa Jayaveer" 
To: user@nutch.apache.org
Sent: Monday, June 8, 2015 5:17:27 AM
Subject: [MASSMAIL]dynamic content  from the web pages

Hi All,
How can I retrieve dynamic content from web pages? Say I want to retrieve
the prices of shoes in different sizes from a shopping web site.
Since the web page is not different for the various shoe sizes, I have no
clue how to retrieve them. Any help?


Thanks 
Deepa
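
For the record, the nutch-selenium (protocol-selenium) plugin ships with
releases newer than 2.1, so an upgrade would be needed first. On a release
that bundles it, enabling it is a plugin.includes change in
nutch-site.xml; a sketch with the plugin list abbreviated:

<property>
  <name>plugin.includes</name>
  <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>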






dynamic content from the web pages

2015-06-08 Thread Deepa Jayaveer
Hi All,
How can I retrieve dynamic content from web pages? Say I want to retrieve
the prices of shoes in different sizes from a shopping web site.
Since the web page is not different for the various shoe sizes, I have no
clue how to retrieve them. Any help?


Thanks 
Deepa





reg Error HTTP 307

2015-03-18 Thread Deepa Jayaveer
Hi All,
When we try to crawl a web page, we get the error response below. But when
I tried the same page with jsoup, it worked fine.
Can you please let us know the reason?


HTTP/1.1 307 Authentication Required
Date: Thu, 19 Mar 2015 04:22:46 GMT
Proxy-Connection: close
Via: 1.1 localhost.localdomain
Cache-Control: no-store
Content-Type: text/html
Content-Language: en
Location:  xxx
Connection: close
Content-Length: 243


Thanks and Regards
Deepa
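
The Via and Location headers show a proxy answering with a redirect that
asks for authentication. jsoup follows redirects by default, whereas Nutch
by default records the redirect and queues the target for a later round;
raising http.redirect.max in nutch-site.xml makes it follow redirects
within the same fetch (value illustrative), and the proxy credentials may
also need configuring (see the 407 thread further down):

<property>
  <name>http.redirect.max</name>
  <value>3</value>
</property>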




reg crawled pages with status=2

2014-06-24 Thread Deepa Jayaveer
Hi,
Our requirement is that Nutch should not recrawl pages that have already
been crawled, i.e., crawling should not happen for web pages whose status
is '2' in the webpage table. It should not recrawl them and should not
queue their outlinks either.

Can you please let me know whether this is possible by changing some
configuration parameters in nutch-site.xml?

Thanks and Regards
Deepa
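
There is no single "never recrawl" switch, but the GeneratorJob only
re-selects a page once its fetch interval expires, so giving fetched pages
(status 2 = fetched) a very large interval effectively stops recrawling;
outlinks are only added when a page is freshly fetched and parsed, so
those stop too. A nutch-site.xml sketch (value illustrative; the default
is 2592000 seconds, i.e. 30 days):

<property>
  <name>db.fetch.interval.default</name>
  <value>315360000</value>
</property>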




setting up depth and topN dynamically

2014-03-25 Thread Deepa Jayaveer
Hi,
I need to crawl around 3 URLs per day, and I need to set depth and topN
dynamically for them. Is there any configuration where I can set up depth
and topN dynamically for different URLs?


Thanks and Regards
Deepa Devi Jayaveer
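
With the 2.1-era tooling, depth and topN are arguments to each crawl
invocation rather than fixed nutch-site.xml settings, so they can differ
per URL group simply by running separate cycles over separate seed
directories; a sketch (paths and values are placeholders):

bin/nutch crawl urls/siteA -depth 3 -topN 1000
bin/nutch crawl urls/siteB -depth 5 -topN 200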





reg pagination

2014-03-14 Thread Deepa Jayaveer
Hi,
I am using Nutch 2.1 with MySQL. The requirement is to crawl all the
paginated web pages.

Say, for example, I give the seed URL as the first page (page no. 1) of
some website (http://x.com?num=1) and, through an appropriate regular
expression in the URL filter, make Nutch crawl the pages matching the
pattern "num". Nutch is then able to crawl the URLs
http://x.com?num=2
http://x.com?num=3 ...

Nutch crawls successfully when the pagination URL is given in an anchor
tag (a href).

I face an issue when the web pages use some JavaScript function for
pagination, e.g. a call to a function like onPaginationSubmit(). Nutch is
not able to crawl those pages. Can anyone help with a solution for
crawling these paginated pages?




Thanks and Regards
Deepa Devi 
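
The anchor-tag case works because the URL filter only has to accept links
the parser has already discovered; a regex-urlfilter.txt sketch for the
pattern above (hypothetical):

+^http://x\.com\?num=\d+

The JavaScript case cannot be solved by filters, since
onPaginationSubmit() never produces an href for Nutch to discover; the
usual routes are a headless-browser plugin (such as nutch-selenium,
discussed in the threads above) or generating the num= URLs directly into
the seed list.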




RE: reg custom plugin Runtime exception

2014-02-19 Thread Deepa Jayaveer
Yes, correct... I built it against a 1.x version. Thanks.
Are there any plugins available in 2.x to do HTML parse filtering?

Thanks and Regards
Deepa Devi Jayaveer
Mobile No: 9940662806
Tata Consultancy Services
Mailto: deepa.jayav...@tcs.com
Website: http://www.tcs.com





From: Markus Jelsma
To: user@nutch.apache.org
Date: 02/19/2014 08:11 PM
Subject: RE: reg custom plugin Runtime exception



Looks like you're using 2.x; I think it is called ParseFilter there. How
did you build it anyway, against 1.x perhaps?
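
Concretely, the extension point named in the plugin's plugin.xml must be
one the running Nutch version knows; a sketch of the 2.x declaration (ids
and class are placeholders):

<extension id="org.apache.nutch.parse.filter.xpath"
           name="XPath Parse Filter"
           point="org.apache.nutch.parse.ParseFilter">
  <implementation id="XPathParseFilter"
                  class="org.apache.nutch.parse.filter.XPathParseFilter"/>
</extension>

Under 1.x the same declaration would use
point="org.apache.nutch.parse.HtmlParseFilter", which is exactly the name
in the stack trace quoted below.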

 
 
-Original message-
> From:Deepa Jayaveer 
> Sent: Wednesday 19th February 2014 15:03
> To: user@nutch.apache.org
> Subject: reg custom plugin Runtime exception
> 
> I created a custom plugin (filter-xpath) jar using Maven and added the
> jar to the /runtime/local folder.
> 
> When I try to crawl, I get a RuntimeException that the extension point
> does not exist:
> 
> java.lang.RuntimeException: Plugin (filter-xpath), extension point:
> org.apache.nutch.parse.HtmlParseFilter does not exist.
> at org.apache.nutch.plugin.PluginRepository.<init>(PluginRepository.java:84)
> at org.apache.nutch.plugin.PluginRepository.get(PluginRepository.java:99)
> at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:117)
> at org.apache.nutch.crawl.InjectorJob$UrlMapper.setup(InjectorJob.java:97)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> 
> Not sure where it is going wrong. Can anyone help resolve this?
> 
> Thanks 
> Deepa




reg custom plugin Runtime exception

2014-02-19 Thread Deepa Jayaveer
I created a custom plugin (filter-xpath) jar using Maven and added the jar
to the /runtime/local folder.

When I try to crawl, I get a RuntimeException that the extension point
does not exist:

java.lang.RuntimeException: Plugin (filter-xpath), extension point:
org.apache.nutch.parse.HtmlParseFilter does not exist.
at org.apache.nutch.plugin.PluginRepository.<init>(PluginRepository.java:84)
at org.apache.nutch.plugin.PluginRepository.get(PluginRepository.java:99)
at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:117)
at org.apache.nutch.crawl.InjectorJob$UrlMapper.setup(InjectorJob.java:97)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)

Not sure where it is going wrong. Can anyone help resolve this?

Thanks 
Deepa




RE: sizing guide

2014-02-13 Thread Deepa Jayaveer
Hi,
How do I make smaller mapper/reducer units? Is it by putting a smaller
number of URLs in seed.txt?


Thanks and Regards
Deepa Devi Jayaveer




From: Markus Jelsma
To: user@nutch.apache.org
Date: 02/13/2014 02:54 PM
Subject: RE: sizing guide



Hi,

10GB heap is a complete waste of memory and resources; a 500MB heap is in
most cases enough. It is better to have more small mappers/reducers than a
few large units. Also, 64GB of RAM per datanode/tasktracker is too much
(Nutch is not a long-running process and does not benefit from a large
heap or a lot of OS disk cache), unless you also have 64 CPU cores
available. A rule of thumb of mine is to allocate one CPU core and
500-1000MB RAM per slot.

Cheers 
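
In Hadoop 1.x terms (the stack used in these threads), the number and size
of the units come from the per-tasktracker slot counts and the per-task
heap, not from the seed list; a mapred-site.xml sketch following the
one-core / 500-1000MB rule above (values illustrative for an 8-core node):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1000m</value>
</property>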

 
 
-Original message-
> From:Deepa Jayaveer 
> Sent: Thursday 13th February 2014 8:09
> To: user@nutch.apache.org
> Cc: user@nutch.apache.org
> Subject: Re: sizing guide
> 
> Thanks for your reply.
>   I started off PoC with Nutch-MySQL. Planned to move to Nutch 2.1 with 
> Hbase 
> once I get a fair idea about Nutch.
> For our use case, I need to crawl large documents for 
> around 100 web sites
>  weekly  and our functionality demands to crawl on daily basis or even 
> hourly basis to 
> extract specific information from around 20 different host. Say, 
> Need to extract product details from the retailer's site. 
> In that case, we need to recrawl the pages to get the latest information
> 
> As you mentioned, I can do a batch delete the crawled html data once
> I extract the information from the crawled data. I can expect the 
> crawled data roughly to be around  1 TB (could be deleted on scheduled 
> basis)
> 
> Will these sizing be fine for Nutch installation in production?
> 4 Node Hadoop cluster with 2 TB storage each
> 64 GB RAM each
> 10 GB heap
> 
> Apart from that, need to do HBase data sizing to store the product 
> details(which
> would be around 400 GB of data) 
> can I use the same HBase cluster to store the extracted data where Nutch
> is running?
> 
> Can you please let me know your suggestion or recommendations.
> 
> 
> Thanks and Regards
> Deepa Devi Jayaveer
> Mobile No: 9940662806
> Tata Consultancy Services
> Mailto: deepa.jayav...@tcs.com
> Website: http://www.tcs.com
> 
> 
> 
> 
> 
> From: Tejas Patil
> To: "user@nutch.apache.org"
> Date: 02/13/2014 05:58 AM
> Subject: Re: sizing guide
> 
> 
> 
> If you are looking for specific Nutch 2.1 + MySQL combination, I think 
> that
> there won't be any on the project wiki.
> 
> There is no perfect answer for this as it depends on these factors (this
> list may go on):
> - Nature of data that you are crawling: small html files or large 
> documents.
> - Is it a continuous crawl or few levels ?
> - Are you re-crawling urls ?
> - How big is the crawl space ?
> - Is it an intranet crawl? How frequently are the pages changed?
> 
> Nutch 1.x would be a perfect fit for prod level crawls. If you still 
want
> to use Nutch 2.x, it would be better to switch to some other datastore 
> (eg.
> HBase).
> 
> Below are my experiences with two use cases wherein Nutch was used over
> prod with Nutch 1.x:
> 
> (A) Targeted crawl of a single host
> In this case I wanted to get the data crawled quickly and didn't bother
> about the updates that would happen to the pages. I started off with a 
> five
> node Hadoop cluster but later did the math that it won't get my work 
done
> in few days (remember that you need to have a delay between successive
> requests which the server agrees on else your crawler is banned). Later 
I
> bumped the cluster to 15 nodes. The pages were HTML files with size 
> roughly
> 200k. The crawled data roughly needed 200GB and I had storage of about
> 500GB.
> 
> (B) Open crawl of several hosts
> The configs and memory settings were driven by the prod hardware. I had 
a 
> 4
> node hadoop cluster with 64 GB RAM each. 4 GB heap configured for every
> hadoop job with an exception of generate job which needed more heap 
(8-10
> GB). There was no need to store the crawled data and every batch was
> deleted as soon as it was processed. That said that disk had a capacity 
of
> 2 TB.
> 
> Thanks,
> Tejas
> 
> On Wed, Feb 12, 2014 at 1:01 AM, Deepa Jayaveer 
> wrote:
> 
> > Hi ,
> > I am using Nutch 2.1 with MySQL. Is there a sizing guide available for
> > Nutch 2.1?
> > Is there any recommendation

Re: sizing guide

2014-02-13 Thread Deepa Jayaveer
Thanks a lot for your reply

Thanks and Regards
Deepa Devi Jayaveer
Tata Consultancy Services
Mailto: deepa.jayav...@tcs.com
Website: http://www.tcs.com





From: Tejas Patil
To: "user@nutch.apache.org"
Date: 02/13/2014 02:29 PM
Subject: Re: sizing guide



On Wed, Feb 12, 2014 at 11:08 PM, Deepa Jayaveer 
wrote:

> Thanks for your reply.
>   I started off PoC with Nutch-MySQL. Planned to move to Nutch 2.1 with
> Hbase
> once I get a fair idea about Nutch.
> For our use case, I need to crawl large documents for
> around 100 web sites
>  weekly  and our functionality demands to crawl on daily basis or even
> hourly basis to
> extract specific information from around 20 different host. Say,
> Need to extract product details from the retailer's site.
> In that case, we need to recrawl the pages to get the latest information
>
> As you mentioned, I can do a batch delete the crawled html data once
> I extract the information from the crawled data. I can expect the
> crawled data roughly to be around  1 TB (could be deleted on scheduled
> basis)
>

If you process the data as soon as it is available, then you might not
need to have 1 TB... unless Nutch gets that much data in a single fetch
cycle.

>
> Will these sizing be fine for Nutch installation in production?
> 4 Node Hadoop cluster with 2 TB storage each
> 64 GB RAM each
> 10 GB heap
>

Looks fine. You need to monitor the crawl for the first week or two so as
to know whether you need to change this setup.

>
> Apart from that, need to do HBase data sizing to store the product
> details(which
> would be around 400 GB of data)
> can I use the same HBase cluster to store the extracted data where Nutch
> is running?
>

Yes you can. HBase is a black box to me and it would have a bunch of its
own configs which you could tune.

>
> Can you please let me know your suggestion or recommendations.
>
>
> Thanks and Regards
> Deepa Devi Jayaveer
> Mobile No: 9940662806
> Tata Consultancy Services
> Mailto: deepa.jayav...@tcs.com
> Website: http://www.tcs.com
> 
> 
>
>
>
> From: Tejas Patil
> To: "user@nutch.apache.org"
> Date: 02/13/2014 05:58 AM
> Subject: Re: sizing guide
>
>
>
> If you are looking for specific Nutch 2.1 + MySQL combination, I think
> that
> there won't be any on the project wiki.
>
> There is no perfect answer for this as it depends on these factors (this
> list may go on):
> - Nature of data that you are crawling: small html files or large
> documents.
> - Is it a continuous crawl or few levels ?
> - Are you re-crawling urls ?
> - How big is the crawl space ?
> - Is it an intranet crawl? How frequently are the pages changed?
>
> Nutch 1.x would be a perfect fit for prod level crawls. If you still 
want
> to use Nutch 2.x, it would be better to switch to some other datastore
> (eg.
> HBase).
>
> Below are my experiences with two use cases wherein Nutch was used over
> prod with Nutch 1.x:
>
> (A) Targeted crawl of a single host
> In this case I wanted to get the data crawled quickly and didn't bother
> about the updates that would happen to the pages. I started off with a
> five
> node Hadoop cluster but later did the math that it won't get my work 
done
> in few days (remember that you need to have a delay between successive
> requests which the server agrees on else your crawler is banned). Later 
I
> bumped the cluster to 15 nodes. The pages were HTML files with size
> roughly
> 200k. The crawled data roughly needed 200GB and I had storage of about
> 500GB.
>
> (B) Open crawl of several hosts
> The configs and memory settings were driven by the prod hardware. I had 
a
> 4
> node hadoop cluster with 64 GB RAM each. 4 GB heap configured for every
> hadoop job with an exception of generate job which needed more heap 
(8-10
> GB). There was no need to store the crawled data and every batch was
> deleted as soon as it was processed. That said that disk had a capacity 
of
> 2 TB.
>
> Thanks,
> Tejas
>
> On Wed, Feb 12, 2014 at 1:01 AM, Deepa Jayaveer
> wrote:
>
> > Hi ,
> > I am using Nutch 2.1 with MySQL. Is there a sizing guide available for
> > Nutch 2.1?
> > Are there any recommendations that could be given on sizing memory, CP

Re: sizing guide

2014-02-12 Thread Deepa Jayaveer
Thanks for your reply.
I started off the PoC with Nutch + MySQL, planning to move to Nutch 2.1
with HBase once I get a fair idea about Nutch.
For our use case, I need to crawl large documents from around 100 web
sites weekly, and our functionality demands crawling on a daily or even
hourly basis to extract specific information from around 20 different
hosts. Say, we need to extract product details from a retailer's site.
In that case, we need to recrawl the pages to get the latest information.

As you mentioned, I can do a batch delete of the crawled HTML data once I
extract the information from it. I expect the crawled data to be roughly
around 1 TB (which could be deleted on a scheduled basis).

Will this sizing be fine for a Nutch installation in production?
4-node Hadoop cluster with 2 TB storage each
64 GB RAM each
10 GB heap

Apart from that, I need to do HBase data sizing to store the product
details (which would be around 400 GB of data).
Can I use the same HBase cluster to store the extracted data where Nutch
is running?

Can you please let me know your suggestions or recommendations.


Thanks and Regards
Deepa Devi Jayaveer
Mobile No: 9940662806
Tata Consultancy Services
Mailto: deepa.jayav...@tcs.com
Website: http://www.tcs.com





From: Tejas Patil
To: "user@nutch.apache.org"
Date: 02/13/2014 05:58 AM
Subject: Re: sizing guide



If you are looking for specific Nutch 2.1 + MySQL combination, I think 
that
there won't be any on the project wiki.

There is no perfect answer for this as it depends on these factors (this
list may go on):
- Nature of data that you are crawling: small html files or large 
documents.
- Is it a continuous crawl or few levels ?
- Are you re-crawling urls ?
- How big is the crawl space ?
- Is it an intranet crawl? How frequently are the pages changed?

Nutch 1.x would be a perfect fit for prod level crawls. If you still want
to use Nutch 2.x, it would be better to switch to some other datastore 
(eg.
HBase).

Below are my experiences with two use cases wherein Nutch was used over
prod with Nutch 1.x:

(A) Targeted crawl of a single host
In this case I wanted to get the data crawled quickly and didn't bother
about the updates that would happen to the pages. I started off with a 
five
node Hadoop cluster but later did the math that it won't get my work done
in few days (remember that you need to have a delay between successive
requests which the server agrees on else your crawler is banned). Later I
bumped the cluster to 15 nodes. The pages were HTML files with size 
roughly
200k. The crawled data roughly needed 200GB and I had storage of about
500GB.

(B) Open crawl of several hosts
The configs and memory settings were driven by the prod hardware. I had a 
4
node hadoop cluster with 64 GB RAM each. 4 GB heap configured for every
hadoop job with an exception of generate job which needed more heap (8-10
GB). There was no need to store the crawled data and every batch was
deleted as soon as it was processed. That said that disk had a capacity of
2 TB.

Thanks,
Tejas

On Wed, Feb 12, 2014 at 1:01 AM, Deepa Jayaveer 
wrote:

> Hi ,
> I am using Nutch 2.1 with MySQL. Is there a sizing guide available for
> Nutch 2.1?
> Are there any recommendations that could be given on sizing memory, CPU,
> and disk space for crawling?
>
> Thanks and Regards
> Deepa Devi Jayaveer
> Mobile No: 9940662806
> Tata Consultancy Services
> Mailto: deepa.jayav...@tcs.com
> Website: http://www.tcs.com
> 
> 




sizing guide

2014-02-12 Thread Deepa Jayaveer
Hi,
I am using Nutch 2.1 with MySQL. Is there a sizing guide available for
Nutch 2.1?
Are there any recommendations on sizing memory, CPU, and disk space for
crawling?

Thanks and Regards
Deepa Devi Jayaveer
Mobile No: 9940662806
Tata Consultancy Services
Mailto: deepa.jayav...@tcs.com
Website: http://www.tcs.com






Getting this response code 407 while crawling

2014-01-31 Thread Deepa Jayaveer
Hi,
We are getting response code 407 when we try to reach the website through
our company proxy. I believe it is picking up the user id properly, since
my account gets locked after a few retries by the crawler.
I guess the password is not being set correctly. Do we need to encrypt it
and add it in httpclient-auth.xml?




Attaching the log:
2014-01-31 16:05:16,852 INFO  httpclient.HttpResponse - url http://www.google.com
2014-01-31 16:05:16,921 DEBUG auth.AuthChallengeProcessor - Supported authentication schemes in the order of preference: [ntlm, digest, basic]
2014-01-31 16:05:16,923 INFO  auth.AuthChallengeProcessor - ntlm authentication scheme selected
2014-01-31 16:05:16,923 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
2014-01-31 16:05:16,924 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2014-01-31 16:05:16,948 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
2014-01-31 16:05:16,949 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2014-01-31 16:05:17,193 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
2014-01-31 16:05:17,194 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2014-01-31 16:05:17,195 INFO  httpclient.HttpMethodDirector - Failure authenticating with NTLM @172.20.181.138:8080
2014-01-31 16:05:17,195 INFO  httpclient.HttpResponse - code check 407

can you please help us to resolve this issue. 

Thanks and Regards
Deepa Devi Jayaveer
Tata Consultancy Services
Mailto: deepa.jayav...@tcs.com
Website: http://www.tcs.com

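
For NTLM through a proxy, protocol-httpclient takes the proxy credentials
from plain (unencrypted) properties in nutch-site.xml rather than from
httpclient-auth.xml; a sketch with placeholder values (for the NTLM scheme
the realm is the NT domain):

<property>
  <name>http.proxy.host</name>
  <value>172.20.181.138</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>8080</value>
</property>
<property>
  <name>http.proxy.username</name>
  <value>DOMAIN_USER</value>
</property>
<property>
  <name>http.proxy.password</name>
  <value>********</value>
</property>
<property>
  <name>http.proxy.realm</name>
  <value>NT_DOMAIN</value>
</property>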
