Re: How to make Nutch 1.7 request mimic a browser?

2015-03-02 Thread Meraj A. Khan
Hi Jorge, Yes, I was exploring changing the http.agent.name property value in cases where the sites either
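The property discussed in this thread lives in conf/nutch-site.xml. A minimal sketch, assuming a browser-like agent string is acceptable for the sites in question (the User-Agent value shown is illustrative, not taken from the thread):

```xml
<!-- conf/nutch-site.xml: override the default agent name so servers that
     sniff the User-Agent serve the same content they would to a browser. -->
<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows NT 6.1; rv:35.0) Gecko/20100101 Firefox/35.0</value>
</property>
```

Note that sites whose robots.txt or terms of service assume an honest crawler identity may object to a masqueraded agent string.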

Re: How to make Nutch 1.7 request mimic a browser?

2015-03-02 Thread Meraj A. Khan
in traffic. Regards.

Re: Can anyone fetch this page?

2015-02-27 Thread Meraj A. Khan
Can you please set the user agent to something that resembles a browser, Chrome for example, and test? I just posted a query yesterday about a similar issue where the mobile version of the site gets served up instead of 500. On Fri, Feb 27, 2015 at 1:08 PM, Iain Lopata ilopa...@hotmail.com

How to make Nutch 1.7 request mimic a browser?

2015-02-26 Thread Meraj A. Khan
In some instances the content that is downloaded in the Fetch phase from an HTTP URL is not what you would get if you were to access the URL from a well-known browser such as Google Chrome; that is because the server expects a user agent value that represents a browser. There is a

NUTCH-762 Generate Multiple Segments

2015-02-18 Thread Meraj A. Khan
Hi Folks, I am facing the exact same problem described in JIRA NUTCH-762, i.e. the generate -updates takes an excessive amount of time while the actual fetch takes very little time compared to the generate time. The JIRA issue commits a patch to allow generating multiple segments in a
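NUTCH-762 added a -maxNumSegments option to the Generator. A sketch of how it might be invoked, assuming a standard crawldb/segments layout (paths and counts are illustrative):

```sh
# Generate up to 5 segments in one pass instead of a single large one;
# -numFetchers controls how many fetch lists (and thus fetch map tasks)
# each segment is partitioned into.
bin/nutch generate crawl/crawldb crawl/segments -maxNumSegments 5 -numFetchers 5
```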

Re: Depth option

2015-01-04 Thread Meraj A. Khan
Shadi, I am not sure what the behavior will be if example.com itself has external links; I think it will fetch those with depth 1. But if you want to disable the fetching of external links, just set the db.ignore.external.links property to true; you don't need any URL filter set up if you do so. On Jan 4, 2015
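The property in question is db.ignore.external.links in conf/nutch-site.xml; a minimal sketch of the setting described above:

```xml
<property>
  <name>db.ignore.external.links</name>
  <!-- Outlinks pointing to other hosts are dropped, so the crawl stays
       on the seed domains without any extra URL filter rules. -->
  <value>true</value>
</property>
```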

Re: Nutch running time

2015-01-03 Thread Meraj A. Khan
...@intel.com wrote: How can I configure the number of map and reduce tasks? Which parameter is it? Will more map and reduce tasks make it slower or faster? Thanks

Re: Question about db.default.fetch.interval.

2015-01-03 Thread Meraj A. Khan
Hi All, I have a quick question regarding the db.default.fetch.interval parameter. I have currently set it to 15 days; however, my crawl cycle itself is going beyond 15 days and up to 30 days. Now I was not sure, since I have set

Re: Nutch running time

2015-01-01 Thread Meraj A. Khan
It seems kind of slow for 20k links; how many map and reduce tasks have you configured for each of the phases in a Nutch crawl? On Jan 1, 2015 6:00 AM, Chaushu, Shani shani.chau...@intel.com wrote: Hi all, I wanted to know how long Nutch should run. I changed the configurations, and

Re: nutch on amazon emr

2015-01-01 Thread Meraj A. Khan
/libs/script-runner/script-runner.jar Regards Adil I. Abbasi On Thu, Jan 1, 2015 at 8:51 PM, Meraj A. Khan mera...@gmail.com wrote: Can you give us the command that you use to start the crawl? On Jan 1, 2015 10:28 AM, Adil Ishaque Abbasi aiabb...@gmail.com wrote: When I try to nutch

Re: nutch on amazon emr

2015-01-01 Thread Meraj A. Khan
Can you give us the command that you use to start the crawl? On Jan 1, 2015 10:28 AM, Adil Ishaque Abbasi aiabb...@gmail.com wrote: When I try to run the nutch crawl script on Amazon EMR, it gives me this error: /mnt/var/lib/hadoop/steps/s-3VT1QRVSURPSH/./crawl: line 81: hdfs:///nutch/bin/nutch: No

Question about db.default.fetch.interval.

2014-12-28 Thread Meraj A. Khan
Hi All, I have a quick question regarding the db.default.fetch.interval parameter. I have currently set it to 15 days; however, my crawl cycle itself is going beyond 15 days and up to 30 days. Now I was not sure, since I have set the db.default.fetch.interval to only 15 days, is there a
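For reference, a sketch of the setting discussed in this thread: the legacy db.default.fetch.interval property takes a value in days (newer configurations use db.fetch.interval.default, expressed in seconds). The value shown matches the 15 days mentioned above:

```xml
<property>
  <name>db.default.fetch.interval</name>
  <!-- Legacy property, value in DAYS. The newer equivalent is
       db.fetch.interval.default, whose value is in seconds. -->
  <value>15</value>
</property>
```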

Re: Nutch configuration - V1 vs V2 differences

2014-11-12 Thread Meraj A. Khan
I installed it by copying the files to the conf directory; I never tried without that step to confirm whether the copying is really needed. On Nov 12, 2014 6:24 AM, mikejf12 i...@semtech-solutions.co.nz wrote: Hi, I installed two versions of Nutch on a CentOS 6 Linux Hadoop V1.2.1 cluster. I didn't

Re: When to delete the segments?

2014-11-03 Thread Meraj A. Khan
, if the data is to be pushed to Solr (e.g. with bin/nutch solrindex http://127.0.0.1:8983/solr crawl/crawldb $SEGMENT), then after indexing is done you can get rid of the segment On Mon, Nov 3, 2014 at 12:16 PM, Meraj A. Khan mera...@gmail.com wrote: Thanks . How do I definitively determine

Re: Reduce phase in Fetcher taking excessive time to finish.

2014-11-02 Thread Meraj A. Khan
accordingly. Julien On 26 October 2014 19:08, Meraj A. Khan mera...@gmail.com wrote: Julien, On further analysis , I found that it was not a delay at reduce time , but a long running fetch map task , when I have multiple fetch map tasks running on a single segment

When to delete the segments?

2014-11-02 Thread Meraj A. Khan
Hi All, I am deleting the segments as soon as they are fetched and parsed. I have read in previous posts that it is safe to delete a segment only if it is older than db.default.fetch.interval; my understanding is that one does have to wait for the segment to be older than

Re: When to delete the segments?

2014-11-02 Thread Meraj A. Khan
is computed after updatedb is issued with that segment. So as long as you don't need the parsed data anymore, you can delete the segment (e.g. after indexing through Solr...). On Mon, Nov 3, 2014 at 8:41 AM, Meraj A. Khan mera...@gmail.com wrote: Hi All, I am deleting the segments as soon
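The sequence described in this thread (updatedb, index, then drop the segment) can be sketched as shell steps, assuming a Nutch 1.7 deploy on HDFS; the segment name and Solr URL are illustrative:

```sh
# A segment that has already been fetched, parsed, and fed to updatedb.
SEGMENT=crawl/segments/20141103120000

# Push the parsed data to Solr (Nutch 1.7 solrindex syntax).
bin/nutch solrindex http://127.0.0.1:8983/solr crawl/crawldb -linkdb crawl/linkdb $SEGMENT

# Once indexing succeeds, the parsed data is no longer needed on HDFS.
hadoop fs -rm -r $SEGMENT
```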

bin/crawl script losing status updates from the MR job.

2014-10-30 Thread Meraj A. Khan
Hi All, I am running the bin/crawl script that comes with Nutch 1.7 on Hadoop YARN, redirecting its output to a log file as shown below: /opt/bitconfig/nutch/deploy/bin/crawl /urls crawldirectory 2000 > /tmp/nutch.log 2>&1 The issue I am facing is that randomly this script, when it is running a

Re: Reduce phase in Fetcher taking excessive time to finish.

2014-10-30 Thread Meraj A. Khan
for the map tasks accordingly. Julien On 26 October 2014 19:08, Meraj A. Khan mera...@gmail.com wrote: Julien, On further analysis , I found that it was not a delay at reduce time , but a long running fetch map task , when I have multiple fetch map tasks running on a single segment

Re: Generate multiple segments in Generate phase and have multiple Fetch map tasks in parallel.

2014-10-26 Thread Meraj A. Khan
Fred, In my last email on this topic I mentioned that I am using a single segment and multiple fetch map tasks, and also the changes I had to make to Nutch 1.7 to make that possible on YARN. Let me know if you cannot find it and I'll resend those. Meraj. On Fri, Oct 24, 2014 at

Re: Reduce phase in Fetcher taking excessive time to finish.

2014-10-26 Thread Meraj A. Khan
19:08, Meraj A. Khan mera...@gmail.com wrote: Julien, Thanks for your suggestion , I looked at the jstack thread dumps , and I could see that the fetcher threads are in a waiting state and actually the map phase is not yet complete looking at the JobClient console. 14/10/15 12:09:48

Re: Reduce phase in Fetcher taking excessive time to finish.

2014-10-17 Thread Meraj A. Khan
to see what it is busy doing, that will be a simple of way of checking that this is indeed the source of the problem. See https://issues.apache.org/jira/browse/NUTCH-1314 for a possible solution J. On 16 October 2014 06:08, Meraj A. Khan mera...@gmail.com wrote: Hi All, I am running

Reduce phase in Fetcher taking excessive time to finish.

2014-10-15 Thread Meraj A. Khan
Hi All, I am running into a situation where the reduce phase of the fetch job, with parsing enabled at fetch time, is taking an excessively long time. I have seen recommendations to filter the URLs based on length to avoid normalization-related delays; I am not filtering any URLs
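The length-based filtering mentioned above is usually done in conf/regex-urlfilter.txt. A sketch, assuming the stock regex-urlfilter plugin is enabled; the 200-character threshold is illustrative:

```
# conf/regex-urlfilter.txt: reject pathologically long URLs that make
# normalization and fetching expensive ('-' means reject the match).
-^.{200,}$
```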

Re: Generated Segment Too Large

2014-10-07 Thread Meraj A. Khan
Markus, I have been using Nutch for a while, but I wasn't clear about this issue; thank you for reminding me that this is Nutch 101 :) I will go ahead and use topN as the segment size control mechanism, although I have one question regarding topN, i.e. if I have a topN value of 1000 and if there

Generated Segment Too Large

2014-10-06 Thread Meraj A. Khan
Hi Folks, I am using Nutch 1.7 on Hadoop YARN. Right now there seems to be no way of controlling the segment size, and a single segment is being created which is very large for the capacity of my Hadoop cluster. I have available storage of ~3TB, but since Hadoop generates the spill*.out
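As the follow-up in this thread suggests, topN is the usual knob for bounding segment size; a sketch with illustrative paths and numbers:

```sh
# Cap each generated fetch list at 50000 URLs so no single segment
# outgrows the cluster's spill/storage capacity.
bin/nutch generate crawl/crawldb crawl/segments -topN 50000
```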

Re: Generate multiple segments in Generate phase and have multiple Fetch map tasks in parallel.

2014-09-25 Thread Meraj A. Khan
24, 2014 at 6:14 PM, Meraj A. Khan mera...@gmail.com wrote: Folks, As mentioned previously , I am running Nutch 1.7 on a Apache Hadoop YARN cluster . In order to scale I would need to Fetch concurrently with multiple map tasks on multiple nodes ,I think that the first step to do so would

Generate multiple segments in Generate phase and have multiple Fetch map tasks in parallel.

2014-09-24 Thread Meraj A. Khan
Folks, As mentioned previously, I am running Nutch 1.7 on an Apache Hadoop YARN cluster. In order to scale, I would need to fetch concurrently with multiple map tasks on multiple nodes; I think the first step would be to generate multiple segments in the generate phase so that

Re: get generated segments from step / fetch all empty segments

2014-09-22 Thread Meraj A. Khan
Hi Edoardo, How do you generate the multiple segments at the time of generate phase? On Sep 22, 2014 6:01 AM, Edoardo Causarano edoardo.causar...@gmail.com wrote: Hi all, I’m building an Oozie workflow to schedule the generate, fetch, etc… workflow. Right now I'm trying to feed the list of

RE: get generated segments from step / fetch all empty segments

2014-09-22 Thread Meraj A. Khan
can get more. Best, Edoardo On 22 september 2014 at 14:50:03, Meraj A. Khan (mera...@gmail.com) wrote: Hi Edoardo, How do you generate the multiple segments at the time of generate phase? On Sep 22, 2014 6:01 AM, Edoardo Causarano edoardo.causar...@gmail.com wrote: Hi

Re: Running multiple fetch map tasks on a Hadoop Cluster.

2014-09-19 Thread Meraj A. Khan
logic for detecting that they've all finished before you move on to the update step. Out of curiosity: why do you want to fetch multiple segments at the same time? On 19 September 2014 06:00, Meraj A. Khan mera...@gmail.com wrote: Hello Folks, I am unable to run multiple fetch map tasks

Re: Running multiple fetch map tasks on a Hadoop Cluster.

2014-09-19 Thread Meraj A. Khan
in your cluster. Cheers Jake On Sep 19, 2014, at 1:52 PM, Meraj A. Khan mera...@gmail.com wrote: Julien, How would you achieve parallelism then on a Hadoop cluster , am I missing something here? My understanding was that we could scale the crawl by allowing fetch to happen in multiple

Re: Fetch Job Started Failing on Hadoop Cluster

2014-09-16 Thread Meraj A. Khan
Markus, Thanks. The issue was that I was setting the PATH variable in the bin/crawl script; once I removed it and set it outside of the bin/crawl script, it started working fine. On Tue, Sep 16, 2014 at 6:39 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi - you made Nutch believe that

Re: Nutch 1.7 fetch happening in a single map task.

2014-09-08 Thread Meraj A. Khan
and generates exactly one mapper, although I had changed mode=distributed; any idea about this please? Many regards, Simon On Mon, Sep 8, 2014 at 7:18 AM, Meraj A. Khan mera...@gmail.com wrote: I think that is a typo, and it is actually CrawlDirectory. For the single map task issue, although I

Re: Nutch 1.7 fetch happening in a single map task.

2014-09-07 Thread Meraj A. Khan
/deploy/bin and run the script from there. Julien On 29 August 2014 13:38, Meraj A. Khan mera...@gmail.com wrote: Hi Julien, I have 15 domains and they are all being fetched in a single map task which does not fetch all the urls no matter what

Re: Nutch 1.7 fetch happening in a single map task.

2014-08-31 Thread Meraj A. Khan
/NutchTutorial#A3.3._Using_the_crawl_script just go to runtime/deploy/bin and run the script from there. Julien On 29 August 2014 13:38, Meraj A. Khan mera...@gmail.com wrote: Hi Julien, I have 15 domains and they are all being fetched in a single map

Re: Nutch 1.7 fetch happening in a single map task.

2014-08-29 Thread Meraj A. Khan
with the -numFetchers parameter in the generation step. Why don't you use the crawl script in /bin instead of tinkering with the (now deprecated) Crawl class? It comes with a good default configuration and should make your life easier. Julien On 28 August 2014 06:47, Meraj A. Khan mera...@gmail.com
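The -numFetchers parameter mentioned above sets the number of fetch lists, and hence fetch map tasks, produced by the generate step; a sketch with illustrative paths and counts:

```sh
# Partition the generated fetch list into 4 parts, so the next fetch
# round runs as 4 map tasks instead of one.
bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -numFetchers 4
```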

Nutch 1.7 fetch happening in a single map task.

2014-08-27 Thread Meraj A. Khan
Hi All, I am running Nutch 1.7 on a Hadoop 2.3.0 cluster and I noticed that there is only a single reducer in the generate partition job. I am running into a situation where the subsequent fetch runs in only a single map task (I believe as a consequence of the single reducer in the earlier

Nutch 1.7 on Hadoop Yarn 2.3.0 performing only 3 rounds of fetching.

2014-08-24 Thread Meraj A. Khan
Hi All, After spending some time on this I was able to diagnose the problem: when I submit the Nutch 1.7 job to a Hadoop YARN cluster, I notice that the Hadoop UI, which lists the tasks being executed, shows only 3 rounds of fetch, even though I have given a depth of 100 and my seed

Re: Crawl-Delay in robots.txt and fetcher.threads.per.queue config property.

2014-06-26 Thread Meraj A. Khan
Perfect, thank you Julien! On Thu, Jun 26, 2014 at 10:21 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: If I set fetcher.threads.per.queue property to more than 1 , I believe the behavior would be to have those many number of threads per host from Nutch, in that case would

Please share your experience of using Nutch in production

2014-06-22 Thread Meraj A. Khan
Hello Folks, I have noticed that Nutch resources and mailing lists are mostly geared toward the usage of Nutch in research-oriented projects. I would like to hear from those of you who are using Nutch in production for large-scale crawling (vertical or non-vertical) about what challenges to

Re: Please share your experience of using Nutch in production

2014-06-22 Thread Meraj A. Khan
probably would not block access, and by Nutch variant , I meant an instance of a customized crawler based on Nutch. Thanks. On Sun, Jun 22, 2014 at 1:33 PM, Gora Mohanty g...@mimirtech.com wrote: On 22 June 2014 22:07, Meraj A. Khan mera...@gmail.com wrote: Hello Folks, I have noticed

Re: Relationship between fetcher.threads.fetch and fetcher.threads.per.host

2014-06-22 Thread Meraj A. Khan
Sebastian, Thanks for the clear explanation. I have similar questions: 1. If I set the fetcher.threads.per.host or the renamed fetcher.threads.per.queue property to more than the default of 1, would my crawler still be within the crawl-delay limits for each host as specified in
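A sketch of the property under discussion, in conf/nutch-site.xml (the value is illustrative). Note the caveat from nutch-default.xml: values above 1 can cause the per-host politeness delay, including the robots.txt Crawl-Delay, to be disregarded, which is exactly the concern raised in this thread:

```xml
<property>
  <name>fetcher.threads.per.queue</name>
  <!-- Maximum threads allowed to access one host queue at a time.
       Values > 1 may bypass the robots.txt Crawl-Delay; the delay
       between requests is then governed by fetcher.server.min.delay. -->
  <value>2</value>
</property>
```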