From: Meraj A. Khan mera...@gmail.com
To: user@nutch.apache.org
Sent: Saturday, February 28, 2015 12:09:47 AM
Subject: [MASSMAIL]Re: [MASSMAIL]How to make Nutch 1.7 request mimic a
browser?
Hi Jorge,
Yes, I was exploring changing the http.agent.name property value in
case where the sites either
in
traffic.
Regards,
Can you please set the user agent to something that resembles a
browser like Chrome for example and test? I just posted a query
yesterday for a similar issue where the mobile version of the site
gets served up instead of 500.
On Fri, Feb 27, 2015 at 1:08 PM, Iain Lopata ilopa...@hotmail.com wrote:
In some instances the content that is downloaded in the Fetch phase from an
HTTP URL is not what you would get if you were to access that URL from a
well-known browser such as Google Chrome; that is because the server is
expecting a user agent value that represents a browser.
There is a
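(For later readers: a user agent override of this kind normally goes in
conf/nutch-site.xml. The snippet below is only an illustrative sketch; the
Chrome version string in the value is made up, and depending on the Nutch
version the other http.agent.* properties such as version and description may
need to be left empty so nothing extra is appended to it.)

<!-- sketch only: browser-like user agent for Nutch fetches -->
<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.0 Safari/537.36</value>
</property>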
Hi Folks,
I am facing the exact same problem that is described in JIRA NUTCH-762,
i.e. the generate/update steps take an excessive amount of time and the
actual fetch takes very little time compared to the generate time.
The JIRA issue commits a patch to allow generating multiple
segments in a
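(For reference, the multi-segment generation introduced by NUTCH-762 is
exposed through the Generator's -maxNumSegments option; a hedged sketch of an
invocation, with made-up paths and counts:)

# sketch only: split the generated URLs across up to 5 segments
bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -maxNumSegments 5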
Shadi,
I am not sure what will be the case if example.com itself has external
links; I think it will fetch those with depth 1. But if you want to disable
the fetching of external links, just set the db.ignore.external.links
property to true; you don't need any URL filter set up if you do so.
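(A hedged conf/nutch-site.xml sketch of that setting:)

<!-- sketch only: ignore outlinks that point to external hosts -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>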
On Jan 4, 2015, ...@intel.com wrote:
How can I configure the number of map and reduce tasks? Which parameter is
it? Will more map/reduce tasks make it slower or faster?
Thanks
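(In case it helps later readers: for Nutch on Hadoop the reducer count is a
plain Hadoop setting, while the number of fetch map tasks follows the number
of fetch lists chosen at generate time with -numFetchers. A hedged sketch
with a made-up value; on Hadoop 2 the property is also known as
mapreduce.job.reduces:)

<!-- sketch only: number of reduce tasks for the Nutch jobs -->
<property>
  <name>mapred.reduce.tasks</name>
  <value>8</value>
</property>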
-Original Message-
From: Meraj A. Khan [mailto:mera...@gmail.com]
Sent: Thursday, January 01, 2015 15:17
To: user@nutch.apache.org
Subject: Re: Nutch
, 2014 at 11:18 AM, Meraj A. Khan mera...@gmail.com wrote:
Hi All,
I have a quick question regarding the db.default.fetch.interval
parameter. I have currently set it to 15 days; however, my crawl
cycle itself is going beyond 15 days and up to 30 days. Now I was not
sure, since I have set
That seems rather slow for 20k links; how many map and reduce tasks have
you configured for each of the phases in a Nutch crawl?
On Jan 1, 2015 6:00 AM, Chaushu, Shani shani.chau...@intel.com wrote:
Hi all,
I wanted to know how long Nutch should run.
I changed the configurations, and
/libs/script-runner/script-runner.jar
Regards
Adil I. Abbasi
On Thu, Jan 1, 2015 at 8:51 PM, Meraj A. Khan mera...@gmail.com wrote:
Can you give us the command that you use to start the crawl?
On Jan 1, 2015 10:28 AM, Adil Ishaque Abbasi aiabb...@gmail.com wrote:
When I try to run the Nutch crawl script on Amazon EMR, it gives me this error:
/mnt/var/lib/hadoop/steps/s-3VT1QRVSURPSH/./crawl: line 81:
hdfs:///nutch/bin/nutch: No such file or directory
Hi All,
I have a quick question regarding the db.default.fetch.interval
parameter. I have currently set it to 15 days; however, my crawl
cycle itself is going beyond 15 days and up to 30 days. Now I was not
sure, since I have set the db.default.fetch.interval to be only 15 days,
is there a
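(For reference, a hedged sketch of where this interval lives. The property
name used in this thread is the older days-based one; more recent Nutch 1.x
versions express it as db.fetch.interval.default in seconds, so the value
below, 15 days = 1296000 seconds, is only illustrative:)

<!-- sketch only: default re-fetch interval of 15 days -->
<property>
  <name>db.fetch.interval.default</name>
  <value>1296000</value>
</property>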
I installed it by copying the files to the conf directory; I never tried
without that step to confirm whether the copying is really needed.
On Nov 12, 2014 6:24 AM, mikejf12 i...@semtech-solutions.co.nz wrote:
Hi
I installed two versions of Nutch on a CentOS 6 Linux Hadoop V1.2.1
cluster. I didn't
, if the data is to be pushed to Solr (e.g. with
bin/nutch solrindex http://127.0.0.1:8983/solr crawl/crawldb $SEGMENT),
then after indexing is done you can get rid of the segment
On Mon, Nov 3, 2014 at 12:16 PM, Meraj A. Khan mera...@gmail.com wrote:
Thanks.
How do I definitively determine
Hi All,
I am deleting the segments as soon as they are fetched and parsed. I
have read in previous posts that it is safe to delete the segments
only if they are older than the db.default.fetch.interval; my
understanding is that one does have to wait for the segment to be
older than
is computed after updatedb is issued with that
segment.
So as long as you don't need the parsed data anymore, you can delete
the segment (e.g. after indexing through Solr...).
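(A hedged sketch of that per-segment lifecycle; the paths, the Solr URL and
the segment name are made up:)

# sketch only: one segment's lifecycle, removed once it has been indexed
SEGMENT=crawl/segments/20141103120000
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
bin/nutch updatedb crawl/crawldb $SEGMENT
bin/nutch solrindex http://127.0.0.1:8983/solr crawl/crawldb $SEGMENT
hadoop fs -rm -r $SEGMENT   # or rm -rf $SEGMENT in local mode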
On Mon, Nov 3, 2014 at 8:41 AM, Meraj A. Khan mera...@gmail.com wrote:
Hi All,
I am deleting the segments as soon
Hi All,
I am running the bin/crawl script that comes with Nutch 1.7 on Hadoop YARN
by redirecting its output to a log file as shown below.
/opt/bitconfig/nutch/deploy/bin/crawl /urls crawldirectory 2000 > /tmp/nutch.log 2>&1
The issue I am facing is that randomly this script when it is running a
for the map tasks accordingly.
Julien
On 26 October 2014 19:08, Meraj A. Khan mera...@gmail.com wrote:
Julien,
On further analysis, I found that it was not a delay at reduce time,
but
a long-running fetch map task, when I have multiple fetch map tasks
running on a single segment
Fred,
In my last email on this topic, I mentioned that I am using a single
segment and multiple fetch map tasks, and also the changes that I had to
make to Nutch 1.7 to make that possible on YARN.
Let me know if you cannot find it and I'll resend it.
Meraj.
On Fri, Oct 24, 2014 at
19:08, Meraj A. Khan mera...@gmail.com wrote:
Julien,
Thanks for your suggestion. I looked at the jstack thread dumps, and I
could see that the fetcher threads are in a waiting state and that,
going by the JobClient console, the map phase is not yet complete.
14/10/15 12:09:48
to see what it
is busy doing; that will be a simple way of checking that this is indeed
the source of the problem.
See https://issues.apache.org/jira/browse/NUTCH-1314 for a possible
solution
J.
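(A hedged sketch of taking such a thread dump on the node running the fetch
map task; the process lookup is illustrative and the child JVM name varies by
Hadoop version:)

# sketch only: find the map task child JVM and dump its threads
jps -l | grep -i child
jstack <pid> > fetcher-threads.txt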
On 16 October 2014 06:08, Meraj A. Khan mera...@gmail.com wrote:
Hi All,
I am running
Hi All,
I am running into a situation where the reduce phase of the fetch job, with
parsing enabled at fetch time, is taking an excessively long time.
I have seen recommendations to filter the URLs based on length to
avoid normalization-related delays; I am not filtering any URLs
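(For reference, length-based filtering of that kind is usually done with a
rule in conf/regex-urlfilter.txt, placed before the final catch-all +. rule;
the cutoff below is a made-up example, not a recommended value:)

# sketch only: reject URLs longer than ~200 characters
-^.{200,}$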
Markus,
I have been using Nutch for a while, but I wasn't clear about this issue;
thank you for reminding me that this is Nutch 101 :)
I will go ahead and use topN as the segment size control mechanism,
although I have one question regarding topN, i.e. if I have a topN value of
1000 and if there
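(A hedged sketch of capping segment size that way at generate time; the
paths and the count are made up:)

# sketch only: limit the new segment to the 1000 top-scoring URLs due for fetching
bin/nutch generate crawl/crawldb crawl/segments -topN 1000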
Hi Folks,
I am using Nutch 1.7 on Hadoop YARN; right now there seems to be no way of
controlling the segment size, and a single segment is being created that is
very large for the capacity of my Hadoop cluster. I have available storage
of ~3TB, but since Hadoop generates the spill*.out
Folks,
As mentioned previously, I am running Nutch 1.7 on an Apache Hadoop YARN
cluster.
In order to scale, I would need to fetch concurrently with multiple map
tasks on multiple nodes. I think that the first step to do so would be to
generate multiple segments in the generate phase so that
Hi Edoardo,
How do you generate the multiple segments at the time of generate phase?
On Sep 22, 2014 6:01 AM, Edoardo Causarano edoardo.causar...@gmail.com
wrote:
Hi all,
I’m building an Oozie workflow to schedule the generate, fetch, etc…
workflow. Right now I'm trying to feed the list of
can get more.
Best,
Edoardo
On 22 september 2014 at 14:50:03, Meraj A. Khan (mera...@gmail.com)
wrote:
Hi Edoardo,
How do you generate the multiple segments at the time of generate phase?
On Sep 22, 2014 6:01 AM, Edoardo Causarano
edoardo.causar...@gmail.com
wrote:
Hi
logic for detecting that they've all finished before you move on to the
update step.
Out of curiosity : why do you want to fetch multiple segments at the same
time?
On 19 September 2014 06:00, Meraj A. Khan mera...@gmail.com wrote:
Hello Folks,
I am unable to run multiple fetch map tasks
in your cluster.
Cheers
Jake
On Sep 19, 2014, at 1:52 PM, Meraj A. Khan mera...@gmail.com wrote:
Julien,
How would you achieve parallelism then on a Hadoop cluster? Am I missing
something here? My understanding was that we could scale the crawl by
allowing fetch to happen in multiple
Markus,
Thanks, the issue was that I was setting the PATH variable in the bin/crawl
script; once I removed it and set it outside of the bin/crawl script, it
started working fine.
On Tue, Sep 16, 2014 at 6:39 AM, Markus Jelsma markus.jel...@openindex.io
wrote:
Hi - you made Nutch believe that
and
generates exactly one mapper, although I had changed mode=distributed; any
idea about this please?
Many regards,
Simon
On Mon, Sep 8, 2014 at 7:18 AM, Meraj A. Khan mera...@gmail.com wrote:
I think that is a typo, and it is actually CrawlDirectory. For the single
map task issue, although I
/NutchTutorial#A3.3._Using_the_crawl_script
just go to runtime/deploy/bin and run the script from there.
Julien
On 29 August 2014 13:38, Meraj A. Khan mera...@gmail.com wrote:
Hi Julien,
I have 15 domains and they are all being fetched in a single map task
which
does not fetch all the urls no matter what
with the -numFetchers parameter in the generation
step.
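(A hedged sketch of what that looks like at generate time; the paths and the
count are made up, and the fetch job then runs one map task per generated
fetch list:)

# sketch only: partition the fetch list so the fetch job gets 4 map tasks
bin/nutch generate crawl/crawldb crawl/segments -numFetchers 4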
Why don't you use the crawl script in /bin instead of tinkering with the
(now deprecated) Crawl class? It comes with a good default configuration
and should make your life easier.
Julien
On 28 August 2014 06:47, Meraj A. Khan mera...@gmail.com
Hi All,
I am running Nutch 1.7 on a Hadoop 2.3.0 cluster and I noticed that there
is only a single reducer in the generate partition job. I am running into a
situation where the subsequent fetch is only running in a single map task
(I believe as a consequence of a single reducer in the earlier
Hi All,
After spending some time on this, I was able to diagnose the problem: when
I submit the Nutch 1.7 job to a Hadoop YARN cluster, I notice in the Hadoop
UI, which lists the tasks that it is executing, that only 3 rounds of
fetch happen, even though I have given a depth of 100 and my seed
Perfect, thank you Julien!
On Thu, Jun 26, 2014 at 10:21 AM, Julien Nioche
lists.digitalpeb...@gmail.com wrote:
If I set the fetcher.threads.per.queue property to more than 1, I believe
the behavior would be to have that many threads per host from Nutch;
in that case would
Hello Folks,
I have noticed that Nutch resources and mailing lists are mostly geared
towards the usage of Nutch in research-oriented projects. I would like to
know from those of you who are using Nutch in production for large-scale
crawling (vertical or non-vertical) about what challenges to
probably
would not block access, and by Nutch variant, I meant an instance of a
customized crawler based on Nutch.
Thanks.
On Sun, Jun 22, 2014 at 1:33 PM, Gora Mohanty g...@mimirtech.com wrote:
On 22 June 2014 22:07, Meraj A. Khan mera...@gmail.com wrote:
Hello Folks,
I have noticed
Sebastian,
Thanks for the clear explanation; I have a similar question.
1. If I set the fetcher.threads.per.host or the renamed
fetcher.threads.per.queue property to more than the default 1, would my
crawler still be within the crawl-delay limits for each host as specified
in
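(A hedged conf/nutch-site.xml sketch of that setting; the value is made up,
and the related politeness settings such as fetcher.server.delay and
fetcher.server.min.delay still apply:)

<!-- sketch only: allow 2 concurrent fetch threads per host queue -->
<property>
  <name>fetcher.threads.per.queue</name>
  <value>2</value>
</property>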