Re: Best and economical way of setting hadoop cluster for distributed crawling

2019-11-01 Thread Sachin Mittal
OK understood.
I am using the nutch defaults, and they are set optimally, especially for
polite crawling.
I am indeed crawling just one host right now, and given the defaults the
throughput is what it should be.

Yes, one need not be aggressive here; being patient is the right approach.

I think nowhere in the near future will I have over 10M URLs to crawl across
1000s of hosts, so local crawling is just fine in my case.
I will just continue the way it is right now.

Thanks
Sachin




On Fri, Nov 1, 2019 at 7:36 PM Sebastian Nagel
 wrote:

> Hi Sachin,
>
> > What I have observed is that it usually fetches, parses and indexes
> > 1800 web pages.
>
> This means 10 pages per minute.
>
> How are the 1800 pages distributed over hosts?
>
> The default delay between successive fetches to the same host is
> 5 seconds. If all pages belong to the same host, the crawler is
> waiting 50 sec. every minute and the fetching is done in the remaining
> 10 sec.
>
> If you have the explicit permission to access the host(s) aggressively,
> you can decrease the delay
> (fetcher.server.delay) or even fetch in parallel from the same host
> (fetcher.threads.per.queue).
> Otherwise, please keep the delay as is, and be patient and polite! You also
> risk getting blocked by the web admin.
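>
> If you do have that explicit permission, a minimal sketch of the overrides
> in conf/nutch-site.xml could look like this (the values are only
> illustrative, pick what the site owner allows):
>
>   <property>
>     <name>fetcher.server.delay</name>
>     <value>1.0</value>
>   </property>
>   <property>
>     <name>fetcher.threads.per.queue</name>
>     <value>2</value>
>   </property>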
>
> > What I have understood here is that in local mode there is only one
> > thread doing the fetch?
>
> No. The number of parallel threads used in bin/crawl is 50:
>   --num-threads 
>     Number of threads for fetching / sitemap processing [default: 50]
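>
> So a plain local run along these lines already fetches with many threads
> (the paths are placeholders; check the usage printed by bin/crawl for the
> exact options of your version):
>
>   bin/crawl -i -s urls/ --num-threads 50 crawl/ 5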
>
> I can only second Markus: local mode is sufficient unless you're crawling
> - significantly more than 10M+ URLs
> - from 1000+ domains
>
> With less domains/hosts there's nothing to distribute because all
> URLs of one domain/host are processed in one fetcher task to ensure
> politeness.
>
> Best,
> Sebastian
>
> On 11/1/19 6:53 AM, Sachin Mittal wrote:
> > Hi,
> > I understood the point.
> > I would also like to run nutch on my local machine.
> >
> > So far I am running in standalone mode with default crawl script where
> > fetch time limit is 180 minutes.
> > What I have observed is that it usually fetches, parses and indexes 1800
> > web pages.
> > I am basically fetching the entire page, and the fetch process is the one
> > that takes the most time.
> >
> > I have an i7 processor with 16GB of RAM.
> >
> > How can I increase the throughput here?
> > Is my understanding correct that in local mode there is only one thread
> > doing the fetch?
> >
> > I guess I would need multiple threads running in parallel.
> > Would running nutch in pseudo-distributed mode be an answer here?
> > It will then run multiple fetchers and I can increase my throughput.
> >
> > Please let me know.
> >
> > Thanks
> > Sachin
> >
> >
> >
> >
> >
> >
> > On Thu, Oct 31, 2019 at 2:40 AM Markus Jelsma <
> markus.jel...@openindex.io>
> > wrote:
> >
> >> Hello Sachin,
> >>
> >> Nutch can run on Amazon AWS without trouble, and probably on any Hadoop
> >> based provider. This is the most expensive option you have.
> >>
> >> Cheaper would be to rent some servers and install Hadoop yourself,
> getting
> >> it up and running by hand on some servers will take the better part of a
> >> day.
> >>
> >> The cheapest and easiest, and in almost all cases the best option, is
> not
> >> to run Nutch on Hadoop and stay local. A local Nutch can easily handle a
> >> couple of million URLs. So unless you want to crawl many different
> domains
> >> and expect 10M+ URLs, stay local.
> >>
> >> When we first started our business almost a decade ago we rented VPSs
> >> first and then physical machines. This ran fine for some years but when
> we
> >> had the option to make some good investments, we bought our own hardware
> >> and have been scaling up the cluster ever since. And with the previous
> and
> >> most recent AMD based servers processing power became increasingly
> cheaper.
> >>
> >> If you need to scale up for long term, getting your own hardware is
> indeed
> >> the best option.
> >>
> >> Regards,
> >> Markus
> >>
> >>
> >> -Original message-
> >>> From:Sachin Mittal 
> >>> Sent: Tuesday 22nd October 2019 15:59
> >>> To: user@nutch.apache.org
> >>> Subject: Best and economical way of setting hadoop cluster for
> >> distributed crawling
> >>>
> >>> Hi,
> 

Re: Best and economical way of setting hadoop cluster for distributed crawling

2019-10-31 Thread Sachin Mittal
Hi,
I understood the point.
I would also like to run nutch on my local machine.

So far I am running in standalone mode with the default crawl script, where
the fetch time limit is 180 minutes.
What I have observed is that it usually fetches, parses and indexes 1800
web pages.
I am basically fetching the entire page, and the fetch process is the one that
takes the most time.

I have an i7 processor with 16GB of RAM.

How can I increase the throughput here?
Is my understanding correct that in local mode there is only one thread
doing the fetch?

I guess I would need multiple threads running in parallel.
Would running nutch in pseudo-distributed mode be an answer here?
It will then run multiple fetchers and I can increase my throughput.

Please let me know.

Thanks
Sachin






On Thu, Oct 31, 2019 at 2:40 AM Markus Jelsma 
wrote:

> Hello Sachin,
>
> Nutch can run on Amazon AWS without trouble, and probably on any Hadoop
> based provider. This is the most expensive option you have.
>
> Cheaper would be to rent some servers and install Hadoop yourself, getting
> it up and running by hand on some servers will take the better part of a
> day.
>
> The cheapest and easiest, and in almost all cases the best option, is not
> to run Nutch on Hadoop and stay local. A local Nutch can easily handle a
> couple of million URLs. So unless you want to crawl many different domains
> and expect 10M+ URLs, stay local.
>
> When we first started our business almost a decade ago we rented VPSs
> first and then physical machines. This ran fine for some years but when we
> had the option to make some good investments, we bought our own hardware
> and have been scaling up the cluster ever since. And with the previous and
> most recent AMD based servers processing power became increasingly cheaper.
>
> If you need to scale up for long term, getting your own hardware is indeed
> the best option.
>
> Regards,
> Markus
>
>
> -Original message-
> > From:Sachin Mittal 
> > Sent: Tuesday 22nd October 2019 15:59
> > To: user@nutch.apache.org
> > Subject: Best and economical way of setting hadoop cluster for
> distributed crawling
> >
> > Hi,
> > I have been running nutch in local mode and so far I am able to have a
> good
> > understanding on how it all works.
> >
> > I wanted to start with distributed crawling using some public cloud
> > provider.
> >
> > I just wanted to know if fellow users have any experience in setting up
> > nutch for distributed crawling.
> >
> > From nutch wiki I have some idea on what hardware requirements should be.
> >
> > I just wanted to know which of the public cloud providers (IaaS or PaaS)
> > are good to setup hadoop clusters on. Basically ones on which it is easy
> to
> > setup/manage the cluster and ones which are easy on budget.
> >
> > Please let me know if you folks have any insights based on your
> experiences.
> >
> > Thanks and Regards
> > Sachin
> >
>


Best and economical way of setting hadoop cluster for distributed crawling

2019-10-22 Thread Sachin Mittal
Hi,
I have been running nutch in local mode, and so far I have a good
understanding of how it all works.

I wanted to start with distributed crawling using some public cloud
provider.

I just wanted to know if fellow users have any experience in setting up
nutch for distributed crawling.

From the nutch wiki I have some idea of what the hardware requirements should be.

I just wanted to know which of the public cloud providers (IaaS or PaaS)
are good for setting up hadoop clusters on. Basically, ones on which it is
easy to set up and manage the cluster and which are easy on the budget.

Please let me know if you folks have any insights based on your experiences.

Thanks and Regards
Sachin


Re: what happens to older segments

2019-10-22 Thread Sachin Mittal
Ok.
Understood.

I had one question though: does mergesegs by default update the crawldb once
it merges all the segments?
Or do we have to call the updatedb command on the merged segment to update
the crawldb so that it has all the information for the next cycle?

Thanks
Sachin


On Tue, Oct 22, 2019 at 1:32 PM Sebastian Nagel
 wrote:

> Hi Sachin,
>
>  > I want to know once a new segment is generated is there any use of
>  > previous segments and can they be deleted?
>
> As soon as a segment is indexed and the CrawlDb is updated from this
> segment, you may delete it. But keeping older segments allows
> - reindexing in case something went wrong with the index
> - debugging: check the HTML of a page
>
> When segments are merged, only the most recent record of each URL is kept;
> this saves storage space but requires running the mergesegs tool.
>
>  > Also when we then start the fresh crawl cycle how do we instruct
>  > nutch to use this new merged segment, or it automatically picks up
>  > the newest segment as starting point?
>
> The CrawlDb contains all necessary information for the next cycle.
> It's mandatory to update the CrawlDb (command "updatedb") for each
> segment; this transfers the fetch status information (fetch time, HTTP
> status, signature, etc.) from the segment to the CrawlDb.
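>
> In command form, one cycle is roughly this sketch (the paths are examples):
>
>   bin/nutch generate crawl/crawldb crawl/segments
>   segment=$(ls -d crawl/segments/* | tail -1)   # newest segment
>   bin/nutch fetch "$segment"
>   bin/nutch parse "$segment"
>   bin/nutch updatedb crawl/crawldb "$segment"
>   # optional, only saves space; it does NOT update the CrawlDb:
>   bin/nutch mergesegs crawl/segments_merged -dir crawl/segments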
>
> Best,
> Sebastian
>
> On 10/22/19 6:59 AM, Sachin Mittal wrote:
> > Hi,
> > I have been crawling using nutch.
> > What I have understood is that for each crawl cycle it creates a segment,
> > and for the next crawl cycle it uses the outlinks from the previous segment
> > to generate and fetch the next set of URLs to crawl. Then it creates a new
> > segment with those URLs.
> >
> > I want to know: once a new segment is generated, is there any use for the
> > previous segments, and can they be deleted?
> >
> > I also see a command line tool, mergesegs
> > <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=122916832>.
> > Does it make sense to use this to merge old segments into a new segment
> > before deleting the old segments?
> >
> > Also, when we then start the next crawl cycle, how do we instruct nutch to
> > use this new merged segment, or does it automatically pick up the newest
> > segment as the starting point?
> >
> > Thanks
> > Sachin
> >
>
>


Adding specific query parameters to nutch url filters

2019-10-21 Thread Sachin Mittal
Hi,
I have checked the regex-urlfilter and by default I see this line:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

In my case, for a particular URL I want to crawl a specific query, so I wanted
to know which file would be the best place to make changes to enable this.
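
What I had in mind (example.com and the "page" parameter are just
placeholders) is that, since the rules seem to be applied in order, an accept
rule could go before the default skip rule in regex-urlfilter.txt:

# allow this one query pattern (hypothetical example)
+^https?://www\.example\.com/search\?page=\d+$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]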

Would it be regex-urlfilter? I also see the filter files suffix-urlfilter
and fast-urlfilter.

Would adding filters in either of the latter two files help?
Any idea why these filters are there, i.e. what the potential use case would
be?

Also, if I add multiple filter plugins backed by these files, how does URL
filtering work? Are only those URLs which pass all the plugins selected to be
fetched, or is passing any one plugin enough?

Thanks
Sachin


Re: Parsed segment has outlinks filtered

2019-10-19 Thread Sachin Mittal
Yes, the changes Sebastian suggested seem to be working fine.
I now see all the outlinks in the parsed document, and the subsequent crawl of
the outlinks filters out those that do not match my regex-urlfilter.

Thanks
Sachin


On Fri, Oct 18, 2019 at 11:51 PM  wrote:

> Hi Sachin,
>
> If you're using the default crawl script, I think the answer was in
> Sebastian's email: the default seems to be to filter only in the Parse
> step. This has changed recently, so the Fetch step now filters as well, but
> only if you have the latest code. Otherwise, you need to remove the
> -noFilter flag from generate_args in the crawl script. I missed that, since
> I don't use this script.
> (Generally, always treat Sebastian's answers as The Best Answers!)
>
> Yossi.
>
> -----Original Message-
> From: Sachin Mittal 
> Sent: Friday, 18 October 2019 17:36
> To: user@nutch.apache.org
> Subject: Re: Parsed segment has outlinks filtered
>
> Hi,
> Setting the prop parse.filter.urls= false does not filter out the outlinks.
> I get all the outlinks for my parsed url. So this is working as expected.
> However it has caused something unwarranted on the FetcherThread as now it
> seems to be fetching all the urls (even ones which do not match
> urlfilter-regex).
> These urls were not fetched earlier. So what it seems to be doing is that
> when generating next set of urls, it is not applying urlfilter-regex.
>
> I will play around with noFilter option as Sebastian has mentioned and see
> if this works as expected.
>
> However any idea why the next crawl cycle (from previous crawl cycle's
> outlinks) does not seem to be applying the url filters defined in
> urlfilter-regex
>
> Thanks
> Sachin
>
>
>
> On Thu, Oct 17, 2019 at 11:53 PM Sachin Mittal  wrote:
>
> > Hi,
> >
> > Thanks, I figured this out. Let's hope it works!
> >
> > urlfilter-regex is required to filter out the urls for next crawl,
> > however I still want to index all the outlinks for my current url.
> > The reason is that I may not want nutch to crawl these outlinks in
> > next round, but I may still want some other crawler to scrape these urls.
> >
> > Sachin
> >
> >
> > On Thu, Oct 17, 2019 at 10:01 PM  wrote:
> >
> >> Hi Sachin,
> >>
> >> I'm not sure what you are trying to achieve: If you don't want to
> >> filter the outlinks, why do you enable urlfilter-regex?
> >> Anyway, if you set the property parse.filter.urls to false, the
> >> Parser will not filter outlinks at all.
> >>
> >> Yossi.
> >>
> >> -Original Message-
> >> From: Sachin Mittal 
> >> Sent: Thursday, 17 October 2019 19:15
> >> To: user@nutch.apache.org
> >> Subject: Parsed segment has outlinks filtered
> >>
> >> Hi,
> >> I was bit confused on the outlinks generated from a parsed url.
> >> If I use the utility:
> >>
> >> bin/nutch parsechecker url
> >>
> >> The generated outlinks has all the outlinks.
> >>
> >> However if I check the dump of parsed segment generated using nutch
> >> crawl script using command:
> >>
> >> bin/nutch readseg -dump /segments/<>/ /outputdir -nocontent -nofetch
> >> - nogenerate -noparse -noparsetext
> >>
> >> And I review the same entry's ParseData I see it has lot fewer outlinks.
> >> Basically it has filtered out all the outlinks which did not match
> >> the regex's defined in regex-urlfilter.txt.
> >>
> >> So I want to know if there is a way to avoid this and make sure the
> >> generated outlinks in the nutch segments contains all the urls and
> >> not just the filtered ones.
> >>
> >> Even if you can point to the code where this url filtering happens
> >> for outlinks I can figure out a way to circumvent this.
> >>
> >> Thanks
> >> Sachin
> >>
> >>
>
>


Re: Parsed segment has outlinks filtered

2019-10-18 Thread Sachin Mittal
Hi,
Setting the property parse.filter.urls=false means the outlinks are no longer
filtered out. I get all the outlinks for my parsed URL, so this is working as
expected.
However, it has caused something unwarranted in the FetcherThread, as it now
seems to be fetching all the URLs (even ones which do not match
urlfilter-regex).
These URLs were not fetched earlier. So what it seems to be doing is that,
when generating the next set of URLs, it is not applying urlfilter-regex.

I will play around with the noFilter option, as Sebastian has mentioned, and
see if this works as expected.

However, any idea why the next crawl cycle (generated from the previous crawl
cycle's outlinks) does not seem to be applying the URL filters defined in
urlfilter-regex?

Thanks
Sachin



On Thu, Oct 17, 2019 at 11:53 PM Sachin Mittal  wrote:

> Hi,
>
> Thanks, I figured this out. Let's hope it works!
>
> urlfilter-regex is required to filter out the urls for next crawl, however
> I still want to index all the outlinks for my current url.
> The reason is that I may not want nutch to crawl these outlinks in next
> round, but I may still want some other crawler to scrape these urls.
>
> Sachin
>
>
> On Thu, Oct 17, 2019 at 10:01 PM  wrote:
>
>> Hi Sachin,
>>
>> I'm not sure what you are trying to achieve: If you don't want to filter
>> the outlinks, why do you enable urlfilter-regex?
>> Anyway, if you set the property parse.filter.urls to false, the Parser
>> will not filter outlinks at all.
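>>
>> A minimal sketch of that override in conf/nutch-site.xml (assuming you
>> really do want all outlinks kept in the segment):
>>
>>   <property>
>>     <name>parse.filter.urls</name>
>>     <value>false</value>
>>   </property>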
>>
>> Yossi.
>>
>> -Original Message-
>> From: Sachin Mittal 
>> Sent: Thursday, 17 October 2019 19:15
>> To: user@nutch.apache.org
>> Subject: Parsed segment has outlinks filtered
>>
>> Hi,
>> I was bit confused on the outlinks generated from a parsed url.
>> If I use the utility:
>>
>> bin/nutch parsechecker url
>>
>> The generated outlinks has all the outlinks.
>>
>> However if I check the dump of parsed segment generated using nutch crawl
>> script using command:
>>
>> bin/nutch readseg -dump /segments/<>/ /outputdir -nocontent -nofetch -
>> nogenerate -noparse -noparsetext
>>
>> And I review the same entry's ParseData I see it has lot fewer outlinks.
>> Basically it has filtered out all the outlinks which did not match the
>> regex's defined in regex-urlfilter.txt.
>>
>> So I want to know if there is a way to avoid this and make sure the
>> generated outlinks in the nutch segments contains all the urls and not just
>> the filtered ones.
>>
>> Even if you can point to the code where this url filtering happens for
>> outlinks I can figure out a way to circumvent this.
>>
>> Thanks
>> Sachin
>>
>>


Parsed segment has outlinks filtered

2019-10-17 Thread Sachin Mittal
Hi,
I was a bit confused about the outlinks generated from a parsed URL.
If I use the utility:

bin/nutch parsechecker url

The generated output contains all the outlinks.

However, I also checked the dump of the parsed segment generated by the nutch
crawl script, using the command:

bin/nutch readseg -dump /segments/<>/ /outputdir -nocontent -nofetch
-nogenerate -noparse -noparsetext

When I review the same entry's ParseData there, I see it has a lot fewer
outlinks. Basically, it has filtered out all the outlinks which did not match
the regexes defined in regex-urlfilter.txt.

So I want to know if there is a way to avoid this and make sure the generated
outlinks in the nutch segments contain all the URLs and not just the filtered
ones.

Even if you can just point me to the code where this URL filtering happens for
outlinks, I can figure out a way to circumvent it.

Thanks
Sachin