Re: Best and economical way of setting hadoop cluster for distributed crawling

2019-11-01 Thread Sachin Mittal
… there's nothing to distribute because all URLs of one domain/host are processed in one fetcher task to ensure politeness. Best, Sebastian. On 11/1/19 6:53 AM, Sachin Mittal wrote: Hi, I understood the point. I would also like to run nutch …

Re: Best and economical way of setting hadoop cluster for distributed crawling

2019-10-31 Thread Sachin Mittal
Hi, I understood the point. I would also like to run nutch on my local machine. So far I am running in standalone mode with the default crawl script, where the fetch time limit is 180 minutes. What I have observed is that it usually fetches, parses and indexes 1800 web pages. I am basically fetching the …
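For context on the 180-minute limit mentioned above: the stock crawl script caps each fetch round via the `fetcher.timelimit.mins` property, which can be overridden on the command line. A hedged sketch of a local run with a longer limit (the seed directory, crawl directory, and round count are example values, not from the thread):

```sh
# Run one crawl round locally, raising the per-round fetch time limit
# from the script's default of 180 minutes to 360 (illustrative values):
bin/crawl -i -D fetcher.timelimit.mins=360 -s urls/ crawl/ 1
```

Raising the limit lets a round fetch more pages before the fetcher stops, at the cost of longer rounds.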

Best and economical way of setting hadoop cluster for distributed crawling

2019-10-22 Thread Sachin Mittal
Hi, I have been running nutch in local mode and so far I have a good understanding of how it all works. I wanted to start with distributed crawling using some public cloud provider. I just wanted to know if fellow users have any experience in setting up nutch for distributed crawling.
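For reference, distributed crawling with Nutch 1.x means running the same crawl script in deploy mode on a Hadoop cluster: the job file built by ant is submitted as MapReduce jobs instead of running in a local JVM. A hedged sketch (assumes a working Hadoop installation and that HDFS paths for seeds exist; all paths are illustrative):

```sh
# Build the .job artifact, then drive the crawl from runtime/deploy,
# which submits each phase (generate, fetch, parse, updatedb) to the
# configured Hadoop cluster rather than running it locally:
ant runtime
runtime/deploy/bin/crawl -i -s urls/ crawl/ 3
```

The local/deploy split is decided by which runtime directory the script is run from, not by a flag.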

Re: what happens to older segments

2019-10-22 Thread Sachin Mittal
… mandatory to update the CrawlDb (command "updatedb") for each segment, which transfers the fetch status information (fetch time, HTTP status, signature, etc.) from the segment to the CrawlDb. Best, Sebastian. On 10/22/19 6:59 AM, Sachin Mittal wrote: …
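Sebastian's point above in command form: after each segment is fetched and parsed, merge its status back into the CrawlDb before the segment is considered done. A minimal sketch, assuming the default crawl/ layout; the segment timestamp is a placeholder:

```sh
# Merge one segment's fetch results (fetch time, HTTP status, signature)
# back into the CrawlDb; <segment> stands for a timestamped directory
# such as the ones created under crawl/segments/ by each round:
bin/nutch updatedb crawl/crawldb crawl/segments/<segment>
```

Only after updatedb has run does the segment's information survive in the CrawlDb, which is why older segments can then be archived or deleted.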

Adding specfic query parameters to nutch url filters

2019-10-21 Thread Sachin Mittal
Hi, I have checked regex-urlfilter.txt and by default I see this line: # skip URLs containing certain characters as probable queries, etc. -[?*!@=] In my case, for a particular url, I want to crawl a specific query, so I wanted to know which file would be the best place to make changes to enable this.
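One way to do this is in conf/regex-urlfilter.txt itself: rules are evaluated top-down and the first matching rule wins, so an accept rule for the wanted query URL placed before the generic skip rule lets it through while everything else with query characters is still rejected. A sketch, with example.com and the page parameter as hypothetical stand-ins for the real URL:

```
# Accept this specific query URL (hypothetical example) BEFORE the
# generic rule below rejects anything containing ?, *, !, @ or =:
+^https?://www\.example\.com/list\?page=\d+
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
```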

Re: Parsed segment has outlinks filtered

2019-10-19 Thread Sachin Mittal
… the -noFilter flag from generate_args in the crawl script. I missed that, since I don't use this script. (Generally, always treat Sebastian's answers as The Best Answers!) Yossi. -----Original Message----- From: Sachin Mittal. Sent: Friday, 18 October 2019 17:…
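The fix the thread converges on: the crawl script passes -noFilter to the generate step (skipping URL filtering there for speed), so removing it makes generate apply urlfilter-regex when building fetch lists. A hedged sketch of the underlying command; exact options vary by Nutch version and the -topN value is illustrative:

```sh
# As invoked by a typical 1.x crawl script (URL filters skipped):
#   bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -noFilter
# Without -noFilter, generate applies the configured URL filters:
bin/nutch generate crawl/crawldb crawl/segments -topN 50000
```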

Re: Parsed segment has outlinks filtered

2019-10-18 Thread Sachin Mittal
… cycle (from the previous crawl cycle's outlinks) does not seem to be applying the url filters defined in urlfilter-regex. Thanks, Sachin. On Thu, Oct 17, 2019 at 11:53 PM Sachin Mittal wrote: Hi, Thanks, I figured this out. Let's hope it works! urlfilter-regex is required to filt…

Parsed segment has outlinks filtered

2019-10-17 Thread Sachin Mittal
Hi, I was a bit confused about the outlinks generated from a parsed url. If I use the utility: bin/nutch parsechecker url, the output lists all the outlinks. However, if I check the dump of the parsed segment generated by the nutch crawl script, using the command: bin/nutch readseg -dump …
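The two commands being compared above, side by side. A hedged sketch; the URL and segment name are placeholders, and the -no* flags suppress every part of the segment except parse_data, which is where the outlinks recorded by a real crawl cycle live:

```sh
# parsechecker fetches and parses a single URL ad hoc, printing the
# outlinks the parser found for that page:
bin/nutch parsechecker https://example.com/

# readseg dumps a segment written by a crawl cycle; keep only
# parse_data (which holds the outlinks) by suppressing the rest:
bin/nutch readseg -dump crawl/segments/<segment> dump_out \
  -nocontent -nofetch -nogenerate -noparse -noparsetext
```

Comparing the two outputs for the same page shows which outlinks were dropped between parsing and the segment dump.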