> … there's nothing to distribute because all
> URLs of one domain/host are processed in one fetcher task to ensure
> politeness.
>
> Best,
> Sebastian
>
> On 11/1/19 6:53 AM, Sachin Mittal wrote:
> > Hi,
> > I understood the point.
> > I would also like to run nutch …
Hi,
I understood the point.
I would also like to run nutch on my local machine.
So far I am running in standalone mode with the default crawl script, where
the fetch time limit is 180 minutes.
What I have observed is that it usually fetches, parses and indexes 1800
web pages.
I am basically fetching the …
Hi,
I have been running nutch in local mode and so far I have a good
understanding of how it all works.
I wanted to start with distributed crawling using some public cloud
provider.
I just wanted to know if fellow users have any experience in setting up
nutch for distributed crawling.
> … it is mandatory to update the CrawlDb (command "updatedb") for each
> segment which transfers the fetch status information (fetch time, HTTP
> status, signature, etc.) from
> the segment to the CrawlDb.
>
> Best,
> Sebastian
>
> On 10/22/19 6:59 AM, Sachin Mittal wrote:
>
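The updatedb step Sebastian describes is the last command of one crawl cycle. A sketch of a full cycle in local mode, assuming the conventional layout (the `crawl/crawldb` and `crawl/segments` paths are example locations, not fixed names):

```shell
# 1. Generate a new segment of URLs that are due for fetching
bin/nutch generate crawl/crawldb crawl/segments

# The newest directory under crawl/segments is the segment just generated
SEGMENT=crawl/segments/$(ls crawl/segments | tail -1)

# 2. Fetch the segment, then parse it
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT

# 3. Update the CrawlDb with the fetch status information recorded in the
#    segment (fetch time, HTTP status, signature, ...) -- the step above
bin/nutch updatedb crawl/crawldb $SEGMENT
```

The default crawl script runs these same steps in a loop, one iteration per segment.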
Hi,
I have checked regex-urlfilter.txt and by default I see this line:
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
In my case for a particular url I want to crawl a specific query, so wanted
to know what file would be the best to make changes to enable this.
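Rules in regex-urlfilter.txt are evaluated top to bottom and the first matching rule wins, so one way to allow queries for a single URL pattern is to add an accept rule above the default exclusion. A sketch, where the host and query pattern are placeholders for the actual URL being crawled:

```
# Accept query URLs on this one host (placeholder host and pattern);
# this rule must appear BEFORE the rule that skips URLs with '?' etc.
+^https?://www\.example\.com/search\?

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
```

Any URL not matched by the accept rule still falls through to the default exclusion below it.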
> … the -noFilter flag from generate_args in the crawl script. I missed that, since
> I don't use this script.
> (Generally, always treat Sebastian's answers as The Best Answers!)
>
> Yossi.
>
> -----Original Message-----
> From: Sachin Mittal
> Sent: Friday, 18 October 2019 17
… cycle (from the previous crawl cycle's outlinks) does not seem to be
applying the url filters defined in urlfilter-regex.
Thanks
Sachin
On Thu, Oct 17, 2019 at 11:53 PM Sachin Mittal wrote:
> Hi,
>
> Thanks, I figured this out. Let's hope it works!
>
> urlfilter-regex is required to filt…
Hi,
I was a bit confused about the outlinks generated from a parsed url.
If I use the utility:
bin/nutch parsechecker url
the output lists all the outlinks.
However, if I check the dump of a parsed segment generated by the nutch crawl
script using the command:
bin/nutch readseg -dump
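For reference, the two ways of inspecting outlinks being compared look roughly like this (the segment timestamp and output directory are placeholders):

```shell
# Parse a single URL directly and print its parse output, including outlinks
bin/nutch parsechecker https://example.com/

# Dump a segment written by the crawl script; readseg writes a text file
# named "dump" into the given output directory (segment name is an example)
bin/nutch readseg -dump crawl/segments/20191018123456 segdump
less segdump/dump
```

Comparing the outlinks in the two outputs is a quick way to see at which stage URLs are dropped.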