RE: Two Nutch parallel crawl with two conf folder.

2010-03-09 Thread Pravin Karne
Use two Nutch directories and, of course, two different crawl directories, because with Hadoop they will end up on the same HDFS (assuming you run in distributed or pseudo-distributed mode). 2010/3/9, Pravin Karne pravin_ka...@persistent.co.in: Can we share a Hadoop cluster between two Nutch instances? So there will be two
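A minimal layout sketch of what the reply suggests (all directory names here are hypothetical, chosen only to illustrate the separation):

```
nutch-A/                      nutch-B/
  conf/                         conf/
    crawl-urlfilter.txt           crawl-urlfilter.txt   # different domain filters
  bin/nutch ...                 bin/nutch ...

# separate crawl output dirs, so nothing collides on the shared HDFS:
hdfs: /user/you/crawl-A/      hdfs: /user/you/crawl-B/
```

Each instance is started from its own directory with its own conf, pointing at its own crawl output path; the Hadoop cluster itself is shared.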

RE: Two Nutch parallel crawl with two conf folder.

2010-03-08 Thread Pravin Karne
Hi guys, any pointers on the following? Your help will be highly appreciated. Thanks -Pravin -Original Message- From: Pravin Karne Sent: Friday, March 05, 2010 12:57 PM To: nutch-user@lucene.apache.org Subject: Two Nutch parallel crawl with two conf folder. Hi, I want to do two Nutch parallel

RE: Two Nutch parallel crawl with two conf folder.

2010-03-08 Thread Pravin Karne
@lucene.apache.org Subject: Re: Two Nutch parallel crawl with two conf folder. How parallel is parallel in your case? Don't forget Hadoop in distributed mode will serialize your jobs anyhow. For the rest, why don't you create two Nutch directories and run things totally independently? 2010/3/8, Pravin Karne

Two Nutch parallel crawl with two conf folder.

2010-03-04 Thread Pravin Karne
Hi, I want to do two Nutch parallel crawls with two conf folders. I am using the crawl command to do this. I have two separate conf folders; all files in conf are the same except crawl-urlfilter.txt. In this file we have different filters (domain filters), e.g. the 1st conf has -
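A sketch of how the two crawl-urlfilter.txt files might differ; the domain names are hypothetical stand-ins for the poster's actual filters:

```
# conf-A/crawl-urlfilter.txt
+^http://([a-z0-9]*\.)*site-a.example.com/
-.

# conf-B/crawl-urlfilter.txt
+^http://([a-z0-9]*\.)*site-b.example.com/
-.
```

The trailing `-.` rule rejects anything not accepted by an earlier `+` pattern, so each crawl stays within its own domain.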

How to add sitemap attribute to crawldb while fetching

2010-02-18 Thread Pravin Karne
Hi, Sitemap.xml contains URL info for update frequency and last-modified. So, while fetching the URLs, can we update the crawldatum with the above values? Then a long-running crawl will have updated information every time; no need to re-crawl for updated links. By default this value is 30 days (my
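If one were to implement this, the first step would be extracting those fields from the sitemap before folding them into the crawldatum. A minimal Python sketch, assuming a standard sitemaps.org file (the URL and dates below are made up for illustration):

```python
# Sketch: pull <loc>, <lastmod>, <changefreq> out of a sitemap.xml,
# the two fields the poster wants to feed into the crawldatum.
import xml.etree.ElementTree as ET

SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/page.html</loc>
    <lastmod>2010-01-15</lastmod>
    <changefreq>weekly</changefreq>
  </url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap(xml_text):
    """Return a list of (loc, lastmod, changefreq) tuples."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append((
            url.findtext("sm:loc", namespaces=NS),
            url.findtext("sm:lastmod", namespaces=NS),
            url.findtext("sm:changefreq", namespaces=NS),
        ))
    return entries

print(parse_sitemap(SITEMAP))
# [('http://example.com/page.html', '2010-01-15', 'weekly')]
```

The mapping from `changefreq` values (daily, weekly, ...) onto a per-URL fetch interval would then replace the global 30-day default the poster mentions.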

Cookies issue in Nutch...

2010-02-15 Thread Pravin Karne
Hi, I am trying cookie-based authentication for Nutch fetching. I want to fetch one page that requires login credentials. I have valid cookies for these credentials. If I use this cookie in my standalone application, I get an authenticated response (the required web page). But when I am
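What the poster describes doing in the standalone application amounts to attaching a pre-obtained session cookie to each request. A minimal sketch of that step (URL and cookie value are invented); in Nutch itself the equivalent would have to go through the fetcher's HTTP plugin configuration rather than application code:

```python
# Sketch: attach a previously obtained session cookie to a fetch request.
# The cookie name/value below are placeholders, not a real session.
import urllib.request

def build_request(url, cookie):
    req = urllib.request.Request(url)
    req.add_header("Cookie", cookie)
    return req

req = build_request("http://example.com/protected", "JSESSIONID=abc123")
print(req.get_header("Cookie"))  # JSESSIONID=abc123
```

The usual failure mode is that the standalone client sends the Cookie header on every request while the crawler's HTTP layer drops it, so comparing the outgoing headers of both is a good first diagnostic.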

Why Nutch is not crawling all links from web page

2009-09-22 Thread Pravin Karne
Hi, I am using Nutch to crawl a particular site. But I found that Nutch is not crawling all links from every page. Is there any tuning parameter for Nutch to crawl all links? Thanks in advance -Pravin

Nutch is not crawling all outlinks

2009-09-22 Thread Pravin Karne
Hi, Nutch is not crawling all outlinks even with the following property: <property> <name>db.max.outlinks.per.page</name> <value>-1</value> <description>The maximum number of outlinks that we'll process for a page. If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks will be
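For reference, here is how that override would look in conf/nutch-site.xml; the description text is reconstructed from memory of nutch-default.xml and may differ slightly between versions:

```xml
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a
  page. If this value is nonnegative (>=0), at most
  db.max.outlinks.per.page outlinks will be processed for a page;
  otherwise, all outlinks will be processed.</description>
</property>
```

Note that even with this set to -1, outlinks can still be dropped by other limits, e.g. URL filters or parser content truncation (http.content.limit), which is worth checking when links go missing.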

what is Non DFS Used in cluster summary? how to delete it?

2009-07-06 Thread Pravin Karne
Hi, I am using Nutch 1.0 with a 10-node cluster. I have crawled 1000 sites with depth 10. I got the following cluster summary: 5538 files and directories, 4556 blocks = 10094 total. Heap Size is 50 MB / 888.94 MB (5%). Configured Capacity: 140 TB. DFS Used:
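"Non DFS Used" is disk space on the datanode volumes consumed by files outside HDFS (logs, temp files, other local data); the web UI derives it by subtraction rather than measuring it directly. A sketch of that arithmetic with illustrative numbers (not the poster's actual figures):

```python
# Sketch: how the Hadoop cluster summary derives "Non DFS Used".
# Non DFS Used = Configured Capacity - DFS Used - DFS Remaining
def non_dfs_used(configured_capacity, dfs_used, dfs_remaining):
    return configured_capacity - dfs_used - dfs_remaining

TB = 1024 ** 4
# Illustrative: 140 TB capacity, 30 TB in HDFS blocks, 100 TB free.
print(non_dfs_used(140 * TB, 30 * TB, 100 * TB) / TB)  # 10.0
```

Because it is non-HDFS data, it cannot be deleted through HDFS commands; it shrinks only when local files on the datanode disks (e.g. old logs and temp directories) are cleaned up.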
