RE: Two Nutch parallel crawl with two conf folder.

2010-03-09 Thread Pravin Karne
Use two Nutch directories and, of course, two different crawl directories, because with Hadoop they will end up on the same HDFS (assuming you run in distributed or pseudo-distributed mode). 2010/3/9, Pravin Karne pravin_ka...@persistent.co.in: Can we share a Hadoop cluster between two Nutch instances? So there will be two
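A minimal layout sketch of what the reply suggests (all directory names here are hypothetical, chosen only to illustrate the separation):

```
nutch-A/                      nutch-B/
  conf/                         conf/
    crawl-urlfilter.txt           crawl-urlfilter.txt   # different domain filters
  bin/nutch ...                 bin/nutch ...

# separate crawl output dirs, so nothing collides on the shared HDFS:
hdfs: /user/you/crawl-A/      hdfs: /user/you/crawl-B/
```

Each instance is started from its own directory with its own conf, pointing at its own crawl output path; the Hadoop cluster itself is shared.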

RE: Two Nutch parallel crawl with two conf folder.

2010-03-08 Thread Pravin Karne
Hi guys, any pointers on the following? Your help will be highly appreciated. Thanks -Pravin -Original Message- From: Pravin Karne Sent: Friday, March 05, 2010 12:57 PM To: nutch-user@lucene.apache.org Subject: Two Nutch parallel crawl with two conf folder. Hi, I want to do two Nutch parallel

RE: Two Nutch parallel crawl with two conf folder.

2010-03-08 Thread Pravin Karne
@lucene.apache.org Subject: Re: Two Nutch parallel crawl with two conf folder. How parallel is parallel in your case? Don't forget Hadoop in distributed mode will serialize your jobs anyhow. For the rest, why don't you create two Nutch directories and run things totally independently? 2010/3/8, Pravin Karne

Two Nutch parallel crawl with two conf folder.

2010-03-04 Thread Pravin Karne
Hi, I want to do two Nutch parallel crawls with two conf folders. I am using the crawl command to do this. I have two separate conf folders; all files in conf are the same except crawl-urlfilter.txt. In this file we have different filters (domain filters), e.g. the 1st conf has -
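A sketch of how the two crawl-urlfilter.txt files might differ; the domain names are hypothetical stand-ins for the poster's actual filters:

```
# conf-A/crawl-urlfilter.txt
+^http://([a-z0-9]*\.)*site-a.example.com/
-.

# conf-B/crawl-urlfilter.txt
+^http://([a-z0-9]*\.)*site-b.example.com/
-.
```

The trailing `-.` rule rejects anything not accepted by an earlier `+` pattern, so each crawl stays within its own domain.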

How to add sitemap attribute to crawldb while fetching

2010-02-18 Thread Pravin Karne
Hi, Sitemap.xml contains URL info for update frequency and last-modified. So, while fetching the URLs, can we update the crawldatum with the above values? Then a long-running crawl will have updated information every time; no need to re-crawl for updated links. By default this value is 30 days (my
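If one were to implement this, the first step would be extracting those fields from the sitemap before folding them into the crawldatum. A minimal Python sketch, assuming a standard sitemaps.org file (the URL and dates below are made up for illustration):

```python
# Sketch: pull <loc>, <lastmod>, <changefreq> out of a sitemap.xml,
# the two fields the poster wants to feed into the crawldatum.
import xml.etree.ElementTree as ET

SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/page.html</loc>
    <lastmod>2010-01-15</lastmod>
    <changefreq>weekly</changefreq>
  </url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap(xml_text):
    """Return a list of (loc, lastmod, changefreq) tuples."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append((
            url.findtext("sm:loc", namespaces=NS),
            url.findtext("sm:lastmod", namespaces=NS),
            url.findtext("sm:changefreq", namespaces=NS),
        ))
    return entries

print(parse_sitemap(SITEMAP))
# [('http://example.com/page.html', '2010-01-15', 'weekly')]
```

The mapping from `changefreq` values (daily, weekly, ...) onto a per-URL fetch interval would then replace the global 30-day default the poster mentions.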

Cookies issue in Nutch...

2010-02-15 Thread Pravin Karne
Hi, I am trying cookie-based authentication for Nutch fetching. I want to fetch one page that requires login credentials. I have valid cookies for these credentials. If I use this cookie in my standalone application, I get an authenticated response (the required web page). But when I am
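What the poster describes doing in the standalone application amounts to attaching a pre-obtained session cookie to each request. A minimal sketch of that step (URL and cookie value are invented); in Nutch itself the equivalent would have to go through the fetcher's HTTP plugin configuration rather than application code:

```python
# Sketch: attach a previously obtained session cookie to a fetch request.
# The cookie name/value below are placeholders, not a real session.
import urllib.request

def build_request(url, cookie):
    req = urllib.request.Request(url)
    req.add_header("Cookie", cookie)
    return req

req = build_request("http://example.com/protected", "JSESSIONID=abc123")
print(req.get_header("Cookie"))  # JSESSIONID=abc123
```

The usual failure mode is that the standalone client sends the Cookie header on every request while the crawler's HTTP layer drops it, so comparing the outgoing headers of both is a good first diagnostic.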

Why Nutch is not crawling all links from web page

2009-09-22 Thread Pravin Karne
Hi, I am using Nutch to crawl a particular site. But I found that Nutch is not crawling all links from every page. Is there any tuning parameter for Nutch to crawl all links? Thanks in advance -Pravin

Nutch is not crawling all outlinks

2009-09-22 Thread Pravin Karne
Hi, Nutch is not crawling all outlinks even with the following property: <property> <name>db.max.outlinks.per.page</name> <value>-1</value> <description>The maximum number of outlinks that we'll process for a page. If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks will be
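For reference, here is how that override would look in conf/nutch-site.xml; the description text is reconstructed from memory of nutch-default.xml and may differ slightly between versions:

```xml
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a
  page. If this value is nonnegative (>=0), at most
  db.max.outlinks.per.page outlinks will be processed for a page;
  otherwise, all outlinks will be processed.</description>
</property>
```

Note that even with this set to -1, outlinks can still be dropped by other limits, e.g. URL filters or parser content truncation (http.content.limit), which is worth checking when links go missing.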

what is Non DFS Used in cluster summary? how to delete it?

2009-07-06 Thread Pravin Karne
Hi, I am using Nutch 1.0 with a 10-node cluster. I have crawled 1000 sites with depth 10. I got the following cluster summary: 5538 files and directories, 4556 blocks = 10094 total. Heap Size is 50 MB / 888.94 MB (5%). Configured Capacity: 140 TB. DFS Used:
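"Non DFS Used" is disk space on the datanode volumes consumed by files outside HDFS (logs, temp files, other local data); the web UI derives it by subtraction rather than measuring it directly. A sketch of that arithmetic with illustrative numbers (not the poster's actual figures):

```python
# Sketch: how the Hadoop cluster summary derives "Non DFS Used".
# Non DFS Used = Configured Capacity - DFS Used - DFS Remaining
def non_dfs_used(configured_capacity, dfs_used, dfs_remaining):
    return configured_capacity - dfs_used - dfs_remaining

TB = 1024 ** 4
# Illustrative: 140 TB capacity, 30 TB in HDFS blocks, 100 TB free.
print(non_dfs_used(140 * TB, 30 * TB, 100 * TB) / TB)  # 10.0
```

Because it is non-HDFS data, it cannot be deleted through HDFS commands; it shrinks only when local files on the datanode disks (e.g. old logs and temp directories) are cleaned up.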
