Re: domain vs www.domain?

2009-12-10 Thread Jesse Hires
For the specific case I was running into (on a single known domain) using regex-urlnormalizer did the trick. Thanks! Jesse int GetRandomNumber() { return 4; // Chosen by fair roll of dice // Guaranteed to be random } // xkcd.com On Thu, Dec 10, 2009 at 1:01 PM, Andrzej Bia
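
For readers following the thread, here is a minimal Python sketch of the kind of rewrite a regex-based URL normalizer performs. It is not Nutch's plugin or its configuration format; the pattern, the function name `normalize_url`, and the choice to strip `www.` rather than add it are all assumptions made for illustration.

```python
import re

# Hypothetical rule: drop a leading "www." from the host so that
# www.domain.com and domain.com normalize to the same URL.
_WWW_PREFIX = re.compile(r"^(https?://)www\.", re.IGNORECASE)


def normalize_url(url):
    """Return the URL with a leading 'www.' removed from the host, if present."""
    return _WWW_PREFIX.sub(r"\1", url)


if __name__ == "__main__":
    urls = ["http://www.domain.com/page", "http://domain.com/page"]
    # Both inputs collapse to a single normalized form.
    print({normalize_url(u) for u in urls})
```

This approach only suits a single known domain, as in the case above, because some sites serve different content with and without the `www.` prefix.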

RE: how to force nutch to do a recrawl

2009-12-10 Thread BELLINI ADAM
but how you are running sh scripts... you have to use cygwin to be able to edit linux files > Subject: RE: how to force nutch to do a recrawl > Date: Thu, 10 Dec 2009 16:09:13 -0500 > From: vijaya_pet...@sra.com > To: nutch-user@lucene.apache.org > > Adam, > I'm on windows unfortunately!!

RE: how to force nutch to do a recrawl

2009-12-10 Thread Peters, Vijaya
Adam, I'm on windows unfortunately!! I'm using cygdrive, but it doesn't recognize vi. Any idea for opening it in windows? Notepad didn't work either. Vijaya Peters SRA International, Inc. 4350 Fair Lakes Court North Room 4004 Fairfax, VA 22033 Tel: 703-502-1184 www.sra.com Named to FORTUNE's

Re: NOINDEX, NOFOLLOW

2009-12-10 Thread Kirby Bohling
On Thu, Dec 10, 2009 at 2:55 PM, BELLINI ADAM wrote: > > hi, > > thx for these informations, but since i'm using solr index, and when i make a > search i get a blank result... > for example if i will have 10 documents as  a search result, 9 will be ok > (because i display the title and 4 first l

Re: domain vs www.domain?

2009-12-10 Thread Andrzej Bialecki
On 2009-12-10 19:59, Jesse Hires wrote: I'm seeing a lot of duplicates where a single site is getting recognized as two different sites. Specifically I am seeing www.domain.com and domain.com being recognized as two different sites. I imagine there is a setting to prevent this. If so, what is the

RE: how to force nutch to do a recrawl

2009-12-10 Thread BELLINI ADAM
just use vi or vim. i use vi to edit the file > Subject: RE: how to force nutch to do a recrawl > Date: Thu, 10 Dec 2009 15:58:24 -0500 > From: vijaya_pet...@sra.com > To: nutch-user@lucene.apache.org > > Adam, > What do I use to open a CRC file? I tried QuickSFV. Thanks in advance! > > Vi

RE: how to force nutch to do a recrawl

2009-12-10 Thread Peters, Vijaya
Adam, What do I use to open a CRC file? I tried QuickSFV. Thanks in advance! Vijaya Peters

Re: NOINDEX, NOFOLLOW

2009-12-10 Thread Andrzej Bialecki
On 2009-12-10 20:33, Kirby Bohling wrote: On Thu, Dec 10, 2009 at 12:22 PM, BELLINI ADAM wrote: hi, i have a page with a NOINDEX, NOFOLLOW robots meta tag, now i know that nutch obey to this tag because i dont find the content and the title in my index, but i was wondering that this document will not be present in the index.

RE: NOINDEX, NOFOLLOW

2009-12-10 Thread BELLINI ADAM
hi, thx for these informations, but since i'm using solr index, and when i make a search i get a blank result... for example if i will have 10 documents as a search result, 9 will be ok (because i display the title and 4 first lines of content), but i obtain one blank result becoz of this pag

RE: how to force nutch to do a recrawl

2009-12-10 Thread BELLINI ADAM
it will not dump to the console ! whole_db is a folder and you have to edit the file you will find in this folder > Subject: RE: how to force nutch to do a recrawl > Date: Thu, 10 Dec 2009 14:26:30 -0500 > From: vijaya_pet...@sra.com > To: nutch-user@lucene.apache.org > > Adam, > I tried runni
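
A small Python sketch for locating the readable dump file on Windows without vi. It assumes the dump folder holds plain-text part files alongside `.crc` checksum files, which are binary and not meant to be opened. The helper names and the `whole_db` path are illustrative, not part of Nutch.

```python
from pathlib import Path


def is_checksum_file(name):
    """True for checksum side files, which are binary and not meant to be read."""
    return name.endswith(".crc")


def dump_text_files(dump_dir):
    """List the non-checksum files in the dump folder (empty if it is missing)."""
    folder = Path(dump_dir)
    if not folder.is_dir():
        return []
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and not is_checksum_file(p.name)
    )


if __name__ == "__main__":
    # "whole_db" is the folder name passed to -dump in this thread.
    for path in dump_text_files("whole_db"):
        print(path)
```

Any file this prints should open in an ordinary text editor.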

Re: NOINDEX, NOFOLLOW

2009-12-10 Thread Kirby Bohling
On Thu, Dec 10, 2009 at 12:22 PM, BELLINI ADAM wrote: > > hi, > > i have a page with a NOINDEX, NOFOLLOW robots meta tag, now i > know that nutch obey to this tag because i dont find the content and the > title in my index, but i was wondering that this document will not be present > in the index. why he keep the document in my

RE: how to force nutch to do a recrawl

2009-12-10 Thread Peters, Vijaya
Adam, I tried running that command and get the following (it created a whole_db directory, but it's not dumping out the contents to the console):
    $ bin/nutch readdb crawl/crawldb/ -dump whole_db
    CrawlDb dump: starting
    CrawlDb db: crawl/crawldb/
    CrawlDb dump: done
Vijaya Peters SRA International,

domain vs www.domain?

2009-12-10 Thread Jesse Hires
I'm seeing a lot of duplicates where a single site is getting recognized as two different sites. Specifically I am seeing www.domain.com and domain.com being recognized as two different sites. I imagine there is a setting to prevent this. If so, what is the setting, if not, what would you recommend d

RE: how to force nutch to do a recrawl

2009-12-10 Thread BELLINI ADAM
hi, check the fetch time in your crawldb...you can dump all the crawldb like this:
    ./bin/nutch readdb $your_crawl_rep/crawldb/ -dump whole_db
entries will look like this:
    http://www.YOUR_URL_TO_FETCH
    Status: 2 (db_fetched)
    Fetch time: Thu Dec 10 09:19:18 EST 2009
    Modified time: Wed Dec 31 19:00
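
To check fetch times across many entries, the dump text can be scanned with a short script. The Python sketch below assumes each record starts with a line beginning with the URL and that fields appear as `Key: value` lines, as in the entry shown in this message. The function name `fetch_times` and the sample text are illustrative only, and the exact dump layout may differ between Nutch versions.

```python
def fetch_times(dump_text):
    """Map each URL in a crawldb text dump to its 'Fetch time' string."""
    times = {}
    current_url = None
    for raw_line in dump_text.splitlines():
        line = raw_line.strip()
        if line.startswith(("http://", "https://")):
            # Assumed: a record begins with the URL; keep only the URL token.
            current_url = line.split()[0]
        elif line.startswith("Fetch time:") and current_url is not None:
            times[current_url] = line[len("Fetch time:"):].strip()
    return times


# Sample record copied from the fields quoted in this thread.
sample = """http://www.YOUR_URL_TO_FETCH
Status: 2 (db_fetched)
Fetch time: Thu Dec 10 09:19:18 EST 2009
"""

if __name__ == "__main__":
    for url, when in fetch_times(sample).items():
        print(url, "->", when)
```

The fetch time is kept as a plain string rather than parsed, since time zone abbreviations such as EST are not handled portably by date parsers.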

NOINDEX, NOFOLLOW

2009-12-10 Thread BELLINI ADAM
hi, i have a page with a NOINDEX, NOFOLLOW robots meta tag, now i know that nutch obey to this tag because i dont find the content and the title in my index, but i was wondering that this document will not be present in the index. why he keep the document in my index with no title and no content ?? i'm using index-basic and in

Re: How to get all the crawled pages for perticular domain

2009-12-10 Thread Dennis Kubes
There is a domain-url filter. Is that what you were looking for? Dennis Yves Petinot wrote: Hi Bhavin, other nutch users may comment on this, but it seems to me that working on top of the nutchbase branch might allow you to perform that type of processing quite easily. -y bhavin pandya w

Re: How to get all the crawled pages for perticular domain

2009-12-10 Thread Yves Petinot
Hi Bhavin, other nutch users may comment on this, but it seems to me that working on top of the nutchbase branch might allow you to perform that type of processing quite easily. -y bhavin pandya wrote: Hi, I have setup nutch 1.0 on cluster of 3 nodes. We are running two application. 1. N