For the specific case I was running into (on a single known domain) using
regex-urlnormalizer did the trick. Thanks!
Jesse
int GetRandomNumber()
{
    return 4; // Chosen by fair roll of dice
              // Guaranteed to be random
} // xkcd.com
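For anyone hitting the same www.domain.com vs. domain.com duplication, the kind of rewrite a regex-based normalizer rule performs can be sketched as below. This is a Python illustration only, not Nutch configuration: the function name `strip_www` and the choice to collapse onto the bare domain are assumptions, and the actual rule used in this thread is not shown in the message.

```python
import re

# Hypothetical rule: collapse "www.domain.com" onto "domain.com".
# Matches the scheme, then a literal leading "www." in the host.
_WWW_PREFIX = re.compile(r"^(https?://)www\.", re.IGNORECASE)


def strip_www(url: str) -> str:
    """Return the URL with a leading 'www.' removed from the host."""
    return _WWW_PREFIX.sub(r"\1", url)


if __name__ == "__main__":
    for u in ("http://www.domain.com/page", "http://domain.com/page"):
        print(strip_www(u))
```

Both sample URLs come out in the same form, which is what lets a later deduplication step treat them as one site. Collapsing in the other direction (always adding "www.") would work equally well on a single known domain.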
On Thu, Dec 10, 2009 at 1:01 PM, Andrzej Bia
but how are you running sh scripts...
you have to use cygwin to be able to edit linux files
> Subject: RE: how to force nutch to do a recrawl
> Date: Thu, 10 Dec 2009 16:09:13 -0500
> From: vijaya_pet...@sra.com
> To: nutch-user@lucene.apache.org
>
> Adam,
> I'm on windows unfortunately!!
Adam,
I'm on windows unfortunately!! I'm using cygdrive, but it doesn't
recognize vi. Any idea for opening it in windows? Notepad didn't work
either.
Vijaya Peters
SRA International, Inc.
4350 Fair Lakes Court North
Room 4004
Fairfax, VA 22033
Tel: 703-502-1184
www.sra.com
Named to FORTUNE's
On Thu, Dec 10, 2009 at 2:55 PM, BELLINI ADAM wrote:
>
> hi,
>
> thx for these informations, but since i'm using solr index, and when i make a
> search i get a blank result...
> for example if i will have 10 documents as a search result, 9 will be ok
> (because i display the title and 4 first l
On 2009-12-10 19:59, Jesse Hires wrote:
I'm seeing a lot of duplicates where a single site is getting recognized as
two different sites. Specifically I am seeing www.domain.com and
domain.com being recognized as two different sites.
I imagine there is a setting to prevent this. If so, what is the
just use vi or vim
i use vi to edit the file
> Subject: RE: how to force nutch to do a recrawl
> Date: Thu, 10 Dec 2009 15:58:24 -0500
> From: vijaya_pet...@sra.com
> To: nutch-user@lucene.apache.org
>
> Adam,
> What do I use to open a CRC file? I tried QuickSFV. Thanks in advance!
>
> Vi
Adam,
What do I use to open a CRC file? I tried QuickSFV. Thanks in advance!
Vijaya Peters
SRA International, Inc.
4350 Fair Lakes Court North
Room 4004
Fairfax, VA 22033
Tel: 703-502-1184
www.sra.com
Named to FORTUNE's "100 Best Companies to Work For" list for 10
consecutive years
Please co
On 2009-12-10 20:33, Kirby Bohling wrote:
On Thu, Dec 10, 2009 at 12:22 PM, BELLINI ADAM wrote:
hi,
i have a page with, now i know
that nutch obeys this tag because i don't find the content and the title in my index, but i was
expecting that this document would not be present in the index.
hi,
thx for these informations, but since i'm using solr index, and when i make a
search i get a blank result...
for example if i will have 10 documents as a search result, 9 will be ok
(because i display the title and 4 first lines of content), but i obtain one
blank result becoz of this pag
it will not dump to the console!
whole_db is a folder, and you have to open the file you will find inside it
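Since the dump lands in a directory rather than on the console, one way to read it without vi is a small script that prints every plain-text file in that directory. This is a minimal sketch, assuming the dump directory is the `whole_db` folder named in this thread and that files ending in `.crc` are checksum sidecars that can be skipped; the helper name `read_dump` is made up for illustration.

```python
import sys
from pathlib import Path


def read_dump(dump_dir: str) -> str:
    """Concatenate the text of every non-.crc file in dump_dir."""
    chunks = []
    for path in sorted(Path(dump_dir).iterdir()):
        # Skip subdirectories and checksum sidecars (assumed to end in .crc).
        if not path.is_file() or path.suffix == ".crc":
            continue
        chunks.append(path.read_text(encoding="utf-8", errors="replace"))
    return "".join(chunks)


if __name__ == "__main__":
    # Directory name taken from the thread; pass another path as argument 1.
    target = sys.argv[1] if len(sys.argv) > 1 else "whole_db"
    if Path(target).is_dir():
        print(read_dump(target))
    else:
        print(f"no such directory: {target}")
```

Run from the Nutch directory, this prints the dumped entries to the console, which also works from a Cygwin shell on Windows as long as Python is installed.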
> Subject: RE: how to force nutch to do a recrawl
> Date: Thu, 10 Dec 2009 14:26:30 -0500
> From: vijaya_pet...@sra.com
> To: nutch-user@lucene.apache.org
>
> Adam,
> I tried runni
On Thu, Dec 10, 2009 at 12:22 PM, BELLINI ADAM wrote:
>
> hi,
>
> i have a page with , now i
> know that nutch obeys this tag because i don't find the content and the
> title in my index, but i was expecting that this document would not be present
> in the index. why does it keep the document in my
Adam,
I tried running that command and get the following (it created a
whole_db directory, but it's not dumping out the contents to the
console):
$ bin/nutch readdb crawl/crawldb/ -dump whole_db
CrawlDb dump: starting
CrawlDb db: crawl/crawldb/
CrawlDb dump: done
Vijaya Peters
SRA International,
I'm seeing a lot of duplicates where a single site is getting recognized as
two different sites. Specifically I am seeing www.domain.com and
domain.com being recognized as two different sites.
I imagine there is a setting to prevent this. If so, what is the setting, if
not, what would you recommend d
hi,
check the fetch time in your crawldb...you can dump all the crawldb like this:
./bin/nutch readdb $your_crawl_rep/crawldb/ -dump whole_db
entries will look like this:
http://www.YOUR_URL_TO_FETCH
Status: 2 (db_fetched)
Fetch time: Thu Dec 10 09:19:18 EST 2009
Modified time: Wed Dec 31 19:00
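To check fetch times across a whole dump rather than reading it by eye, the entries can be parsed with a short script. This is a sketch only, assuming each entry starts with its URL and carries a `Fetch time:` line in the format shown in this message; the function name `fetch_times` is hypothetical, and the time zone abbreviation (EST here) is deliberately dropped rather than interpreted, so the result is a naive timestamp.

```python
import re
from datetime import datetime

# Matches e.g. "Fetch time: Thu Dec 10 09:19:18 EST 2009".
# The zone abbreviation is matched but not interpreted.
_FETCH = re.compile(
    r"^Fetch time:\s+(\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) (\w+) (\d{4})$"
)


def fetch_times(dump_text: str):
    """Yield (url, naive datetime) pairs from CrawlDb dump text."""
    url = None
    for line in dump_text.splitlines():
        line = line.strip()
        if line.startswith(("http://", "https://")):
            # Assumed layout: the URL opens each entry; keep only the URL token.
            url = line.split()[0]
            continue
        m = _FETCH.match(line)
        if m and url is not None:
            stamp = datetime.strptime(
                f"{m.group(1)} {m.group(3)}", "%a %b %d %H:%M:%S %Y"
            )
            yield url, stamp


# Sample entry copied from the message above (placeholder URL included).
SAMPLE = """http://www.YOUR_URL_TO_FETCH
Status: 2 (db_fetched)
Fetch time: Thu Dec 10 09:19:18 EST 2009
"""

if __name__ == "__main__":
    for url, when in fetch_times(SAMPLE):
        print(url, when.isoformat())
```

Comparing each timestamp against the current time shows which URLs are already due for a refetch and which are still scheduled in the future.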
hi,
i have a page with , now i
know that nutch obeys this tag because i don't find the content and the title
in my index, but i was expecting that this document would not be present in the
index. why does it keep the document in my index with no title and no content ??
i'm using index-basic and in
There is a domain-url filter. Is that what you were looking for?
Dennis
Yves Petinot wrote:
Hi Bhavin,
other nutch users may comment on this, but it seems to me that working
on top of the nutchbase branch might allow you to perform that type of
processing quite easily.
-y
bhavin pandya w
Hi Bhavin,
other nutch users may comment on this, but it seems to me that working
on top of the nutchbase branch might allow you to perform that type of
processing quite easily.
-y
bhavin pandya wrote:
Hi,
I have setup nutch 1.0 on cluster of 3 nodes.
We are running two application.
1. N