What is the db status of this URL in your crawl db?
If it is STATUS_DB_NOTMODIFIED,
that may be the reason.
(You can check it if you dump your crawl db with
reinh...@thord:bin/nutch readdb crawldb -url url)
A URL has this status if it is recrawled and its signature does not change.
the signature
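The notmodified decision boils down to comparing the page's content signature across fetches. A minimal sketch of the idea, assuming a plain MD5 digest (Nutch's signature implementation is pluggable; this is not its actual MD5Signature class):

```java
import java.security.MessageDigest;
import java.util.Arrays;

// Sketch: on recrawl, a page whose content digest equals the digest
// stored in the crawl db would be marked STATUS_DB_NOTMODIFIED.
public class SignatureCheck {
    // Compute an MD5 digest of the fetched content.
    public static byte[] signature(byte[] content) throws Exception {
        return MessageDigest.getInstance("MD5").digest(content);
    }

    // True when the fetched content hashes to the stored signature,
    // i.e. the page would be treated as not modified.
    public static boolean notModified(byte[] stored, byte[] fetched) throws Exception {
        return Arrays.equals(stored, signature(fetched));
    }

    public static void main(String[] args) throws Exception {
        byte[] stored = signature("<html>same content</html>".getBytes("UTF-8"));
        System.out.println(notModified(stored, "<html>same content</html>".getBytes("UTF-8")));
        System.out.println(notModified(stored, "<html>changed</html>".getBytes("UTF-8")));
    }
}
```

So a recrawled page whose text is byte-identical keeps its old signature and can be skipped downstream.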
Sorry, but how could I do this?
皮皮 wrote:
check the parse data first; maybe it failed to parse.
2009/10/27 caezar caeza...@gmail.com
Hi All,
I've got a strange problem: nutch indexes many fewer URLs than it
fetches. For example URL:
Thanks, that was really helpful. I've moved forward but still haven't found the
solution.
So the status of the initial URL
(http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm) is:
Status: 5 (db_redir_perm)
Metadata: _pst_: moved(12), lastModified=0:
yes, it's permanently redirected.
you can check also the segment status of this url
here is an example
reinh...@thord:bin/nutch readseg -get crawl/segments/20091028122455
http://www.krems.at/fotoalbum/fotoalbum.asp?albumid=37&big=1&seitenid=20
it will show you whether it is parsed and the
Thanks, I checked; it was parsed. Still no answer as to why it was not indexed.
reinhard schwab wrote:
yes, it's permanently redirected.
you can check also the segment status of this url
here is an example
reinh...@thord:bin/nutch readseg -get crawl/segments/20091028122455
Hmm, I have no idea now.
Check the reduce method in IndexerMapReduce and add some debug
statements there.
Recompile Nutch and try it again.
caezar schrieb:
Thanks, checked, it was parsed. Still no answer why it was not indexed
reinhard schwab wrote:
yes, it's permanently redirected.
you
In IndexerMapReduce.reduce there is this code:
if (CrawlDatum.STATUS_LINKED == datum.getStatus() ||
CrawlDatum.STATUS_SIGNATURE == datum.getStatus()) {
continue;
}
And the status of the redirect-target URL is indeed linked. That's why it's
skipped. But what
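That filter can be illustrated standalone. The sketch below inlines stand-in status constants (the numeric values here are illustrative, not copied from Nutch's CrawlDatum), to show why a URL known only through inlinks never reaches the indexer:

```java
// Standalone illustration of the filter in IndexerMapReduce.reduce:
// datums that merely record an inlink (STATUS_LINKED) or carry only a
// signature (STATUS_SIGNATURE) are skipped.
public class ReduceSkip {
    // Stand-ins for CrawlDatum's constants; values are illustrative only.
    static final int STATUS_LINKED = 1;
    static final int STATUS_SIGNATURE = 2;
    static final int STATUS_FETCH_SUCCESS = 3;

    // True when the reduce loop would `continue` past this datum.
    public static boolean skipped(int status) {
        return status == STATUS_LINKED || status == STATUS_SIGNATURE;
    }

    public static void main(String[] args) {
        System.out.println(skipped(STATUS_LINKED));
        System.out.println(skipped(STATUS_FETCH_SUCCESS));
    }
}
```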
Some more information. While debugging the reduce method, I've noticed that before the code
if (fetchDatum == null || dbDatum == null
|| parseText == null || parseData == null) {
return; // only have inlinks
}
my page has fetchDatum, parseText and
caezar wrote:
Some more information. Debugging reduce method I've noticed, that before code
if (fetchDatum == null || dbDatum == null
|| parseText == null || parseData == null) {
return; // only have inlinks
}
my page has fetchDatum,
(cross posted to many user lists, please confine reply to gene...@lucene)
There will be a Lucene meetup next week at ApacheCon in Oakland, CA on
Tuesday, November 3rd. Meetups are free (the rest of the conference is
not). See: http://wiki.apache.org/lucene-java/LuceneAtApacheConUs2009
For
I'm pretty sure that I ran both commands before indexing
Andrzej Bialecki wrote:
caezar wrote:
Some more information. Debugging reduce method I've noticed, that before
code
if (fetchDatum == null || dbDatum == null
|| parseText == null || parseData == null) {
return;
I've compared the segment data of a URL which has no redirect and was
indexed correctly with this bad URL, and there really is a difference.
The first one has a db record in the segment:
Crawl Generate::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Wed Oct 28 16:01:05 EET 2009
Modified time:
is your problem solved now?
This can be OK.
Newly discovered URLs will be added to a segment when fetched documents
are parsed, if these URLs pass the filters.
They will not have a Crawl Generate datum because they are unknown until
they are extracted.
regards
caezar schrieb:
I've compared
No, the problem is not solved. Everything happens as you described, but the page is
not indexed because of the condition:
if (fetchDatum == null || dbDatum == null
|| parseText == null || parseData == null) {
return; // only have inlinks
}
in
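The guard that bites here can be shown standalone (a sketch with plain placeholders, not the real Nutch types): the reducer only emits a document when all four pieces for the URL are present in the segment, and a redirect target reached only via the redirect may be missing one of them.

```java
// Sketch of the "only have inlinks" guard in IndexerMapReduce.reduce:
// a document is emitted only when the fetch datum, crawl-db datum,
// parse text and parse data are all present for the URL.
public class IndexGate {
    public static boolean indexable(Object fetchDatum, Object dbDatum,
                                    Object parseText, Object parseData) {
        return fetchDatum != null && dbDatum != null
            && parseText != null && parseData != null;
    }

    public static void main(String[] args) {
        // A redirect target that has fetch and parse output but no
        // crawl-db datum fails the gate and is silently dropped.
        System.out.println(indexable("fetch", null, "text", "data"));
    }
}
```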
what is in the crawl db?
reinh...@thord:bin/nutch readdb crawldb -url url
caezar schrieb:
No, problem is not solved. Everything happens as you described, but page is
not indexed, because of condition:
if (fetchDatum == null || dbDatum == null
|| parseText == null || parseData
Hi,
the unsubscription message doesn't work. Please remove me from the
list.
Thanks.
Hi,
I want to explicitly specify the location of the indexes folder.
As I understand it, I should specify some property pointing to the index
location in Configuration before executing the parse method on Query:
Query.parse(query, lang, conf)
Am I right?
Are there any other ways of doing it?
Thanks.
The unsubscription message does not work for me either.
Could you please help to remove me (caoyuzh...@hotmail.com) from the nutch and
hadoop mailing lists?
Subject: Please, unsubscribe me
From: nsa...@officinedigitali.it
To: nutch-user@lucene.apache.org
Date: Wed, 28 Oct 2009 16:43:05 +0100
Me too. Could you please help to remove me (cuong09m@gmail.com) from the
nutch and hadoop mailing lists?
-Original Message-
From: caoyuzhong [mailto:caoyuzh...@hotmail.com]
Sent: Thursday, October 29, 2009 9:49 AM
To: nutch-user@lucene.apache.org
Subject: RE: Please, unsubscribe me
The
List-Help: mailto:nutch-user-h...@lucene.apache.org
List-Unsubscribe: mailto:nutch-user-unsubscr...@lucene.apache.org
List-Post: mailto:nutch-user@lucene.apache.org
List-Id: nutch-user.lucene.apache.org
2009/10/29 Le Manh Cuong cuong...@gmail.com
Me too, Could you please help to remove me
Sorry, but the last time I tried to unsubscribe, it didn't work.
And now it doesn't work either, :).
-Original Message-
From: SunGod [mailto:sun...@cheemer.org]
Sent: Thursday, October 29, 2009 10:09 AM
To: nutch-user@lucene.apache.org
Subject: Re: Please, unsubscribe me
List-Help:
You can put a full path in nutch-site.xml:
<property>
  <name>searcher.dir</name>
  <value>/full/path/crawl</value>
  <description>
    Path to root of crawl. This directory is searched (in
    order) for either the file search-servers.txt, containing a list of
    distributed search servers, or the directory index
  </description>
</property>
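For illustration, a Hadoop-style property file like nutch-site.xml can be resolved with a few lines of standard DOM parsing. This is a sketch, not Nutch's actual loader (Nutch uses Hadoop's org.apache.hadoop.conf.Configuration); it just shows how the property/name/value structure is read:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Sketch: look up a named property in Hadoop-style configuration XML.
public class ConfRead {
    public static String get(String xml, String wanted) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        NodeList props = doc.getElementsByTagName("property");
        for (int i = 0; i < props.getLength(); i++) {
            Element p = (Element) props.item(i);
            String name = p.getElementsByTagName("name")
                    .item(0).getTextContent().trim();
            if (wanted.equals(name)) {
                return p.getElementsByTagName("value")
                        .item(0).getTextContent().trim();
            }
        }
        return null; // property not set in this file
    }

    public static void main(String[] args) throws Exception {
        String xml = "<configuration><property>"
                + "<name>searcher.dir</name>"
                + "<value>/full/path/crawl</value>"
                + "</property></configuration>";
        System.out.println(get(xml, "searcher.dir"));
    }
}
```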