Does anybody know how to solve this problem?
--
View this message in context:
http://old.nabble.com/Nutch-indexes-less-pages%2C-then-it-fetches-tp26078798p26542690.html
Sent from the Nutch - User mailing list archive at Nabble.com.
grateful.
But I'm also curious why this is happening... Maybe someone can explain?
what is the db status of this url in your crawl db?
if it is STATUS_DB_NOTMODIFIED,
then that may be the reason.
you can check it by dumping your crawl db with
reinh...@thord:bin/nutch readdb crawldb -url url
it has this status if it is recrawled and the signature does not change.
the signature
yes, it's permanently redirected.
you can also check the segment status of this url.
here is an example:
reinh...@thord:bin/nutch readseg -get crawl/segments/20091028122455
http://www.krems.at/fotoalbum/fotoalbum.asp?albumid=37big=1seitenid=20;
it will show you whether it is parsed and the
hmm i have no idea now.
check the reduce method in IndexerMapReduce and add some debug
statements there.
recompile nutch and try it again.
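
A minimal sketch of the kind of debug check meant here. It is not the Nutch source: the class and the helper missingParts are hypothetical, and plain Objects stand in for the four values (fetchDatum, dbDatum, parseText, parseData) that the reduce method tests for null. It only shows how to report which value is missing before a page is skipped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch, not part of Nutch.
public class IndexSkipDebug {

    // Returns the names of the inputs that are null, in a fixed order.
    static List<String> missingParts(Object fetchDatum, Object dbDatum,
                                     Object parseText, Object parseData) {
        List<String> missing = new ArrayList<>();
        if (fetchDatum == null) missing.add("fetchDatum");
        if (dbDatum == null) missing.add("dbDatum");
        if (parseText == null) missing.add("parseText");
        if (parseData == null) missing.add("parseData");
        return missing;
    }

    public static void main(String[] args) {
        // Placeholder values; in a real reduce method these would be the
        // objects collected for one url.
        List<String> missing = missingParts("fetch", null, "text", "data");
        if (!missing.isEmpty()) {
            System.out.println("skipping, missing: " + missing);
        }
    }
}
```

Logging the returned list together with the url, just before the early return, would show which of the four inputs is absent for a page that was fetched but not indexed.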
caezar wrote:
Thanks, checked, it was parsed. Still no answer why it was not indexed.
reinhard schwab wrote:
yes, it's permanently redirected.
you
caezar wrote:
Some more information. While debugging the reduce method, I noticed that before the code
if (fetchDatum == null || dbDatum == null
|| parseText == null || parseData == null) {
return; // only have inlinks
}
my page has fetchDatum,
Integration
http://www.sigram.com Contact: info at sigram dot com
is your problem solved now???
this can be ok.
newly discovered urls are added to a segment when fetched documents
are parsed, provided these urls pass the filters.
they will not have a Generate crawl datum because they are unknown until
they are extracted.
regards
caezar wrote:
I've compared
what is in the crawl db?
reinh...@thord:bin/nutch readdb crawldb -url url
caezar wrote:
No, the problem is not solved. Everything happens as you described, but the page is
not indexed, because of the condition:
if (fetchDatum == null || dbDatum == null
|| parseText == null || parseData
? Is there a way to browse the crawldb
to make sure that the page was really fetched? What else could I check?
Thanks
NutchIndexWriter and it logs every page for which its write method is
executed). What could be the reason? Is there a way to browse the crawldb
to make sure that the page was really fetched? What else could I check?
Thanks
I have had a similar experience.
Reinhard schwab suggested a possible fix. See the mail in this group from
Reinhard schwab at
Sun, 25 Oct 2009 10:03:41 +0100 (05:03 EDT)
I haven't had a chance to try it out yet.
On Tue, 2009-10-27 at 07:34 -0700, caezar wrote:
Hi All,
I've got a strange problem,