Re: Nutch indexes less pages, then it fetches

2009-11-27 Thread J. Smith
Does anybody know how to solve this problem? -- View this message in context: http://old.nabble.com/Nutch-indexes-less-pages%2C-then-it-fetches-tp26078798p26542690.html Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Nutch indexes less pages, then it fetches

2009-11-27 Thread J. Smith
this problem?

Re: Nutch indexes less pages, then it fetches

2009-11-27 Thread J. Smith
grateful. But also I'm curious why this is happening... Maybe someone can explain?

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread reinhard schwab
what is the db status of this url in your crawl db? if it is STATUS_DB_NOTMODIFIED, then that may be the reason. (you can check it by querying your crawl db with reinh...@thord:bin/nutch readdb crawldb -url url) a url has this status if it is recrawled and the signature does not change. the signature
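The "recrawled and the signature does not change" rule can be illustrated with a minimal sketch. It assumes the signature is an MD5 digest of the raw page content; the class name, method names, and status strings below are hypothetical stand-ins for illustration, not Nutch's own API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

class SignatureSketch {
    // Hypothetical helper: MD5 digest of the raw page bytes, standing in for the page signature.
    static byte[] signature(byte[] content) {
        try {
            return MessageDigest.getInstance("MD5").digest(content);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Hypothetical status labels: an unchanged signature on recrawl means "not modified".
    static String statusAfterRecrawl(byte[] oldSignature, byte[] newContent) {
        return Arrays.equals(oldSignature, signature(newContent)) ? "db_notmodified" : "db_fetched";
    }

    public static void main(String[] args) {
        byte[] firstFetch = "<html>same page</html>".getBytes(StandardCharsets.UTF_8);
        byte[] stored = signature(firstFetch);
        // Recrawl with identical content, then with changed content.
        System.out.println(statusAfterRecrawl(stored, firstFetch));
        System.out.println(statusAfterRecrawl(stored, "<html>changed</html>".getBytes(StandardCharsets.UTF_8)));
    }
}
```

If the status reported by readdb for the url is the not-modified one, this comparison is the kind of check that would have produced it.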

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread caezar

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread caezar

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread reinhard schwab
yes, it's permanently redirected. you can also check the segment status of this url. here is an example: reinh...@thord:bin/nutch readseg -get crawl/segments/20091028122455 http://www.krems.at/fotoalbum/fotoalbum.asp?albumid=37big=1seitenid=20; it will show you whether it is parsed and the

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread caezar

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread reinhard schwab
hmm i have no idea now. check the reduce method in IndexerMapReduce and add some debug statements there. recompile nutch and try it again. caezar wrote: Thanks, checked, it was parsed. Still no answer why it was not indexed. reinhard schwab wrote: yes, it's permanently redirected. you

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread caezar

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread caezar

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread Andrzej Bialecki
caezar wrote: Some more information. Debugging the reduce method, I've noticed that before the code if (fetchDatum == null || dbDatum == null || parseText == null || parseData == null) { return; // only have inlinks } my page has fetchDatum,
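The guard quoted above returns silently as soon as any one of the four values is null, so it does not say which one was missing. A minimal sketch of the kind of debug statement suggested in the thread follows. It is self-contained: `Object` stands in for the real Nutch types, and the class and method names are hypothetical rather than taken from IndexerMapReduce.

```java
import java.util.ArrayList;
import java.util.List;

class IndexGuardSketch {
    // Lists which of the four inputs are null; Object stands in for the real Nutch types.
    static List<String> missingParts(Object fetchDatum, Object dbDatum,
                                     Object parseText, Object parseData) {
        List<String> missing = new ArrayList<>();
        if (fetchDatum == null) missing.add("fetchDatum");
        if (dbDatum == null) missing.add("dbDatum");
        if (parseText == null) missing.add("parseText");
        if (parseData == null) missing.add("parseData");
        return missing;
    }

    // Same condition as the quoted guard, but it logs what was missing before giving up.
    static boolean passesGuard(String url, Object fetchDatum, Object dbDatum,
                               Object parseText, Object parseData) {
        List<String> missing = missingParts(fetchDatum, dbDatum, parseText, parseData);
        if (!missing.isEmpty()) {
            System.err.println("skipping " + url + ", missing: " + missing);
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // A page with fetch and parse data but no crawl db entry is skipped.
        System.out.println(passesGuard("http://example.com/", "fetch", null, "text", "data"));
    }
}
```

Passing this guard only means the page reaches the rest of the reduce method; later checks there may still drop it.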

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread caezar

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread caezar

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread reinhard schwab
is your problem solved now? this can be ok. newly discovered urls will be added to a segment when fetched documents are parsed and if these urls pass the filters. they will not have a crawl datum Generate because they are unknown until they are extracted. regards caezar wrote: I've compared

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread caezar

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread reinhard schwab
what is in the crawl db? reinh...@thord:bin/nutch readdb crawldb -url url caezar wrote: No, the problem is not solved. Everything happens as you described, but the page is not indexed, because of the condition: if (fetchDatum == null || dbDatum == null || parseText == null || parseData

Nutch indexes less pages, then it fetches

2009-10-27 Thread caezar
? Is there a way to browse the crawldb to ensure that the page was really fetched? What else could I check? Thanks

Re: Nutch indexes less pages, then it fetches

2009-10-27 Thread 皮皮
NutchIndexWriter and it logs every page for which its write method is executed). What could be a possible reason? Is there a way to browse the crawldb to ensure that the page was really fetched? What else could I check? Thanks

Re: Nutch indexes less pages, then it fetches

2009-10-27 Thread kevin chen
I have had a similar experience. Reinhard Schwab responded with a possible fix; see his mail in this group from Sun, 25 Oct 2009 10:03:41 +0100 (05:03 EDT). I haven't had a chance to try it out yet. On Tue, 2009-10-27 at 07:34 -0700, caezar wrote: Hi All, I've got a strange problem,