Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread reinhard schwab
What is the db status of this URL in your crawl db? If it is STATUS_DB_NOTMODIFIED, that may be the reason. (You can check it by querying your crawl db with reinh...@thord:bin/nutch readdb crawldb -url url) It has this status if it is recrawled and the signature does not change. The signature

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread caezar
Sorry, but how could I do this? 皮皮 wrote: Check the parse data first; maybe the parse was unsuccessful. 2009/10/27 caezar caeza...@gmail.com Hi All, I've got a strange problem: Nutch indexes far fewer URLs than it fetches. For example URL:

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread caezar
Thanks, that was really helpful. I've moved forward but still have not found the solution. The status of the initial URL (http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm) is: Status: 5 (db_redir_perm) Metadata: _pst_: moved(12), lastModified=0:

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread reinhard schwab
Yes, it's permanently redirected. You can also check the segment status of this URL. Here is an example: reinh...@thord:bin/nutch readseg -get crawl/segments/20091028122455 http://www.krems.at/fotoalbum/fotoalbum.asp?albumid=37big=1seitenid=20; It will show you whether it is parsed and the

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread caezar
Thanks, checked, it was parsed. Still no answer why it was not indexed reinhard schwab wrote: yes, its permanently redirected. you can check also the segment status of this url here is an example reinh...@thord:bin/nutch readseg -get crawl/segments/20091028122455

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread reinhard schwab
Hmm, I have no idea now. Check the reduce method in IndexerMapReduce and add some debug statements there. Recompile Nutch and try it again. caezar wrote: Thanks, checked, it was parsed. Still no answer why it was not indexed reinhard schwab wrote: yes, its permanently redirected. you

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread caezar
In IndexerMapReduce.reduce there is this code: if (CrawlDatum.STATUS_LINKED == datum.getStatus() || CrawlDatum.STATUS_SIGNATURE == datum.getStatus()) { continue; } And the status of the redirect target URL really is linked. That's why it's skipped. But what
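To make the effect of the quoted check concrete, here is a minimal, self-contained sketch. It is not Nutch code: the `Status` enum and the `keepIndexable` helper are hypothetical stand-ins for `CrawlDatum` statuses and the loop inside `reduce`, and only the skip condition itself is taken from the snippet quoted in this message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StatusFilterSketch {
    // Hypothetical stand-in for CrawlDatum status constants; not Nutch's real class.
    enum Status { DB_UNFETCHED, DB_REDIR_PERM, FETCH_SUCCESS, LINKED, SIGNATURE }

    // Returns the statuses that survive the quoted check:
    // entries with status LINKED or SIGNATURE are skipped.
    static List<Status> keepIndexable(List<Status> statuses) {
        List<Status> kept = new ArrayList<>();
        for (Status s : statuses) {
            if (s == Status.LINKED || s == Status.SIGNATURE) {
                continue; // mirrors the `continue` in the quoted snippet
            }
            kept.add(s);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Status> seen = Arrays.asList(
                Status.LINKED, Status.FETCH_SUCCESS, Status.SIGNATURE);
        // Only the entry that is neither LINKED nor SIGNATURE remains.
        System.out.println(keepIndexable(seen));
    }
}
```

In this model, a URL for which the only datum seen has status linked contributes nothing to the later steps, which matches the behaviour described in the message.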

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread caezar
Some more information. While debugging the reduce method, I noticed that before the code if (fetchDatum == null || dbDatum == null || parseText == null || parseData == null) { return; // only have inlinks } my page has fetchDatum, parseText and

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread Andrzej Bialecki
caezar wrote: Some more information. Debugging reduce method I've noticed, that before code if (fetchDatum == null || dbDatum == null || parseText == null || parseData == null) { return; // only have inlinks } my page has fetchDatum,

[ANNOUNCE] Lucene MeetUp in Oakland, CA - Tue Nov 3rd @ 8PM

2009-10-28 Thread Chris Hostetter
(cross posted to many user lists, please confine reply to gene...@lucene) There will be a Lucene meetup next week at ApacheCon in Oakland, CA on Tuesday, November 3rd. Meetups are free (the rest of the conference is not). See: http://wiki.apache.org/lucene-java/LuceneAtApacheConUs2009 For

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread caezar
I'm pretty sure that I ran both commands before indexing Andrzej Bialecki wrote: caezar wrote: Some more information. Debugging reduce method I've noticed, that before code if (fetchDatum == null || dbDatum == null || parseText == null || parseData == null) { return;

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread caezar
I've compared the segment data of a URL which has no redirect and was indexed correctly with that of this bad URL, and there really is a difference. The first one has a db record in the segment: Crawl Generate:: Version: 7 Status: 1 (db_unfetched) Fetch time: Wed Oct 28 16:01:05 EET 2009 Modified time:

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread reinhard schwab
Is your problem solved now? This can be OK. Newly discovered URLs are added to a segment when fetched documents are parsed, provided these URLs pass the filters. They will not have a Generate crawl datum because they are unknown until they are extracted. Regards. caezar wrote: I've compared

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread caezar
No, the problem is not solved. Everything happens as you described, but the page is not indexed because of the condition: if (fetchDatum == null || dbDatum == null || parseText == null || parseData == null) { return; // only have inlinks } in
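The guard quoted here can be modelled in isolation to see which combinations of values let a page through. The sketch below is an illustration only: `Parts`, `wouldIndex` and `missing` are hypothetical names, plain `Object` references stand in for the Nutch types, and the choice of `dbDatum` as the absent value in `main` is just an example.

```java
import java.util.ArrayList;
import java.util.List;

public class ReduceGuardSketch {
    // Hypothetical holder for the four values the quoted condition checks.
    static final class Parts {
        final Object fetchDatum, dbDatum, parseText, parseData;

        Parts(Object fetchDatum, Object dbDatum, Object parseText, Object parseData) {
            this.fetchDatum = fetchDatum;
            this.dbDatum = dbDatum;
            this.parseText = parseText;
            this.parseData = parseData;
        }
    }

    // Same test as the quoted snippet: any missing part means the page is not indexed.
    static boolean wouldIndex(Parts p) {
        if (p.fetchDatum == null || p.dbDatum == null
                || p.parseText == null || p.parseData == null) {
            return false; // corresponds to the early `return` in the snippet
        }
        return true;
    }

    // Lists the names of the missing parts, as a debug statement might.
    static String missing(Parts p) {
        List<String> names = new ArrayList<>();
        if (p.fetchDatum == null) names.add("fetchDatum");
        if (p.dbDatum == null) names.add("dbDatum");
        if (p.parseText == null) names.add("parseText");
        if (p.parseData == null) names.add("parseData");
        return String.join(", ", names);
    }

    public static void main(String[] args) {
        Parts partial = new Parts("fetch", null, "text", "data");
        System.out.println("indexed: " + wouldIndex(partial)
                + ", missing: " + missing(partial));
    }
}
```

Logging the output of a helper like `missing` just before the guard is one way to find out which of the four values is absent for a given URL.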

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread reinhard schwab
What is in the crawl db? reinh...@thord:bin/nutch readdb crawldb -url url caezar wrote: No, problem is not solved. Everything happens as you described, but page is not indexed, because of condition: if (fetchDatum == null || dbDatum == null || parseText == null || parseData

Please, unsubscribe me

2009-10-28 Thread Nico Sabbi
Hi, the unsubscription message doesn't work. Please, remove me from the list. Thanks.

How to specify in webapp where to find indexes?

2009-10-28 Thread Dmitriy Fundak
Hi, I want to explicitly specify the location of the indexes folder. As I understand it, I should set some property pointing to the index location in the Configuration before calling the parse method on Query: Query.parse(query, lang, conf). Am I right? Are there any other ways of doing it? Thanks.

RE: Please, unsubscribe me

2009-10-28 Thread caoyuzhong
The unsubscription message does not work for me either. Could you please help to remove me (caoyuzh...@hotmail.com) from the nutch and hadoop mailing lists? Subject: Please, unsubscribe me From: nsa...@officinedigitali.it To: nutch-user@lucene.apache.org Date: Wed, 28 Oct 2009 16:43:05 +0100

RE: Please, unsubscribe me

2009-10-28 Thread Le Manh Cuong
Me too, Could you please help to remove me (cuong09m @gmail.com) from the nutch and hadoop mail list? -Original Message- From: caoyuzhong [mailto:caoyuzh...@hotmail.com] Sent: Thursday, October 29, 2009 9:49 AM To: nutch-user@lucene.apache.org Subject: RE: Please, unsubscribe me The

Re: Please, unsubscribe me

2009-10-28 Thread SunGod
List-Help: mailto:nutch-user-h...@lucene.apache.org List-Unsubscribe: mailto:nutch-user-unsubscr...@lucene.apache.org List-Post: mailto:nutch-user@lucene.apache.org List-Id: nutch-user.lucene.apache.org 2009/10/29 Le Manh Cuong cuong...@gmail.com Me too, Could you please help to remove me

RE: Please, unsubscribe me

2009-10-28 Thread Le Manh Cuong
Sorry, but the last time I tried to unsubscribe, it didn't work. And it doesn't work now either. :) -Original Message- From: SunGod [mailto:sun...@cheemer.org] Sent: Thursday, October 29, 2009 10:09 AM To: nutch-user@lucene.apache.org Subject: Re: Please, unsubscribe me List-Help:

Re: How to specify in webapp where to find indexes?

2009-10-28 Thread kevin chen
You can put a full path in nutch-site.xml: <property> <name>searcher.dir</name> <value>/full/path/crawl</value> <description>Path to root of crawl. This directory is searched (in order) for either the file search-servers.txt, containing a list of distributed search servers, or the directory index
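The lookup order in that description can be sketched as follows. This is a hedged illustration, not the webapp's actual code: `SearcherDirSketch` and `resolve` are hypothetical names, and only the two entries visible in the truncated description (`search-servers.txt`, then `index`) are modelled, so the real lookup may consider more.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SearcherDirSketch {
    // Reports which of the two entries named in the description is found
    // first under the configured directory, checking them in the stated order.
    static String resolve(Path searcherDir) {
        if (Files.isRegularFile(searcherDir.resolve("search-servers.txt"))) {
            return "search-servers.txt";
        }
        if (Files.isDirectory(searcherDir.resolve("index"))) {
            return "index";
        }
        return "none";
    }

    public static void main(String[] args) throws IOException {
        // A temporary directory stands in for /full/path/crawl.
        Path dir = Files.createTempDirectory("crawl");
        Files.createDirectory(dir.resolve("index"));
        System.out.println(resolve(dir));

        // Once the server list exists, it takes precedence over the index directory.
        Files.createFile(dir.resolve("search-servers.txt"));
        System.out.println(resolve(dir));
    }
}
```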