Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread reinhard schwab
what is the db status of this url in your crawl db?
if it is STATUS_DB_NOTMODIFIED,
then that may be the reason.
(you can check it if you dump your crawl db with
reinh...@thord:bin/nutch readdb crawldb -url <url>)

it has this status if it is recrawled and the signature does not change.
the signature is an MD5 hash of the content.
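(A related sketch, with placeholder paths: besides the per-URL lookup, readdb can also summarize or dump the whole db, which makes it easy to see how many pages sit in each status.)

```shell
# sketch with placeholder paths: summarize status counts across the
# crawldb, or dump the whole db as text to inspect individual entries
bin/nutch readdb crawl/crawldb -stats
bin/nutch readdb crawl/crawldb -dump crawldb_dump
```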

another reason may be that you have some indexing filters.
i don't believe it's the reason here.

regards


kevin chen schrieb:
 I have had a similar experience.

 Reinhard schwab posted a possible fix. See the mail in this group from
 Reinhard schwab at Sun, 25 Oct 2009 10:03:41 +0100 (05:03 EDT).

 I haven't had a chance to try it out yet.

 On Tue, 2009-10-27 at 07:34 -0700, caezar wrote:
 Hi All,

 I've got a strange problem: nutch indexes far fewer URLs than it
 fetches. For example the URL:
 http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm.
 I assume it fetched successfully because in the fetch logs it is mentioned only
 once:
 2009-10-26 10:01:46,502 INFO org.apache.nutch.fetcher.Fetcher: fetching
 http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm

 But it was not sent to the indexer during the indexing phase (I'm using a custom
 NutchIndexWriter, and it logs every page for which its write method is
 executed). What could be the possible reason? Is there a way to browse the
 crawldb to ensure that the page was really fetched? What else could I check?

 Thanks



Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread caezar

Sorry, but how could I do this?

皮皮 wrote:

 check the parse data first; maybe the parse was unsuccessful.

 2009/10/27 caezar caeza...@gmail.com


-- 
View this message in context: 
http://www.nabble.com/Nutch-indexes-less-pages%2C-then-it-fetches-tp26078798p26092612.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread caezar

Thanks, that was really helpful. I've moved forward but still haven't found the
solution.
The status of the initial URL
(http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm) is:
Status: 5 (db_redir_perm)
Metadata: _pst_: moved(12), lastModified=0:
http://www.1stdirectory.com/Companies/1627406_Darwins_Catering_Limited.htm

So that answers the question of why the initial page was not indexed: because it
was redirected.
Now checking the status of the redirect target:
Status: 2 (db_fetched)

So it was successfully fetched. But, according to the indexing log, it still was
not sent to the indexer!




-- 
View this message in context: 
http://www.nabble.com/Nutch-indexes-less-pages%2C-then-it-fetches-tp26078798p26092907.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread reinhard schwab
yes, it's permanently redirected.
you can also check the segment status of this url.
here is an example:

reinh...@thord:bin/nutch readseg -get crawl/segments/20091028122455 \
"http://www.krems.at/fotoalbum/fotoalbum.asp?albumid=37&big=1&seitenid=20"

it will show you whether it was parsed, and the extracted outlinks.
it will show any data related to this url stored in the segment.
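(A related sketch, with placeholder paths: when you don't know which segment holds a URL, readseg -dump writes a segment's records to a text file that can then be searched.)

```shell
# sketch with placeholder paths: dump each segment's records (skipping
# raw content) into text files, then search them for a URL
# (the text file is typically named 'dump' inside each output dir,
# though this may vary by Nutch version)
for seg in crawl/segments/*; do
  bin/nutch readseg -dump "$seg" "dump_$(basename "$seg")" -nocontent
done
grep -l "1627406_Darwins_Catering_Limited" dump_*/dump
```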

regards




Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread caezar

Thanks, I checked; it was parsed. Still no answer as to why it was not indexed.


-- 
View this message in context: 
http://www.nabble.com/Nutch-indexes-less-pages%2C-then-it-fetches-tp26078798p26093230.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread reinhard schwab
hmm, i have no idea now.
check the reduce method in IndexerMapReduce and add some debug
statements there.
recompile nutch and try it again.




Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread caezar

In IndexerMapReduce.reduce there is this code:

if (CrawlDatum.STATUS_LINKED == datum.getStatus() ||
    CrawlDatum.STATUS_SIGNATURE == datum.getStatus()) {
  continue;
}

And the status of the redirect target's datum is indeed STATUS_LINKED. That's
why it's skipped. But what does this status mean?


-- 
View this message in context: 
http://www.nabble.com/Nutch-indexes-less-pages%2C-then-it-fetches-tp26078798p26093649.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread caezar

Some more information. Debugging the reduce method, I've noticed that before
the code

if (fetchDatum == null || dbDatum == null
    || parseText == null || parseData == null) {
  return; // only have inlinks
}

my page has fetchDatum, parseText and parseData non-null, but dbDatum is
null. That's why it's skipped :)
Any ideas about the reason?


-- 
View this message in context: 
http://www.nabble.com/Nutch-indexes-less-pages%2C-then-it-fetches-tp26078798p26093867.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread Andrzej Bialecki



Yes - you should run updatedb with this segment, and also run 
invertlinks with this segment, _before_ trying to index. Otherwise the 
db status won't be updated properly.
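To make that ordering concrete, a minimal sketch with placeholder paths (the segment timestamp is hypothetical, and exact command arguments may differ slightly across Nutch versions):

```shell
SEG=crawl/segments/20091028122455   # hypothetical segment name

bin/nutch fetch $SEG                      # fetch the generated segment
bin/nutch parse $SEG                      # parse, if the fetcher didn't parse inline
bin/nutch updatedb crawl/crawldb $SEG     # gives redirect targets a crawldb entry
bin/nutch invertlinks crawl/linkdb $SEG   # build inverted link data
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $SEG   # only now index
```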



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[ANNOUNCE] Lucene MeetUp in Oakland, CA - Tue Nov 3rd @ 8PM

2009-10-28 Thread Chris Hostetter


(cross-posted to many user lists; please confine replies to gene...@lucene)

There will be a Lucene meetup next week at ApacheCon in Oakland, CA on 
Tuesday, November 3rd. Meetups are free (the rest of the conference is 
not). See: http://wiki.apache.org/lucene-java/LuceneAtApacheConUs2009


For other meetups at ApacheCon, see
http://www.us.apachecon.com/c/acus2009/schedule/meetups.

Also, one last reminder, we have a lot of Lucene/Solr related content 
scheduled for ApacheCon including Lucene and Solr training (Monday and 
Tuesday), two full days of Lucene related talks on Thursday and Friday, 
plus the Meetup. I also know there will be a lot of Lucene ecosystem 
committers at AC this year, so it's a great way to interact with people 
working on your favorite Lucene projects. See http://www.us.apachecon.com 
for more info on the conference.



-Hoss (channeling Grant)


Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread caezar

I'm pretty sure that I ran both commands before indexing.


-- 
View this message in context: 
http://www.nabble.com/Nutch-indexes-less-pages%2C-then-it-fetches-tp26078798p26094770.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread caezar

I've compared the segment data of a URL which has no redirect and was
indexed correctly with this bad URL, and there really is a difference.
The first one has a db record in the segment:

Crawl Generate::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Wed Oct 28 16:01:05 EET 2009
Modified time: Thu Jan 01 02:00:00 EET 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1256738472613

But the second one has no such record, which in itself seems fine: it was not
added to the segment at the generate stage; it was added at the fetch stage. Is
this a bug in Nutch? Or am I missing some configuration option?


-- 
View this message in context: 
http://www.nabble.com/Nutch-indexes-less-pages%2C-then-it-fetches-tp26078798p26095338.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread reinhard schwab
is your problem solved now?

this can be ok.
newly discovered urls will be added to a segment when fetched documents
are parsed, and if these urls pass the filters.
they will not have a Crawl Generate datum because they are unknown until
they are extracted.

regards




Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread caezar

No, the problem is not solved. Everything happens as you described, but the
page is not indexed because of this condition in the IndexerMapReduce code:

if (fetchDatum == null || dbDatum == null
    || parseText == null || parseData == null) {
  return; // only have inlinks
}

For this page dbDatum is null, so it is not indexed!


-- 
View this message in context: 
http://www.nabble.com/Nutch-indexes-less-pages%2C-then-it-fetches-tp26078798p26095761.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread reinhard schwab
what is in the crawl db?

reinh...@thord:bin/nutch readdb crawldb -url <url>


caezar schrieb:
 No, problem is not solved. Everything happens as you described, but page is
 not indexed, because of condition:
 if (fetchDatum == null || dbDatum == null
 || parseText == null || parseData == null) {
   return; // only have inlinks
 }
 in IndexerMapReduce code. For this page dbDatum is null, so it is not
 indexed!

 reinhard schwab wrote:
   
 is your problem solved now???

 this can be ok.
 new discovered urls will be added to a segment when fetched documents
 are parsed and if these urls pass the filters.
 they will not have a crawl datum Generate because they are unknown until
 they are extracted.

 regards

 caezar schrieb:
 
 I've compared the segments data of the URL which have no redirect and was
 indexed correctly, with this bad URL, and there is really a difference.
 First one have db record in the segment:
 Crawl Generate::
 Version: 7
 Status: 1 (db_unfetched)
 Fetch time: Wed Oct 28 16:01:05 EET 2009
 Modified time: Thu Jan 01 02:00:00 EET 1970
 Retries since fetch: 0
 Retry interval: 2592000 seconds (30 days)
 Score: 1.0
 Signature: null
 Metadata: _ngt_: 1256738472613
  
 But second one have no such record, which seems pretty fine: it was not
 added to the segment on generate stage, it was added on the fetch stage.
 Is
 this a bug in Nutch? Or I'm missing some configuration option?

 caezar wrote:
   
   
 I'm pretty sure that I ran both commands before indexing

 Andrzej Bialecki wrote:
 
 
 caezar wrote:
   
   
 Some more information. Debugging reduce method I've noticed, that
 before
 code
 if (fetchDatum == null || dbDatum == null
 || parseText == null || parseData == null) {
   return; // only have inlinks
 }
 my page has fetchDatum, parseText and parseData not null, but dbDatum
 is
 null. Thats why it's skipped :) 
 Any ideas about the reason?
 
 
 Yes - you should run updatedb with this segment, and also run 
 invertlinks with this segment, _before_ trying to index. Otherwise the 
 db status won't be updated properly.


 -- 
 Best regards,
 Andrzej Bialecki 
   ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com
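
For reference, the fix Andrzej describes can be run from the command line before indexing. A sketch for Nutch 1.x (the crawl paths and segment name below are placeholders; the exact argument order can vary between versions, so check the usage output of bin/nutch):

```shell
# Update the crawl db with the fetch/parse results of the new segment,
# then invert links, and only then index. All paths are placeholders.
bin/nutch updatedb crawl/crawldb crawl/segments/20091026100146
bin/nutch invertlinks crawl/linkdb crawl/segments/20091026100146
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb \
    crawl/segments/20091026100146
```

After updatedb has run, the page's dbDatum is present in the crawl db and the IndexerMapReduce condition quoted above no longer skips it.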




Please, unsubscribe me

2009-10-28 Thread Nico Sabbi
Hi,
the unsubscription message doesn't work. Please, remove me from the
list.

Thanks.
 



How to specify in webapp where to find indexes?

2009-10-28 Thread Dmitriy Fundak
Hi,
I want to explicitly specify the location of the indexes folder.
As I understand it, I should set some property pointing to the index
location in the Configuration before calling the parse method on Query:
Query.parse(query, lang, conf)
Am I right?
Are there any other ways of doing it?
Thanks.


RE: Please, unsubscribe me

2009-10-28 Thread caoyuzhong

The unsubscription message does not work for me either.
Could you please help to remove me (caoyuzh...@hotmail.com) from the Nutch and
Hadoop mailing lists?

 Subject: Please, unsubscribe  me
 From: nsa...@officinedigitali.it
 To: nutch-user@lucene.apache.org
 Date: Wed, 28 Oct 2009 16:43:05 +0100
 
 Hi,
 the unsubscription message doesn't work. Please, remove me from the
 list.
 
 Thanks.
  
 
  
_
全新 Windows 7:寻找最适合您的 PC。了解详情。
http://www.microsoft.com/china/windows/buy/ 

RE: Please, unsubscribe me

2009-10-28 Thread Le Manh Cuong
Me too. Could you please help to remove me (cuong09m @gmail.com) from the
Nutch and Hadoop mailing lists?

-Original Message-
From: caoyuzhong [mailto:caoyuzh...@hotmail.com] 
Sent: Thursday, October 29, 2009 9:49 AM
To: nutch-user@lucene.apache.org
Subject: RE: Please, unsubscribe me


The unsubscription message does not work for me too.
Could you please help to remove me (caoyuzh...@hotmail.com) from the nutch
and hadoop mail list?

 Subject: Please, unsubscribe  me
 From: nsa...@officinedigitali.it
 To: nutch-user@lucene.apache.org
 Date: Wed, 28 Oct 2009 16:43:05 +0100
 
 Hi,
 the unsubscription message doesn't work. Please, remove me from the
 list.
 
 Thanks.
  
 
  
_
全新 Windows 7:寻找最适合您的 PC。了解详情。
http://www.microsoft.com/china/windows/buy/ 



Re: Please, unsubscribe me

2009-10-28 Thread SunGod
List-Help: mailto:nutch-user-h...@lucene.apache.org
List-Unsubscribe: mailto:nutch-user-unsubscr...@lucene.apache.org
List-Post: mailto:nutch-user@lucene.apache.org
List-Id: nutch-user.lucene.apache.org

2009/10/29 Le Manh Cuong cuong...@gmail.com

 Me too, Could you please help to remove me (cuong09m @gmail.com) from the
 nutch and hadoop mail list?

 -Original Message-
 From: caoyuzhong [mailto:caoyuzh...@hotmail.com]
 Sent: Thursday, October 29, 2009 9:49 AM
 To: nutch-user@lucene.apache.org
  Subject: RE: Please, unsubscribe me


 The unsubscription message does not work for me too.
 Could you please help to remove me (caoyuzh...@hotmail.com) from the nutch
 and hadoop mail list?

  Subject: Please, unsubscribe  me
  From: nsa...@officinedigitali.it
  To: nutch-user@lucene.apache.org
  Date: Wed, 28 Oct 2009 16:43:05 +0100
 
  Hi,
  the unsubscription message doesn't work. Please, remove me from the
  list.
 
  Thanks.
 
 

 _
 全新 Windows 7:寻找最适合您的 PC。了解详情。
 http://www.microsoft.com/china/windows/buy/




RE: Please, unsubscribe me

2009-10-28 Thread Le Manh Cuong
Sorry, but the last time I tried to unsubscribe it didn't work.
And it doesn't work now either. :)

-Original Message-
From: SunGod [mailto:sun...@cheemer.org] 
Sent: Thursday, October 29, 2009 10:09 AM
To: nutch-user@lucene.apache.org
Subject: Re: Please, unsubscribe me

List-Help: mailto:nutch-user-h...@lucene.apache.org
List-Unsubscribe: mailto:nutch-user-unsubscr...@lucene.apache.org
List-Post: mailto:nutch-user@lucene.apache.org
List-Id: nutch-user.lucene.apache.org

2009/10/29 Le Manh Cuong cuong...@gmail.com

 Me too, Could you please help to remove me (cuong09m @gmail.com) from the
 nutch and hadoop mail list?

 -Original Message-
 From: caoyuzhong [mailto:caoyuzh...@hotmail.com]
 Sent: Thursday, October 29, 2009 9:49 AM
 To: nutch-user@lucene.apache.org
  Subject: RE: Please, unsubscribe me


 The unsubscription message does not work for me too.
 Could you please help to remove me (caoyuzh...@hotmail.com) from the nutch
 and hadoop mail list?

  Subject: Please, unsubscribe  me
  From: nsa...@officinedigitali.it
  To: nutch-user@lucene.apache.org
  Date: Wed, 28 Oct 2009 16:43:05 +0100
 
  Hi,
  the unsubscription message doesn't work. Please, remove me from the
  list.
 
  Thanks.
 
 

 _
 全新 Windows 7:寻找最适合您的 PC。了解详情。
 http://www.microsoft.com/china/windows/buy/





Re: How to specify in webapp where to find indexes?

2009-10-28 Thread kevin chen
You can put a full path in nutch-site.xml
<property>
  <name>searcher.dir</name>
  <value>/full/path/crawl</value>
  <description>
  Path to root of crawl.  This directory is searched (in
  order) for either the file search-servers.txt, containing a list of
  distributed search servers, or the directory "index" containing
  merged indexes, or the directory "segments" containing segment
  indexes.
  </description>
</property>


On Wed, 2009-10-28 at 19:36 +0300, Dmitriy Fundak wrote:
 Hi,
 I want to explicitly specify location of indexes folder.
 As I understand I should specify some property pointing to index
 location in Configuration, before executing the parse method on Query:
 Query.parse(query, lang, conf)
 Am I right?
 Are there any other ways of doing it?
 Thanks.
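
The searcher.dir property can also be set programmatically instead of in nutch-site.xml. A minimal sketch, assuming the Nutch 1.x search API (NutchConfiguration, NutchBean, Query); class and method names may differ in your version, so treat this as illustrative:

```java
// Hypothetical sketch: point the searcher at an explicit crawl directory.
Configuration conf = NutchConfiguration.create();
conf.set("searcher.dir", "/full/path/crawl");   // same property as in nutch-site.xml

NutchBean bean = new NutchBean(conf);           // looks for index/segments under searcher.dir
Query query = Query.parse("apache", "en", conf);
Hits hits = bean.search(query, 10);             // fetch the top 10 hits
```

Setting the property on the Configuration before constructing the bean has the same effect as the XML property, since NutchBean reads searcher.dir from the Configuration it is given.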