Re: Nutch indexes fewer pages than it fetches
What is the db status of this URL in your crawl db? If it is STATUS_DB_NOTMODIFIED, that may be the reason. You can check it if you dump your crawl db with:

  bin/nutch readdb crawldb -url <url>

It has this status if it is recrawled and the signature does not change (the signature is an MD5 hash of the content). Another reason may be that you have some indexing filters, but I don't believe that is the reason here.

regards

kevin chen schrieb: I have a similar experience. Reinhard Schwab suggested a possible fix; see his mail in this group from Sun, 25 Oct 2009 10:03:41 +0100 (05:03 EDT). I haven't had a chance to try it out yet.

On Tue, 2009-10-27 at 07:34 -0700, caezar wrote: Hi all, I've got a strange problem: Nutch indexes far fewer URLs than it fetches. For example, the URL http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm. I assume it was fetched successfully, because the fetch log mentions it exactly once:

  2009-10-26 10:01:46,502 INFO org.apache.nutch.fetcher.Fetcher: fetching http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm

But it was not sent to the indexer during the indexing phase (I'm using a custom NutchIndexWriter, and it logs every page for which its write method is executed). What could the reason be? Is there a way to browse the crawldb to make sure the page was really fetched? What else could I check? Thanks
Re: Nutch indexes fewer pages than it fetches
Sorry, but how could I do this?

皮皮 wrote: Check the parse data first; maybe the parse was unsuccessful.

2009/10/27 caezar <caeza...@gmail.com>: [snip]

-- View this message in context: http://www.nabble.com/Nutch-indexes-less-pages%2C-then-it-fetches-tp26078798p26092612.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Nutch indexes fewer pages than it fetches
Thanks, that was really helpful. I've moved forward but still haven't found the solution. The status of the initial URL (http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm) is:

  Status: 5 (db_redir_perm)
  Metadata: _pst_: moved(12), lastModified=0: http://www.1stdirectory.com/Companies/1627406_Darwins_Catering_Limited.htm

So that answers why the initial page was not indexed: it was redirected. Now checking the status of the redirect target:

  Status: 2 (db_fetched)

So it was successfully fetched. But according to the indexing log, it still was not sent to the indexer!

reinhard schwab wrote: [snip]

-- View this message in context: http://www.nabble.com/Nutch-indexes-less-pages%2C-then-it-fetches-tp26078798p26092907.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Nutch indexes fewer pages than it fetches
Yes, it's permanently redirected. You can also check the segment status of this URL. Here is an example:

  bin/nutch readseg -get crawl/segments/20091028122455 "http://www.krems.at/fotoalbum/fotoalbum.asp?albumid=37&big=1&seitenid=20"

It will show you whether it was parsed and the extracted outlinks; it will show any data related to this URL stored in the segment.

regards

caezar schrieb: [snip]
Re: Nutch indexes fewer pages than it fetches
Thanks, checked: it was parsed. Still no answer why it was not indexed.

reinhard schwab wrote: [snip]

-- View this message in context: http://www.nabble.com/Nutch-indexes-less-pages%2C-then-it-fetches-tp26078798p26093230.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Nutch indexes fewer pages than it fetches
Hmm, I have no idea now. Check the reduce method in IndexerMapReduce and add some debug statements there, then recompile Nutch and try it again.

caezar schrieb: [snip]
Re: Nutch indexes fewer pages than it fetches
In IndexerMapReduce.reduce there is this code:

  if (CrawlDatum.STATUS_LINKED == datum.getStatus() ||
      CrawlDatum.STATUS_SIGNATURE == datum.getStatus()) {
    continue;
  }

And the status of the redirect target URL is indeed "linked"; that's why it's skipped. But what does this status mean?

reinhard schwab wrote: [snip]

-- View this message in context: http://www.nabble.com/Nutch-indexes-less-pages%2C-then-it-fetches-tp26078798p26093649.html
Sent from the Nutch - User mailing list archive at Nabble.com.
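The skip found above can be sketched in isolation. This is a self-contained mock, not Nutch's real code: the status constants and method name here are illustrative placeholders (the real constants live in org.apache.nutch.crawl.CrawlDatum, with different byte values); the point is only that "linked" and "signature" datums are bookkeeping entries that never carry indexable fetch state.

```java
// Illustrative stand-in for the status-based skip in IndexerMapReduce.reduce.
// A "linked" datum only records that some other page links to this URL;
// a "signature" datum only carries a content hash for deduplication.
public class StatusSkipSketch {

    // Placeholder status codes; Nutch's actual byte values differ.
    static final byte STATUS_LINKED = 1;
    static final byte STATUS_SIGNATURE = 2;
    static final byte STATUS_FETCH_SUCCESS = 3;

    /** Mirrors the quoted guard: link/signature datums are skipped. */
    static boolean isSkipped(byte status) {
        return status == STATUS_LINKED || status == STATUS_SIGNATURE;
    }

    public static void main(String[] args) {
        System.out.println(isSkipped(STATUS_LINKED));        // bookkeeping only
        System.out.println(isSkipped(STATUS_FETCH_SUCCESS)); // real fetch state
    }
}
```

So a URL whose only datum under the key is a "linked" entry contributes nothing to the index; it needs a real crawldb datum as well.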
Re: Nutch indexes fewer pages than it fetches
Some more information. Debugging the reduce method, I've noticed that before this code:

  if (fetchDatum == null || dbDatum == null
      || parseText == null || parseData == null) {
    return;                     // only have inlinks
  }

my page has fetchDatum, parseText and parseData not null, but dbDatum is null. That's why it's skipped :) Any ideas about the reason?

caezar wrote: [snip]

-- View this message in context: http://www.nabble.com/Nutch-indexes-less-pages%2C-then-it-fetches-tp26078798p26093867.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Nutch indexes fewer pages than it fetches
caezar wrote: [snip]

Yes: you should run updatedb with this segment, and also run invertlinks with this segment, _before_ trying to index. Otherwise the db status won't be updated properly.

-- Best regards, Andrzej Bialecki
Information Retrieval, Semantic Web; Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
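The advice above can be sketched the same way: indexing only emits a document when the crawldb, fetch, and parse sides of the segment join are all present, and it is updatedb that supplies the crawldb side for newly discovered pages such as redirect targets. A self-contained mock follows (hypothetical class and signature; the real check is the four-way null test in IndexerMapReduce.reduce):

```java
// Mock of the four-way null guard: a document is emitted only when the
// crawldb, fetch, and parse parts of the segment join are all present.
// Until updatedb merges a segment's fetch results (including redirect
// targets) back into the crawldb, dbDatum stays null and the URL is dropped.
public class IndexJoinSketch {

    static boolean shouldIndex(Object fetchDatum, Object dbDatum,
                               Object parseText, Object parseData) {
        // Mirrors: if (fetchDatum == null || dbDatum == null
        //              || parseText == null || parseData == null) return;
        return fetchDatum != null && dbDatum != null
                && parseText != null && parseData != null;
    }

    public static void main(String[] args) {
        Object present = new Object();
        // Redirect target before updatedb: fetched and parsed, no crawldb entry.
        System.out.println(shouldIndex(present, null, present, present));
        // The same page after updatedb has added it to the crawldb.
        System.out.println(shouldIndex(present, present, present, present));
    }
}
```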
[ANNOUNCE] Lucene MeetUp in Oakland, CA - Tue Nov 3rd @ 8PM
(cross posted to many user lists, please confine reply to gene...@lucene) There will be a Lucene meetup next week at ApacheCon in Oakland, CA on Tuesday, November 3rd. Meetups are free (the rest of the conference is not). See: http://wiki.apache.org/lucene-java/LuceneAtApacheConUs2009 For other meetups at ApacheCon, see http://www.us.apachecon.com/c/acus2009/schedule/meetups. Also, one last reminder, we have a lot of Lucene/Solr related content scheduled for ApacheCon including Lucene and Solr training (Monday and Tuesday), two full days of Lucene related talks on Thursday and Friday, plus the Meetup. I also know there will be a lot of Lucene ecosystem committers at AC this year, so it's a great way to interact with people working on your favorite Lucene projects. See http://www.us.apachecon.com for more info on the conference. -Hoss (channeling Grant)
Re: Nutch indexes fewer pages than it fetches
I'm pretty sure that I ran both commands before indexing.

Andrzej Bialecki wrote: [snip]

-- View this message in context: http://www.nabble.com/Nutch-indexes-less-pages%2C-then-it-fetches-tp26078798p26094770.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Nutch indexes fewer pages than it fetches
I've compared the segment data of a URL which had no redirect and was indexed correctly with this bad URL, and there really is a difference. The first one has a db record in the segment:

  Crawl Generate::
  Version: 7
  Status: 1 (db_unfetched)
  Fetch time: Wed Oct 28 16:01:05 EET 2009
  Modified time: Thu Jan 01 02:00:00 EET 1970
  Retries since fetch: 0
  Retry interval: 2592000 seconds (30 days)
  Score: 1.0
  Signature: null
  Metadata: _ngt_: 1256738472613

But the second one has no such record, which seems expected: it was not added to the segment at the generate stage; it was added at the fetch stage. Is this a bug in Nutch, or am I missing some configuration option?

caezar wrote: [snip]

-- View this message in context: http://www.nabble.com/Nutch-indexes-less-pages%2C-then-it-fetches-tp26078798p26095338.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Nutch indexes fewer pages than it fetches
Is your problem solved now? This can be OK: newly discovered URLs are added to a segment when fetched documents are parsed, provided these URLs pass the filters. They will not have a "Crawl Generate" datum because they are unknown until they are extracted.

regards

caezar schrieb: [snip]
Re: Nutch indexes fewer pages than it fetches
No, the problem is not solved. Everything happens as you described, but the page is not indexed because of this condition in the IndexerMapReduce code:

  if (fetchDatum == null || dbDatum == null
      || parseText == null || parseData == null) {
    return;                     // only have inlinks
  }

For this page dbDatum is null, so it is not indexed!

reinhard schwab wrote: [snip]

-- View this message in context: http://www.nabble.com/Nutch-indexes-less-pages%2C-then-it-fetches-tp26078798p26095761.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Nutch indexes fewer pages than it fetches
What is in the crawl db?

  bin/nutch readdb crawldb -url <url>

caezar schrieb: [snip]
Please, unsubscribe me
Hi, the unsubscription message doesn't work. Please, remove me from the list. Thanks.
How to specify in webapp where to find indexes?
Hi, I want to explicitly specify the location of the indexes folder. As I understand it, I should set some property pointing to the index location in the Configuration before calling the parse method on Query: Query.parse(query, lang, conf). Am I right? Are there any other ways of doing it? Thanks.
RE: Please, unsubscribe me
The unsubscription message does not work for me either. Could you please help remove me (caoyuzh...@hotmail.com) from the Nutch and Hadoop mailing lists?

Subject: Please, unsubscribe me
From: nsa...@officinedigitali.it
To: nutch-user@lucene.apache.org
Date: Wed, 28 Oct 2009 16:43:05 +0100

Hi, the unsubscription message doesn't work. Please, remove me from the list. Thanks.
RE: Please, unsubscribe me
Me too. Could you please help remove me (cuong09m@gmail.com) from the Nutch and Hadoop mailing lists?

-----Original Message-----
From: caoyuzhong [mailto:caoyuzh...@hotmail.com]
Sent: Thursday, October 29, 2009 9:49 AM
To: nutch-user@lucene.apache.org
Subject: RE: Please, unsubscribe me

[snip]
Re: Please, unsubscribe me
List-Help: mailto:nutch-user-h...@lucene.apache.org
List-Unsubscribe: mailto:nutch-user-unsubscr...@lucene.apache.org
List-Post: mailto:nutch-user@lucene.apache.org
List-Id: nutch-user.lucene.apache.org

2009/10/29 Le Manh Cuong <cuong...@gmail.com>: [snip]
RE: Please, unsubscribe me
Sorry, but the last time I tried to unsubscribe it didn't work, and it doesn't work now either :).

-----Original Message-----
From: SunGod [mailto:sun...@cheemer.org]
Sent: Thursday, October 29, 2009 10:09 AM
To: nutch-user@lucene.apache.org
Subject: Re: Please, unsubscribe me

[snip]
Re: How to specify in webapp where to find indexes?
You can put a full path in nutch-site.xml:

  <property>
    <name>searcher.dir</name>
    <value>/full/path/crawl</value>
    <description>
      Path to root of crawl. This directory is searched (in order) for either
      the file search-servers.txt, containing a list of distributed search
      servers, or the directory "index" containing merged indexes, or the
      directory "segments" containing segment indexes.
    </description>
  </property>

On Wed, 2009-10-28 at 19:36 +0300, Dmitriy Fundak wrote: [snip]