RE: Content of redirected urls empty
:( I really don't know what to do now! How did everyone before me resolve this problem?
Re: Content of redirected urls empty
Adam,

Could you please tell us what the http and https entries look like in the crawlDB (using readdb -url)?

J.

--
DigitalPebble Ltd
http://www.digitalpebble.com

On 13 March 2010 04:29, BELLINI ADAM mbel...@msn.com wrote:

> Does no one have an answer!?
>
> From: mbel...@msn.com
> To: nutch-user@lucene.apache.org; mille...@gmail.com
> Subject: RE: Content of redirected urls empty
> Date: Wed, 10 Mar 2010 21:01:54 +
>
> I read a lot of posts regarding redirected URLs but didn't find a solution!
>
> From: mbel...@msn.com
> To: nutch-user@lucene.apache.org; mille...@gmail.com
> Subject: RE: Content of redirected urls empty
> Date: Tue, 9 Mar 2010 16:59:05 +
>
> Hi,
> I don't know if you found a few minutes to look at my problem :) but I want to explain it again, maybe it wasn't clear:
> I have HTTP pages redirected to HTTPS (but it's the same URL): HTTP://page1.com redirected to HTTPS://page1.com
> The content of my HTTP page is empty.
> The content of my HTTPS page is not empty.
> In my segment I found both URLs (HTTP and HTTPS); the content of the HTTPS page is not empty, but in my index I found the HTTP one with the empty content.
> Is there a way to tell Nutch to index the URL with the non-empty content? Or why doesn't Nutch index the target URL rather than the empty (origin) one?
> Thanks a lot
>
> From: mbel...@msn.com
> To: nutch-user@lucene.apache.org
> Subject: RE: Content of redirected urls empty
> Date: Mon, 8 Mar 2010 17:08:06 +
>
> I'm sorry... I just checked twice... and in my index I have the original URL, which is the HTTP one with the empty content... but it doesn't index the HTTPS one. And I am using the Solr index.
> Thanks
>
> From: mbel...@msn.com
> To: nutch-user@lucene.apache.org
> Subject: RE: Content of redirected urls empty
> Date: Mon, 8 Mar 2010 17:01:34 +
>
> Hi,
> I've just dumped my segments and found that I have both URLs: the original one (HTTP) with an empty content and the redirected-to, or destination, URL (HTTPS) with non-empty content! But in my search I found only the HTTPS URL with an empty content!! Logically the content of the HTTPS URL is not empty! It's just mixing the HTTPS URL with the content of the HTTP one.
> Our redirect is done by Java code, response.sendRedirect(…), so it seems to be an HTTP redirect, right?
> Thanks for helping me :)
>
> Date: Mon, 8 Mar 2010 15:51:34 +0100
> From: a...@getopt.org
> To: nutch-user@lucene.apache.org
> Subject: Re: Content of redirected urls empty
>
> On 2010-03-08 14:55, BELLINI ADAM wrote:
>> Any ideas, guys??
>>
>> From: mbel...@msn.com
>> To: nutch-user@lucene.apache.org
>> Subject: Content of redirected urls empty
>> Date: Fri, 5 Mar 2010 22:01:05 +
>>
>> Hi,
>> The content of my redirected URLs is empty... but they still have the other metadata...
>> I have an http URL that is redirected to https. In my index I find the http URL but with an empty content...
>> Could you explain it please?
>
> There are two ways to redirect - one is with protocol, and the other is with content (either meta refresh, or javascript).
> When you dump the segment, is there really no content for the redirected url?
>
> --
> Best regards,
> Andrzej Bialecki
> Information Retrieval, Semantic Web
> Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
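Andrzej's distinction between a protocol redirect and a content redirect can be illustrated with a short sketch. This is not Nutch code: the function name `classify_redirect`, the simplified regular expression, and the plain-dict headers are assumptions made for the example, and JavaScript redirects are not detected at all. A servlet call to response.sendRedirect(…) answers with a 3xx status and a Location header, so it falls in the "protocol" branch.

```python
import re

# Simplified pattern for <meta http-equiv="refresh" content="0; url=...">.
# Illustrative assumption only; real HTML should go through a proper parser.
META_REFRESH = re.compile(
    r'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*'
    r'content=["\'][^"\']*url=([^"\'>\s]+)',
    re.IGNORECASE,
)


def classify_redirect(status, headers, body):
    """Return (kind, target); kind is 'protocol', 'content' or None."""
    # Protocol-level redirect: 3xx status code plus a Location header.
    location = headers.get("Location")
    if 300 <= status < 400 and location:
        return ("protocol", location)
    # Content-level redirect: the page is served normally, but its HTML
    # asks the client to move on through a meta refresh tag.
    match = META_REFRESH.search(body or "")
    if match:
        return ("content", match.group(1))
    return (None, None)


if __name__ == "__main__":
    print(classify_redirect(302, {"Location": "https://myDNS/index.html"}, ""))
    page = '<meta http-equiv="refresh" content="0; url=https://myDNS/index.html">'
    print(classify_redirect(200, {}, page))
```

In the case discussed in this thread, the crawlDB later reports the http entry as db_redir_temp, which is consistent with the protocol branch.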
RE: Content of redirected urls empty
Hi, thanks for your help. This is from a fresh crawl done today:

1- HTTP:

bin/nutch readdb crawl_portal/crawldb/ -url http://myDNS/index.html

URL: http://myDNS/index.html
Version: 7
Status: 4 (db_redir_temp)
Fetch time: Mon Mar 15 12:15:52 EDT 2010
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 36000 seconds (0 days)
Score: 0.018119827
Signature: null
Metadata: _pst_: temp_moved(13), lastModified=0: https://myDNS/index.html

2- HTTPS:

bin/nutch readdb crawl_portal/crawldb/ -url https://myDNS/index.html

URL: https://myDNS/index.html
Version: 7
Status: 2 (db_fetched)
Fetch time: Mon Mar 15 12:32:34 EDT 2010
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 36000 seconds (0 days)
Score: 0.00511379
Signature: 5f84dcec905c24e3e2af902ad9ad7398
Metadata: _pst_: success(1), lastModified=0_repr_: http://myDNS/index.html

And as I said the other day, in my segment the https has an empty content.

Thanks
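The fields that matter in the two crawlDB entries are Status, Signature and the _repr_ metadata. As a rough sketch, they can be pulled out of the readdb text with a few lines of Python. It assumes that readdb prints one "Name: value" field per line and that _repr_ is appended to the Metadata line as in the output quoted in this thread. `parse_readdb_entry` and `summarize` are hypothetical helpers, not part of Nutch.

```python
def parse_readdb_entry(text):
    """Parse 'Name: value' lines into a dict, splitting on the first ': '."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        if sep:
            fields[name.strip()] = value.strip()
    return fields


def summarize(fields):
    """Reduce an entry to the three facts discussed in the thread."""
    metadata = fields.get("Metadata", "")
    repr_url = None
    if "_repr_: " in metadata:
        # Everything after the _repr_ key is taken as the representative URL.
        repr_url = metadata.split("_repr_: ", 1)[1]
    return {
        "redirect": "db_redir" in fields.get("Status", ""),
        "has_signature": fields.get("Signature", "null") != "null",
        "repr_url": repr_url,
    }


if __name__ == "__main__":
    sample = "\n".join([
        "URL: https://myDNS/index.html",
        "Status: 2 (db_fetched)",
        "Signature: 5f84dcec905c24e3e2af902ad9ad7398",
        "Metadata: _pst_: success(1), lastModified=0_repr_: http://myDNS/index.html",
    ])
    print(summarize(parse_readdb_entry(sample)))
```

Applied to the https entry, this reports a non-redirect entry with a signature and http://myDNS/index.html as its representative URL, which are the two details the follow-up replies focus on.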
Re: Content of redirected urls empty
> and as i said the last day, on my segment the https has an empty content.

Hmm, that's not what you said in your previous message. Besides, I can see it has a signature in the crawlDB, so it must have a content. I expect that the content would be indexed under the http:// URL thanks to _repr_: http://myDNS/index.html
See BasicIndexingFilter for details.

> it's just mixing the HTTPS url with the content of the HTTP one.

It should be the other way round: the HTTPS content *with* the HTTP URL. Actually the http:// document is not sent to the index at all (see around line 86 in IndexerMapReduce), so what you are seeing in the index must be the https doc with _repr_ used as a URL.

Can you please confirm that:
1/ the segment has a content for the https:// doc
2/ you can find the http:// URL in the index and it has no content

HTH

Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com
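The behaviour Julien describes can be modeled in a few lines. The sketch below is a simplified Python model of his explanation, not the actual IndexerMapReduce or BasicIndexingFilter code: the dictionaries, the function `build_index_docs` and the key names `status` and `repr` are all assumptions for illustration.

```python
def build_index_docs(crawldb, segment_content):
    """Model: skip redirect sources, index fetched pages under _repr_."""
    docs = []
    for url, datum in crawldb.items():
        # Redirect sources (db_redir_temp, db_redir_perm) are never indexed.
        if datum["status"].startswith("db_redir"):
            continue
        if datum["status"] != "db_fetched":
            continue
        # A fetched page that carries a representative URL is indexed
        # under that URL, with the content fetched from its own URL.
        index_url = datum.get("repr", url)
        docs.append({"url": index_url,
                     "content": segment_content.get(url, "")})
    return docs


if __name__ == "__main__":
    crawldb = {
        "http://myDNS/index.html": {"status": "db_redir_temp"},
        "https://myDNS/index.html": {"status": "db_fetched",
                                     "repr": "http://myDNS/index.html"},
    }
    segment_content = {"https://myDNS/index.html": "page text"}
    print(build_index_docs(crawldb, segment_content))
```

Under this model the index would hold a single document: the https content filed under the http URL. That is the outcome Julien expects. It is not what Adam reports seeing, which is why Julien asks for the two confirmations.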
RE: Content of redirected urls empty
Oh sorry, I was mistaken again, and yes, you are completely right:

1- The HTTPS has a content in my segment.
2- The HTTP has an empty content.

In my index I have the HTTPS URL with the empty content (it's exactly what you said: it's just mixing the HTTPS URL with the content of the HTTP one), and I expected the other way round: the HTTPS content *with* the HTTP URL.

I don't know if I have the HTTP URL in my index; I don't know how to see all the indexed URLs in Solr. But I'm sure that when I perform a search using RMS I obtain only the HTTPS URL with an empty content (I guess it's the empty content of the HTTP one). But again, in the segment the content of the https is not empty.
RE: Content of redirected urls empty
Hi again,

I forgot to ask: what does _repr_ mean?
Re: Content of redirected urls empty
> in my index i have the HTTPS url with the empty content (...it's exactly what you said: it's just mixing the HTTPS url with the content of the HTTP one,) and i expected the other way round: the HTTPS content *with* the HTTP URL.

Strange.

> i dont know if i have the HTTP url in my index, i dont know how to see all the indexed URLS in SOLR.

Well, you could query on the hostname or the whole URL, I suppose. You could also index with Lucene and use Luke to debug the content of the index.

> but i'm sure that when a perform a search using RMS i obtain only the HTTPS url with an empty content (i guess it's the empty content of the HTTP one). but again in the segment the content of the https is not empty.

_repr_: representative - see class ReprUrlFixer
RE: Content of redirected urls empty
Hi, finaly i learned how to display only indexed URLs in the solr index the url is http://localhost:8080/solr/select/?q=*:*fl=url,content q=*:* is for all entries in the index fl=url,content display only urls and their content. Now i'm 100 % sure that i dont have the source HTTP urls in my index, i have only the target ones (HTTPS) with an empty content. i dont know if some one could explain why nutch is missing the content of redirected urls when indexing !!! Date: Mon, 15 Mar 2010 16:28:03 + Subject: Re: Content of redirected urls empty From: lists.digitalpeb...@gmail.com To: nutch-user@lucene.apache.org my index i have the HTTPS url with the empty content (...it's exactely what you said : it's just mixing the HTTPS url with the content of the HTTP one,) and i expected the other way round : the HTTPS content *with* the HTTP URL. strange i dont know if i have the HTTP url in my index, i dont know how to see all the indexed URLS in SOLR. well you could query on the hostname or the whole URL is suppose. You could also index with Lucene and use Luke to debug the content of the index but i'm sure that when a perform a search using RMS i obtain only the HTTPS url with an empty content (i guess it's the empty content of the HTTP one). but again in the segment the content of the https is not empty. _repr_ : representative - see class ReprUrlFixer Date: Mon, 15 Mar 2010 13:44:33 + Subject: Re: Content of redirected urls empty From: lists.digitalpeb...@gmail.com To: nutch-user@lucene.apache.org and as i said the last day, on my segment the https has an empty content. hmm it's not what you said in your previous message + I can see it has a signature in the crawlDB so it must have a content. I expect that the content would be indexed under the http:// URL thanks to *_repr_: **http://myDNS/index.html* See BasicIndexingFilter for details. it's just mixing the HTTPS url with the content of the HTTP one. it should be the other way round : the HTTPS content *with* the HTTP URL. 
Actually the http:// document is not sent to the index at all (see around line 86 in IndexerMapReduce 86) so what you are seeing in the index must be the https doc with _repr_ used as a URL. can you please confirm that : 1/ the segment has a content for the https:// doc 2/ you can find the http:// URL in the index and it has no content HTH Julien -- DigitalPebble Ltd http://www.digitalpebble.com On 15 March 2010 13:00, BELLINI ADAM mbel...@msn.com wrote: Hi thx for your help, this is a fresh crwal of today: 1- HTTP: bin/nutch readdb crawl_portal/crawldb/ -url http://myDNS/index.html URL: http://myDNS/index.html Version: 7 Status: 4 (db_redir_temp) Fetch time: Mon Mar 15 12:15:52 EDT 2010 Modified time: Wed Dec 31 19:00:00 EST 1969 Retries since fetch: 0 Retry interval: 36000 seconds (0 days) Score: 0.018119827 Signature: null Metadata: _pst_: temp_moved(13), lastModified=0: https://myDNS/index.html 2- HTTPS: bin/nutch readdb crawl_portal/crawldb/ -url https://myDNS/index.html URL: https://myDNS/index.html Version: 7 Status: 2 (db_fetched) Fetch time: Mon Mar 15 12:32:34 EDT 2010 Modified time: Wed Dec 31 19:00:00 EST 1969 Retries since fetch: 0 Retry interval: 36000 seconds (0 days) Score: 0.00511379 Signature: 5f84dcec905c24e3e2af902ad9ad7398 Metadata: _pst_: success(1), lastModified=0_repr_: http://myDNS/index.html and as i said the last day, on my segment the https has an empty content. thx Date: Mon, 15 Mar 2010 11:39:46 + Subject: Re: Content of redirected urls empty From: lists.digitalpeb...@gmail.com To: nutch-user@lucene.apache.org Adam, Could you please tell us what the http and https entries look like in the crawlDB (using readdb -url)? J. -- DigitalPebble Ltd http://www.digitalpebble.com On 13 March 2010 04:29, BELLINI ADAM mbel...@msn.com wrote: no one have an answer !? 
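Julien's explanation above can be restated as a small toy model. This is only a reading of his reply, not Nutch code: the CrawlEntry class and index_key function are invented for illustration, and the rule "a redirect entry is skipped, a fetched entry is indexed under its _repr_ URL if it has one" is an assumption taken from his description of IndexerMapReduce and BasicIndexingFilter. Adam reports seeing the HTTPS URL in the index instead, so the real behaviour may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrawlEntry:
    """Hypothetical, simplified view of one `readdb -url` record."""
    url: str
    status: str                       # e.g. "db_fetched" or "db_redir_temp"
    signature: Optional[str] = None   # None when nothing was fetched
    repr_url: Optional[str] = None    # value of the _repr_ metadata, if any


def index_key(entry: CrawlEntry) -> Optional[str]:
    """Return the URL the document would be indexed under, or None if skipped.

    Toy model of the behaviour described in the thread, not Nutch itself.
    """
    if entry.status != "db_fetched":
        return None                   # redirect entries are not sent to the index
    return entry.repr_url or entry.url


# The two crawlDB entries quoted in the message above.
http = CrawlEntry("http://myDNS/index.html", "db_redir_temp")
https = CrawlEntry(
    "https://myDNS/index.html",
    "db_fetched",
    signature="5f84dcec905c24e3e2af902ad9ad7398",
    repr_url="http://myDNS/index.html",
)

print(index_key(http))
print(index_key(https))
```

Under this model only one document reaches the index: the HTTPS page's content, keyed by the HTTP URL.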
RE: Content of redirected urls empty
No one has an answer!?

From: mbel...@msn.com
To: nutch-user@lucene.apache.org; mille...@gmail.com
Subject: RE: Content of redirected urls empty
Date: Wed, 10 Mar 2010 21:01:54 +0000

I read a lot of posts regarding redirected URLs but didn't find a solution!
RE: Content of redirected urls empty
I read a lot of posts regarding redirected URLs but didn't find a solution!

From: mbel...@msn.com
To: nutch-user@lucene.apache.org; mille...@gmail.com
Subject: RE: Content of redirected urls empty
Date: Tue, 9 Mar 2010 16:59:05 +0000

Hi,

I don't know if you found a few minutes to look at my problem :) but I want to explain it again, maybe it wasn't clear:

I have HTTP pages redirected to HTTPS (but it's the same URL):
HTTP://page1.com redirected to HTTPS://page1.com

The content of my HTTP page is empty.
The content of my HTTPS page is not empty.

In my segment I found both URLs (HTTP and HTTPS), and the content of the HTTPS page is not empty. But in my index I found the HTTP one with the empty content.

Is there a way to tell Nutch to index the URL with the non-empty content? Or why doesn't Nutch index the target URL rather than the empty (origin) one?

Thx a lot

From: mbel...@msn.com
To: nutch-user@lucene.apache.org
Subject: RE: Content of redirected urls empty
Date: Mon, 8 Mar 2010 17:08:06 +0000

I'm sorry... I just checked twice, and in my index I have the original URL, which is the HTTP one with the empty content, but it doesn't index the HTTPS one. And I'm using the Solr index.

Thx
RE: Content of redirected urls empty
Any ideas, guys?

From: mbel...@msn.com
To: nutch-user@lucene.apache.org
Subject: Content of redirected urls empty
Date: Fri, 5 Mar 2010 22:01:05 +0000

Hi,

The content of my redirected URLs is empty, but they still have the other metadata. I have an HTTP URL that is redirected to HTTPS. In my index I find the HTTP URL but with an empty content. Could you explain it, please?
Re: Content of redirected urls empty
On 2010-03-08 14:55, BELLINI ADAM wrote:

Any ideas, guys?

From: mbel...@msn.com
To: nutch-user@lucene.apache.org
Subject: Content of redirected urls empty
Date: Fri, 5 Mar 2010 22:01:05 +0000

Hi,

The content of my redirected URLs is empty, but they still have the other metadata. I have an HTTP URL that is redirected to HTTPS. In my index I find the HTTP URL but with an empty content. Could you explain it, please?

There are two ways to redirect - one is with protocol, and the other is with content (either meta refresh, or javascript). When you dump the segment, is there really no content for the redirected URL?

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
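The distinction Andrzej draws can be sketched in a few lines of Python. This is an illustrative classifier only, with a made-up function name (redirect_kind). It treats a 3xx status with a Location header as a protocol redirect and a meta refresh tag in the body as a content redirect. JavaScript redirects are not detected, since that would require executing the page.

```python
import re

# Matches an HTML <meta http-equiv="refresh" ...> tag, case-insensitively.
META_REFRESH = re.compile(r'<meta[^>]+http-equiv=["\']?refresh', re.IGNORECASE)


def redirect_kind(status, headers, body=""):
    """Classify a response as a "protocol" redirect, a "content" redirect, or "none".

    Hypothetical helper: `status` is the HTTP status code, `headers` a dict
    of response headers, `body` the page text (may be empty).
    """
    has_location = any(name.lower() == "location" for name in headers)
    if 300 <= status < 400 and has_location:
        return "protocol"
    if META_REFRESH.search(body):
        return "content"
    return "none"


print(redirect_kind(302, {"Location": "https://myDNS/index.html"}))
```

A servlet's response.sendRedirect(...) answers with a 302 and a Location header, so it falls in the "protocol" case, which matches the db_redir_temp status shown by readdb elsewhere in this thread.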
RE: Content of redirected urls empty
Hi,

I've just dumped my segments and found that I have both URLs: the original one (HTTP) with an empty content, and the redirected-to, or destination, URL (HTTPS) with non-empty content! But in my search I found only the HTTPS URL with an empty content! Logically the content of the HTTPS URL is not empty! It's just mixing the HTTPS URL with the content of the HTTP one.

Our redirect is done by Java code, response.sendRedirect(…), so it seems to be an HTTP redirect, right?

Thx for helping me :)

Date: Mon, 8 Mar 2010 15:51:34 +0100
From: a...@getopt.org
To: nutch-user@lucene.apache.org
Subject: Re: Content of redirected urls empty

There are two ways to redirect - one is with protocol, and the other is with content (either meta refresh, or javascript). When you dump the segment, is there really no content for the redirected URL?
RE: Content of redirected urls empty
I'm sorry... I just checked twice, and in my index I have the original URL, which is the HTTP one with the empty content, but it doesn't index the HTTPS one. And I'm using the Solr index.

Thx

From: mbel...@msn.com
To: nutch-user@lucene.apache.org
Subject: RE: Content of redirected urls empty
Date: Mon, 8 Mar 2010 17:01:34 +0000

Hi,

I've just dumped my segments and found that I have both URLs: the original one (HTTP) with an empty content, and the redirected-to, or destination, URL (HTTPS) with non-empty content! But in my search I found only the HTTPS URL with an empty content! Logically the content of the HTTPS URL is not empty! It's just mixing the HTTPS URL with the content of the HTTP one.

Our redirect is done by Java code, response.sendRedirect(…), so it seems to be an HTTP redirect, right?

Thx for helping me :)
Content of redirected urls empty
Hi,

The content of my redirected URLs is empty, but they still have the other metadata. I have an HTTP URL that is redirected to HTTPS. In my index I find the HTTP URL but with an empty content. Could you explain it, please?