RE: Content of redirected urls empty

2010-03-18 Thread BELLINI ADAM


:( I really don't know what to do now! How did everyone before me resolve this
problem?






Re: Content of redirected urls empty

2010-03-15 Thread Julien Nioche
Adam,

Could you please tell us what the http and https entries look like in the
crawlDB (using readdb -url)?

J.
-- 
DigitalPebble Ltd
http://www.digitalpebble.com

On 13 March 2010 04:29, BELLINI ADAM mbel...@msn.com wrote:


 Does no one have an answer!?





  From: mbel...@msn.com
  To: nutch-user@lucene.apache.org; mille...@gmail.com
  Subject: RE: Content of redirected urls empty
  Date: Wed, 10 Mar 2010 21:01:54 +
 
 
  I read a lot of posts regarding redirected URLs but didn't find a solution!
 
 
 
 
 
   From: mbel...@msn.com
   To: nutch-user@lucene.apache.org; mille...@gmail.com
   Subject: RE: Content of redirected urls empty
   Date: Tue, 9 Mar 2010 16:59:05 +
  
  
  
   Hi,
  
   I don't know if you found a few minutes to look at my problem :)
  
   but I want to explain it again, maybe it wasn't clear:
  
  
   I have HTTP pages redirected to HTTPS (but it's the same URL):
  
   HTTP://page1.com is redirected to HTTPS://page1.com
  
   The content of my HTTP page is empty.
   The content of my HTTPS page is not empty.
  
   In my segment I found both URLs (HTTP and HTTPS), and the content
   of the HTTPS page is not empty,
  
   but in my index I found the HTTP one with the empty content.
  
   Is there a way to tell Nutch to index the URL with the non-empty
   content? Or why doesn't Nutch index the target URL rather than indexing the
   empty (origin) one?
  
   Thanks a lot
  
  
  
  
  
From: mbel...@msn.com
To: nutch-user@lucene.apache.org
Subject: RE: Content of redirected urls empty
Date: Mon, 8 Mar 2010 17:08:06 +
   
   
I'm sorry... I just double-checked, and in my index I have the
 original URL, which is the HTTP one with the empty content, but it doesn't
 index the HTTPS one. And I am using the Solr index.
Thanks
   
   
   
 From: mbel...@msn.com
 To: nutch-user@lucene.apache.org
 Subject: RE: Content of redirected urls empty
 Date: Mon, 8 Mar 2010 17:01:34 +




 Hi, I've just dumped my segments and found that I have both URLs:
 the original one (HTTP) with empty content and the redirected-to, or
 destination, URL (HTTPS) with NON-EMPTY content!

 But in my search I found only the HTTPS URL, with empty content!
 Logically the content of the HTTPS URL is not empty;
 it's just mixing the HTTPS URL with the content of the HTTP one.


 Our redirect is done by Java code, response.sendRedirect(…), so it
 seems to be an HTTP redirect, right?

 Thanks for helping me :)


  Date: Mon, 8 Mar 2010 15:51:34 +0100
  From: a...@getopt.org
  To: nutch-user@lucene.apache.org
  Subject: Re: Content of redirected urls empty
 
  On 2010-03-08 14:55, BELLINI ADAM wrote:
  
  
   Does anyone have any idea?
  
  
   From: mbel...@msn.com
   To: nutch-user@lucene.apache.org
   Subject: Content of redirected urls empty
   Date: Fri, 5 Mar 2010 22:01:05 +
  
  
  
   Hi,
   the content of my redirected URLs is empty, but they still have
   the other metadata.
   I have an HTTP URL that is redirected to HTTPS.
   In my index I find the HTTP URL but with empty content.
   Could you explain it, please?
 
  There are two ways to redirect - one is with protocol, and the other is
  with content (either meta refresh, or javascript).
 
  When you dump the segment, is there really no content for the redirected
  url?
 
 
  --
  Best regards,
  Andrzej Bialecki 
___. ___ ___ ___ _ _   __
  [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
  ___|||__||  \|  ||  |  Embedded Unix, System Integration
  http://www.sigram.com  Contact: info at sigram dot com
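
The two kinds of redirect mentioned above can be told apart roughly as in the following Python sketch. It is only an illustration under stated assumptions: the function name `redirect_kind` and the regular expression are hypothetical, it is not how Nutch detects redirects, and it does not try to detect JavaScript redirects at all.

```python
import re

# Hypothetical pattern: matches a <meta http-equiv="refresh" ...> tag.
META_REFRESH = re.compile(r'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*>', re.IGNORECASE)

def redirect_kind(status_code, headers, body):
    """Classify a response as a 'protocol' redirect, a 'content' redirect, or 'none'."""
    # Protocol-level redirect: 3xx status code plus a Location header.
    if 300 <= status_code < 400 and "Location" in headers:
        return "protocol"
    # Content-level redirect: the page itself asks the client to move on.
    if META_REFRESH.search(body):
        return "content"
    return "none"

print(redirect_kind(302, {"Location": "https://page1.com"}, ""))
print(redirect_kind(200, {}, '<meta http-equiv="refresh" content="0; url=https://page1.com">'))
```

A redirect issued with response.sendRedirect(…) from the Servlet API is a temporary (302) redirect, so it would fall in the "protocol" case.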
 




RE: Content of redirected urls empty

2010-03-15 Thread BELLINI ADAM

Hi,
thanks for your help.

This is from a fresh crawl today:


1- HTTP:
bin/nutch readdb crawl_portal/crawldb/ -url http://myDNS/index.html

URL: http://myDNS/index.html
Version: 7
Status: 4 (db_redir_temp)
Fetch time: Mon Mar 15 12:15:52 EDT 2010
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 36000 seconds (0 days)
Score: 0.018119827
Signature: null
Metadata: _pst_: temp_moved(13), lastModified=0: https://myDNS/index.html




2- HTTPS: 
bin/nutch readdb crawl_portal/crawldb/ -url https://myDNS/index.html

URL: https://myDNS/index.html
Version: 7
Status: 2 (db_fetched)
Fetch time: Mon Mar 15 12:32:34 EDT 2010
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 36000 seconds (0 days)
Score: 0.00511379
Signature: 5f84dcec905c24e3e2af902ad9ad7398
Metadata: _pst_: success(1), lastModified=0_repr_: http://myDNS/index.html






And as I said the other day, in my segment the https page has empty content.

thx
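
For comparing the two crawlDB entries, a small Python sketch like the one below can turn the `readdb -url` text into a dictionary. The helper names are hypothetical, the sample strings are shortened copies of the output in this message, and treating a non-null signature as a sign of fetched content is only a heuristic, not an official Nutch rule.

```python
def parse_readdb_entry(text):
    """Parse 'Key: value' lines of `bin/nutch readdb ... -url` output into a dict."""
    entry = {}
    for line in text.splitlines():
        # Split on the first ': ' only, so URLs and dates keep their own colons.
        key, sep, value = line.strip().partition(": ")
        if sep:
            entry[key] = value
    return entry

def looks_fetched_with_content(entry):
    """Heuristic: status 2 (db_fetched) and a non-null signature."""
    return entry.get("Status", "").startswith("2 ") and entry.get("Signature") != "null"

HTTP_ENTRY = """URL: http://myDNS/index.html
Status: 4 (db_redir_temp)
Signature: null"""

HTTPS_ENTRY = """URL: https://myDNS/index.html
Status: 2 (db_fetched)
Signature: 5f84dcec905c24e3e2af902ad9ad7398"""

for raw in (HTTP_ENTRY, HTTPS_ENTRY):
    entry = parse_readdb_entry(raw)
    print(entry["URL"], looks_fetched_with_content(entry))
```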



Re: Content of redirected urls empty

2010-03-15 Thread Julien Nioche

 and as i said the last day, on my segment the https has an empty content.


Hmm, that's not what you said in your previous message, and I can see it has a
signature in the crawlDB, so it must have content.

I expect that the content would be indexed under the http:// URL thanks to
_repr_: http://myDNS/index.html

See BasicIndexingFilter for details.

it's just mixing the HTTPS url with the content of the HTTP one.


It should be the other way round: the HTTPS content *with* the HTTP URL.
Actually the http:// document is not sent to the index at all (see around
line 86 in IndexerMapReduce), so what you are seeing in the index must be
the https doc with _repr_ used as a URL.

can you please confirm that :
1/ the segment has a content for the https:// doc
2/ you can find the http:// URL in the index and it has no content

HTH

Julien

-- 
DigitalPebble Ltd
http://www.digitalpebble.com
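
The expected behaviour described in this message can be modelled with a short Python sketch. This is a simplified illustration and not Nutch's actual code: entries that are only redirects are skipped, and a fetched document is indexed under its _repr_ URL when one is present. The function name, the dictionary keys and the sample content are hypothetical.

```python
def documents_to_index(entries):
    """Simplified model: skip redirect sources, index fetched docs under their repr URL."""
    docs = []
    for entry in entries:
        if entry["status"] != "db_fetched":
            continue  # e.g. db_redir_temp: the redirect source is not sent to the index
        # Use the representative URL if one was recorded, else the entry's own URL.
        docs.append({"url": entry.get("repr", entry["url"]), "content": entry["content"]})
    return docs

entries = [
    {"url": "http://myDNS/index.html", "status": "db_redir_temp", "content": ""},
    {"url": "https://myDNS/index.html", "status": "db_fetched",
     "repr": "http://myDNS/index.html", "content": "page text"},
]

print(documents_to_index(entries))
```

Under this model the index would hold a single document: the HTTPS content listed under the HTTP URL. An index holding the HTTPS URL with empty content does not fit the model, which is what makes the reported result surprising.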

RE: Content of redirected urls empty

2010-03-15 Thread BELLINI ADAM



Oh sorry, I was mistaken again, and yes, you are completely right:
1- The HTTPS page has content in my segment.
2- The HTTP page has empty content.

In my index I have the HTTPS URL with the empty content (it's exactly
what you said: it's just mixing the HTTPS URL with
the content of the HTTP one), and I expected the other way round: the
HTTPS content *with* the HTTP URL.


I don't know if I have the HTTP URL in my index; I don't know how to see all the
indexed URLs in Solr. But I'm sure that when I perform a search using RMS I
obtain only the HTTPS URL with empty content (I guess it's the empty content
of the HTTP one).
But again, in the segment the content of the HTTPS page is not empty.




RE: Content of redirected urls empty

2010-03-15 Thread BELLINI ADAM


Hi again,

I forgot to ask: what does _repr_ mean?




Re: Content of redirected urls empty

2010-03-15 Thread Julien Nioche
 my index i have the HTTPS  url with the empty content (...it's exactely
 what you said : it's just mixing the HTTPS url with
 the content of the HTTP one,) and i expected the other way round : the
 HTTPS content *with* the HTTP URL.


strange



 i dont know if i have the HTTP url in my index, i dont know how to see all
 the indexed URLS in SOLR.


Well, you could query on the hostname or the whole URL, I suppose.

You could also index with Lucene and use Luke to debug the content of the
index


 but i'm sure that when a perform a search using RMS i obtain only the HTTPS
 url with an empty content (i guess it's the empty content of the HTTP one).
 but again in the segment the content of the https is not empty.


_repr_  : representative - see class ReprUrlFixer










RE: Content of redirected urls empty

2010-03-15 Thread BELLINI ADAM

Hi,

I finally learned how to display only the indexed URLs in the Solr index.

The URL is http://localhost:8080/solr/select/?q=*:*&fl=url,content

q=*:* matches all entries in the index.
fl=url,content displays only the URLs and their content.
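
A minimal sketch of building that query URL in Python, so that the parameters are joined with `&` and URL-encoded. The host, port and path are the ones from this message; the helper name `solr_select_url` is an assumption for illustration, and `rows` is added only because Solr returns a limited number of documents per request by default.

```python
from urllib.parse import urlencode

# Select handler taken from the message above; adjust host and port for your setup.
SOLR_SELECT = "http://localhost:8080/solr/select/"

def solr_select_url(query="*:*", fields=("url", "content"), rows=10):
    """Build a Solr select URL; parameters are joined with '&' and URL-encoded."""
    params = {
        "q": query,              # *:* matches every document in the index
        "fl": ",".join(fields),  # restrict the fields returned for each document
        "rows": rows,            # number of documents returned per request
    }
    return SOLR_SELECT + "?" + urlencode(params)

if __name__ == "__main__":
    print(solr_select_url())
```

Raising `rows` (for example `solr_select_url(rows=1000)`) helps when checking whether a particular URL is present among all indexed documents.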


Now I'm 100% sure that I don't have the source HTTP URLs in my index; I have
only the target ones (HTTPS), with empty content.



I don't know if someone could explain why Nutch is missing the content of
redirected URLs when indexing!



 Date: Mon, 15 Mar 2010 16:28:03 +
 Subject: Re: Content of redirected urls empty
 From: lists.digitalpeb...@gmail.com
 To: nutch-user@lucene.apache.org
 
  my index i have the HTTPS  url with the empty content (...it's exactely
  what you said : it's just mixing the HTTPS url with
  the content of the HTTP one,) and i expected the other way round : the
  HTTPS content *with* the HTTP URL.
 
 
 strange
 
 
 
  i dont know if i have the HTTP url in my index, i dont know how to see all
  the indexed URLS in SOLR.
 
 
 well you could query on the hostname or the whole URL, I suppose.
 
 You could also index with Lucene and use Luke to debug the content of the
 index
 
 
  but i'm sure that when a perform a search using RMS i obtain only the HTTPS
  url with an empty content (i guess it's the empty content of the HTTP one).
  but again in the segment the content of the https is not empty.
 
 
 _repr_  : representative - see class ReprUrlFixer
 
 
 
 
 
 
 
 
 
   Date: Mon, 15 Mar 2010 13:44:33 +
   Subject: Re: Content of redirected urls empty
   From: lists.digitalpeb...@gmail.com
   To: nutch-user@lucene.apache.org
  
   
and as i said the last day, on my segment the https has an empty
  content.
  
  
   hmm it's not what you said in your previous message + I can see it has a
   signature in the crawlDB so it must have a content.
  
   I expect that the content would be indexed under the http:// URL thanks to
   _repr_: http://myDNS/index.html
  
   See BasicIndexingFilter for details.
  
   it's just mixing the HTTPS url with the content of the HTTP one.
  
  
   it should be the other way round : the HTTPS content *with* the HTTP URL.
   Actually the http:// document is not sent to the index at all (see around
   line 86 in IndexerMapReduce), so what you are seeing in the index must be
   the https doc with _repr_ used as a URL.
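
The reasoning above can be mimicked with a toy model. This is only a sketch of what this message describes, not Nutch code: the record layout, the function name `docs_sent_to_index` and the sample page text are assumptions, while the statuses and the `_repr_` value come from the readdb output quoted in this thread.

```python
# Toy model of the indexing step as described in the message above (not Nutch code).
def docs_sent_to_index(crawldb, segment_content):
    """Return {indexed_url: content} for fetched entries, using _repr_ as the URL if present."""
    indexed = {}
    for url, entry in crawldb.items():
        if entry["status"] != "db_fetched":
            continue  # redirect sources (e.g. db_redir_temp) are not sent to the index
        indexed_url = entry["metadata"].get("_repr_", url)
        indexed[indexed_url] = segment_content.get(url, "")
    return indexed

crawldb = {
    "http://myDNS/index.html": {"status": "db_redir_temp", "metadata": {}},
    "https://myDNS/index.html": {
        "status": "db_fetched",
        "metadata": {"_repr_": "http://myDNS/index.html"},
    },
}
# Hypothetical parsed text for the HTTPS page; the real content is not shown in the thread.
segment_content = {"https://myDNS/index.html": "page text"}

if __name__ == "__main__":
    print(docs_sent_to_index(crawldb, segment_content))
```

Under this model the expected result is a single document: the HTTPS content stored under the HTTP URL. The observation reported in the thread (an HTTPS URL with empty content) does not match it, which is why the two confirmations below are requested.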
  
   can you please confirm that :
   1/ the segment has a content for the https:// doc
   2/ you can find the http:// URL in the index and it has no content
  
   HTH
  
   Julien
  
   --
   DigitalPebble Ltd
   http://www.digitalpebble.com
   On 15 March 2010 13:00, BELLINI ADAM mbel...@msn.com wrote:
  
   
Hi
thx for your help,
   
this is a fresh crwal of today:
   
   
1- HTTP:
bin/nutch readdb crawl_portal/crawldb/ -url http://myDNS/index.html
   
URL: http://myDNS/index.html
Version: 7
Status: 4 (db_redir_temp)
Fetch time: Mon Mar 15 12:15:52 EDT 2010
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 36000 seconds (0 days)
Score: 0.018119827
Signature: null
Metadata: _pst_: temp_moved(13), lastModified=0: https://myDNS/index.html
   
   
   
   
2- HTTPS:
bin/nutch readdb crawl_portal/crawldb/ -url https://myDNS/index.html
   
URL: https://myDNS/index.html
Version: 7
Status: 2 (db_fetched)
Fetch time: Mon Mar 15 12:32:34 EDT 2010
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 36000 seconds (0 days)
Score: 0.00511379
Signature: 5f84dcec905c24e3e2af902ad9ad7398
Metadata: _pst_: success(1), lastModified=0_repr_: http://myDNS/index.html
   
   
   
   
   
   
and as i said the last day, on my segment the https has an empty
  content.
   
thx
   
   
 Date: Mon, 15 Mar 2010 11:39:46 +
 Subject: Re: Content of redirected urls empty
 From: lists.digitalpeb...@gmail.com
 To: nutch-user@lucene.apache.org

 Adam,

 Could you please tell us what the http and https entries look like in the
 crawlDB (using readdb -url)?

 J.
 --
 DigitalPebble Ltd
 http://www.digitalpebble.com

 On 13 March 2010 04:29, BELLINI ADAM mbel...@msn.com wrote:

 
  no one have an answer !?
 
 
 
 
 
   From: mbel...@msn.com
   To: nutch-user@lucene.apache.org; mille...@gmail.com
   Subject: RE: Content of redirected urls empty
   Date: Wed, 10 Mar 2010 21:01:54 +
  
  
   i read lotoff post regarding redirected urls but didnt find a
sollution !
  
  
  
  
  
From: mbel...@msn.com
To: nutch-user@lucene.apache.org; mille...@gmail.com
Subject: RE: Content of redirected urls empty
Date: Tue, 9 Mar 2010 16:59:05 +
   
   
   
hi,
   
i dont know if you did find few minutes to see my problem

RE: Content of redirected urls empty

2010-03-12 Thread BELLINI ADAM

Does no one have an answer?!





 From: mbel...@msn.com
 To: nutch-user@lucene.apache.org; mille...@gmail.com
 Subject: RE: Content of redirected urls empty
 Date: Wed, 10 Mar 2010 21:01:54 +
 
 
 i read lotoff post regarding redirected urls but didnt find a sollution !
 
 
 
 
 
  From: mbel...@msn.com
  To: nutch-user@lucene.apache.org; mille...@gmail.com
  Subject: RE: Content of redirected urls empty
  Date: Tue, 9 Mar 2010 16:59:05 +
  
  
  
  hi,
  
  i dont know if you did find few minutes to see my problem :)
  
  but i want to explain it again, mabe it wasnt clear :
  
  
  i have HTTP  pages redirected to HTTPS   (but it's the same URL):
  
  HTTP://page1.com   redirrected to HTTPS://page1.com
  
  the content of my page HTTP is empty.
  the content of my page HTTPS is not empty
  
  in my segment i found botch the 2 URLS (HTTP and HTTPS ) , the content of 
  HTTPS page is not empty
  
  but in my index i found the HTTP one with the empty content.
  
  is there a maner to tell to nutch to index the url with the non empty 
  content? or why nutch doesnt index the target URL rather than indexing the 
  empty (origin) one ??
  
  thx a lot
  
  
  
  
  
   From: mbel...@msn.com
   To: nutch-user@lucene.apache.org
   Subject: RE: Content of redirected urls empty
   Date: Mon, 8 Mar 2010 17:08:06 +
   
   
   i'm sorry...i just checked twice...and in my index i have the original 
   URL, which is  the HTTP one with the empty content...but it dosent index 
   the HTTPS oneand i using solr index
   thx
   
   
   
From: mbel...@msn.com
To: nutch-user@lucene.apache.org
Subject: RE: Content of redirected urls empty
Date: Mon, 8 Mar 2010 17:01:34 +




Hi, i'v just dumped my segments and found that i have both 2 URLS, the 
original one (HTTP) with an empty content and the REDIRCTED TO or the 
DESTINATION URL (HTTPS) with NON EMPTY content !

but in my search i found only the HTTPS URL with an empty content !! 
logically the content of the HTTPS  URL is not empty !
it's just mixing the HTTPS url with the content of the HTTP one.


our redirect is done by java code  response.sendRedirect(…), so it 
seams to be http redirect right ??

thx for helping me :)


 Date: Mon, 8 Mar 2010 15:51:34 +0100
 From: a...@getopt.org
 To: nutch-user@lucene.apache.org
 Subject: Re: Content of redirected urls empty
 
 On 2010-03-08 14:55, BELLINI ADAM wrote:
 
 
  is there any idea guys ??
 
 
  From: mbel...@msn.com
  To: nutch-user@lucene.apache.org
  Subject: Content of redirected urls empty
  Date: Fri, 5 Mar 2010 22:01:05 +
 
 
 
  hi,
  the content of my redirected urls is empty...but still have the 
  other metadata...
  i have an http urls that is redirected to https.
  in my index i find the http URL but with an empty content...
  could you explain it plz?
 
 There are two ways to redirect - one is with protocol, and the other 
 is 
 with content (either meta refresh, or javascript).
 
 When you dump the segment, is there really no content for the 
 redirected 
 url?
 
 
 -- 
 Best regards,
 Andrzej Bialecki 
   ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com
 
  

RE: Content of redirected urls empty

2010-03-10 Thread BELLINI ADAM

I read a lot of posts regarding redirected URLs but didn't find a solution!





 From: mbel...@msn.com
 To: nutch-user@lucene.apache.org; mille...@gmail.com
 Subject: RE: Content of redirected urls empty
 Date: Tue, 9 Mar 2010 16:59:05 +
 
 
 
 hi,
 
 I don't know if you found a few minutes to look at my problem :)
 
 but I want to explain it again, maybe it wasn't clear:
 
 
 I have HTTP pages redirected to HTTPS (but it's the same URL):
 
 HTTP://page1.com is redirected to HTTPS://page1.com
 
 The content of my HTTP page is empty.
 The content of my HTTPS page is not empty.
 
 In my segment I found both URLs (HTTP and HTTPS); the content of the
 HTTPS page is not empty,
 
 but in my index I found the HTTP one with the empty content.
 
 Is there a way to tell Nutch to index the URL with the non-empty
 content? Or why doesn't Nutch index the target URL rather than indexing the
 empty (origin) one?
 
 thx a lot
 
 
 
 
 
  From: mbel...@msn.com
  To: nutch-user@lucene.apache.org
  Subject: RE: Content of redirected urls empty
  Date: Mon, 8 Mar 2010 17:08:06 +
  
  
  i'm sorry...i just checked twice...and in my index i have the original URL, 
  which is  the HTTP one with the empty content...but it dosent index the 
  HTTPS oneand i using solr index
  thx
  
  
  
   From: mbel...@msn.com
   To: nutch-user@lucene.apache.org
   Subject: RE: Content of redirected urls empty
   Date: Mon, 8 Mar 2010 17:01:34 +
   
   
   
   
   Hi, i'v just dumped my segments and found that i have both 2 URLS, the 
   original one (HTTP) with an empty content and the REDIRCTED TO or the 
   DESTINATION URL (HTTPS) with NON EMPTY content !
   
   but in my search i found only the HTTPS URL with an empty content !! 
   logically the content of the HTTPS  URL is not empty !
   it's just mixing the HTTPS url with the content of the HTTP one.
   
   
   our redirect is done by java code  response.sendRedirect(…), so it seams 
   to be http redirect right ??
   
   thx for helping me :)
   
   
Date: Mon, 8 Mar 2010 15:51:34 +0100
From: a...@getopt.org
To: nutch-user@lucene.apache.org
Subject: Re: Content of redirected urls empty

On 2010-03-08 14:55, BELLINI ADAM wrote:


 is there any idea guys ??


 From: mbel...@msn.com
 To: nutch-user@lucene.apache.org
 Subject: Content of redirected urls empty
 Date: Fri, 5 Mar 2010 22:01:05 +



 hi,
 the content of my redirected urls is empty...but still have the 
 other metadata...
 i have an http urls that is redirected to https.
 in my index i find the http URL but with an empty content...
 could you explain it plz?

There are two ways to redirect - one is with protocol, and the other is 
with content (either meta refresh, or javascript).

When you dump the segment, is there really no content for the 
redirected 
url?


-- 
Best regards,
Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

   

RE: Content of redirected urls empty

2010-03-08 Thread BELLINI ADAM


Any ideas, guys?


 From: mbel...@msn.com
 To: nutch-user@lucene.apache.org
 Subject: Content of redirected urls empty
 Date: Fri, 5 Mar 2010 22:01:05 +
 
 
 
 hi,
 the content of my redirected urls is empty...but still have the other 
 metadata...
 i have an http urls that is redirected to https.
 in my index i find the http URL but with an empty content...
 could you explain it plz?
 

Re: Content of redirected urls empty

2010-03-08 Thread Andrzej Bialecki

On 2010-03-08 14:55, BELLINI ADAM wrote:



is there any idea guys ??



From: mbel...@msn.com
To: nutch-user@lucene.apache.org
Subject: Content of redirected urls empty
Date: Fri, 5 Mar 2010 22:01:05 +



hi,
the content of my redirected urls is empty...but still have the other 
metadata...
i have an http urls that is redirected to https.
in my index i find the http URL but with an empty content...
could you explain it plz?


There are two ways to redirect - one is with protocol, and the other is 
with content (either meta refresh, or javascript).


When you dump the segment, is there really no content for the redirected 
url?
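
A minimal sketch of the distinction drawn above, assuming the response has already been fetched and is available as a status code, a header mapping and a body string. The function name `redirect_kind` is hypothetical, the meta refresh check is a simplified regular expression rather than a real HTML parser, and JavaScript redirects are not detected at all.

```python
import re

# Simplified pattern for <meta http-equiv="refresh" ...>; real pages may need an HTML parser.
META_REFRESH = re.compile(
    r"""<meta[^>]+http-equiv=["']?refresh["']?[^>]*>""", re.IGNORECASE
)

def redirect_kind(status, headers, body=""):
    """Classify a response as a 'protocol' redirect, a 'content' redirect, or 'none'."""
    header_names = {name.lower() for name in headers}
    if 300 <= status < 400 and "location" in header_names:
        return "protocol"  # HTTP 3xx status with a Location header
    if META_REFRESH.search(body):
        return "content"   # meta refresh embedded in the page itself
    return "none"

if __name__ == "__main__":
    print(redirect_kind(302, {"Location": "https://myDNS/index.html"}))
```

A protocol redirect carries no page body worth indexing, so empty content for the source URL is expected in that case; the interesting question is what was stored for the target URL.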



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: Content of redirected urls empty

2010-03-08 Thread BELLINI ADAM



Hi, I've just dumped my segments and found that I have both URLs: the original
one (HTTP) with empty content and the redirect target, i.e. the destination URL
(HTTPS), with NON-EMPTY content!

But in my search I found only the HTTPS URL with empty content!! Logically
the content of the HTTPS URL is not empty!
It's just mixing the HTTPS URL with the content of the HTTP one.


Our redirect is done by Java code, response.sendRedirect(…), so it seems to be
an HTTP redirect, right?

thx for helping me :)


 Date: Mon, 8 Mar 2010 15:51:34 +0100
 From: a...@getopt.org
 To: nutch-user@lucene.apache.org
 Subject: Re: Content of redirected urls empty
 
 On 2010-03-08 14:55, BELLINI ADAM wrote:
 
 
  is there any idea guys ??
 
 
  From: mbel...@msn.com
  To: nutch-user@lucene.apache.org
  Subject: Content of redirected urls empty
  Date: Fri, 5 Mar 2010 22:01:05 +
 
 
 
  hi,
  the content of my redirected urls is empty...but still have the other 
  metadata...
  i have an http urls that is redirected to https.
  in my index i find the http URL but with an empty content...
  could you explain it plz?
 
 There are two ways to redirect - one is with protocol, and the other is 
 with content (either meta refresh, or javascript).
 
 When you dump the segment, is there really no content for the redirected 
 url?
 
 
 -- 
 Best regards,
 Andrzej Bialecki 
   ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com
 
  

RE: Content of redirected urls empty

2010-03-08 Thread BELLINI ADAM

I'm sorry... I just checked again, and in my index I have the original URL,
which is the HTTP one with the empty content... but it doesn't index the HTTPS
one... and I'm using the Solr index.
thx



 From: mbel...@msn.com
 To: nutch-user@lucene.apache.org
 Subject: RE: Content of redirected urls empty
 Date: Mon, 8 Mar 2010 17:01:34 +
 
 
 
 
 Hi, i'v just dumped my segments and found that i have both 2 URLS, the 
 original one (HTTP) with an empty content and the REDIRCTED TO or the 
 DESTINATION URL (HTTPS) with NON EMPTY content !
 
 but in my search i found only the HTTPS URL with an empty content !! 
 logically the content of the HTTPS  URL is not empty !
 it's just mixing the HTTPS url with the content of the HTTP one.
 
 
 our redirect is done by java code  response.sendRedirect(…), so it seams to 
 be http redirect right ??
 
 thx for helping me :)
 
 
  Date: Mon, 8 Mar 2010 15:51:34 +0100
  From: a...@getopt.org
  To: nutch-user@lucene.apache.org
  Subject: Re: Content of redirected urls empty
  
  On 2010-03-08 14:55, BELLINI ADAM wrote:
  
  
   is there any idea guys ??
  
  
   From: mbel...@msn.com
   To: nutch-user@lucene.apache.org
   Subject: Content of redirected urls empty
   Date: Fri, 5 Mar 2010 22:01:05 +
  
  
  
   hi,
   the content of my redirected urls is empty...but still have the other 
   metadata...
   i have an http urls that is redirected to https.
   in my index i find the http URL but with an empty content...
   could you explain it plz?
  
  There are two ways to redirect - one is with protocol, and the other is 
  with content (either meta refresh, or javascript).
  
  When you dump the segment, is there really no content for the redirected 
  url?
  
  
  -- 
  Best regards,
  Andrzej Bialecki 
___. ___ ___ ___ _ _   __
  [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
  ___|||__||  \|  ||  |  Embedded Unix, System Integration
  http://www.sigram.com  Contact: info at sigram dot com
  
 

Content of redirected urls empty

2010-03-05 Thread BELLINI ADAM


hi,
the content of my redirected URLs is empty... but they still have the other metadata...
I have an HTTP URL that is redirected to HTTPS.
In my index I find the HTTP URL, but with empty content...
Could you explain it, please?
  