Re: [R] RCurl unable to download a particular web page -- what is so special about this web page?

2009-01-27 Thread clair.crossup...@googlemail.com
Thank you Duncan.

I remember seeing in your documentation that you have used this
'verbose=TRUE' argument in functions before when trying to see what is
going on. This is good. However, I have not been able to get it to
work for me. Does the output appear in R, or do you use some other
external window (e.g. an MS-DOS window)?

> library(RCurl)
> my.url <- 'http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2'
> getURL(my.url, verbose = TRUE)
[1] ""



I am having a problem with a new webpage (http://uk.youtube.com/), but
if I can get this verbose to work, then I think I will be able to
Google the right action to take based on the information it gives.

Many thanks for your time,
C.C.




Re: [R] RCurl unable to download a particular web page -- what is so special about this web page?

2009-01-27 Thread Duncan Temple Lang




The libcurl code typically defaults to printing on the console.
So on the Windows GUI, this will not show up. Using
a shell (MS-DOS window or Unix-like shell) should
cause the output to be displayed.

A more general way, however, is to use the debugfunction
option.

d = debugGatherer()

getURL("http://uk.youtube.com",
       debugfunction = d$update, verbose = TRUE)

When this completes, use

 d$value()

and you have the entire contents that would be displayed on the console.
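
For instance, to look at just one piece of the trace afterwards (a small
sketch based on the component names d$value() reports, namely text,
headerIn, headerOut, dataIn and dataOut; the reset function is also part
of what debugGatherer() returns):

# d$value() is a named character vector, so single components
# of the trace can be picked out directly
cat(d$value()["headerIn"])    # just the response headers from the server
cat(d$value()["headerOut"])   # just the request headers R sent
d$reset()                     # clear the accumulated log before the next call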


 D.






Re: [R] RCurl unable to download a particular web page -- what is so special about this web page?

2009-01-27 Thread clair.crossup...@googlemail.com
Thank you. The output I get from that example is below:

> d = debugGatherer()
> getURL("http://uk.youtube.com",
+        debugfunction = d$update, verbose = TRUE)
[1] ""
> d$value()

text
"About to connect() to uk.youtube.com port 80 (#0)\n  Trying
208.117.236.72... connected\nConnected to uk.youtube.com
(208.117.236.72) port 80 (#0)\nConnection #0 to host uk.youtube.com
left intact\n"

headerIn
"HTTP/1.1 400 Bad Request\r\nVia: 1.1 PFO-FIREWALL\r\nConnection: Keep-
Alive\r\nProxy-Connection: Keep-Alive\r\nTransfer-Encoding: chunked\r
\nExpires: Tue, 27 Apr 1971 19:44:06 EST\r\nDate: Tue, 27 Jan 2009
15:31:25 GMT\r\nContent-Type: text/plain\r\nServer: Apache\r\nX-
Content-Type-Options: nosniff\r\nCache-Control: no-cache\r
\nCneonction: close\r\n\r\n"

headerOut
"GET / HTTP/1.1\r\nHost: uk.youtube.com\r\nAccept: */*\r\n\r\n"

dataIn
"0\r\n\r\n"

dataOut
""


So the critical information from this is the '400 Bad Request'. A
Google search defines this for me as:

The request could not be understood by the server due to malformed
syntax. The client SHOULD NOT repeat the request without
modifications.


Looking through both sort(listCurlOptions()) and
http://curl.haxx.se/libcurl/c/curl_easy_setopt.htm doesn't really
help me this time (unless I missed something). Any advice?

Thank you for your time,
C.C

P.S. I can get the download to work if I use:
> toString(readLines("http://www.uk.youtube.com"))
[1] "<html>, \t<head>, \t\t<title>OpenDNS</title>, \t</head>, ,
\t<body id=\"mainbody\" onLoad=\"testforbanner();\" style=\"margin:
0px;\">, \t\t<script language=\"JavaScript\">, \t\t\tfunction
testforbanner() {, \t\t\t\tvar width;, \t\t\t\tvar height;, \t\t\t
\tvar x = 0;, \t\t\t\tvar isbanner = false;, \t\t\t\tvar bannersizes =
new Array(16), \t\t\t\tbannersizes[0] = " [etc]






Re: [R] RCurl unable to download a particular web page -- what is so special about this web page?

2009-01-27 Thread Duncan Temple Lang


Some Web servers are strict. In this case, the server won't accept
a request without being told who is asking, i.e. the User-Agent.

If you use

 getURL("http://www.youtube.com",
        httpheader = c("User-Agent" = "R (2.9.0)"))

you should get the contents of the page as expected.


(Or with URL uk.youtube.com, etc.)
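
An equivalent sketch, assuming RCurl also exposes libcurl's
CURLOPT_USERAGENT under the option name useragent (see
listCurlOptions()), sets the agent directly instead of building the
header by hand:

# assumed option name; same effect as the httpheader form above
getURL("http://www.youtube.com", useragent = "R (2.9.0)")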


 D.



Re: [R] RCurl unable to download a particular web page -- what is so special about this web page?

2009-01-27 Thread clair.crossup...@googlemail.com
Oops, I meant:

> toString(readLines("http://uk.youtube.com"))
[1] "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\"
\"http://www.w3.org/TR/1999/REC-html401-19991224/loose.dtd\">, , ,
\t<html lang=\"en\">, , <!-- machid: 302 -->, <head>, , \t,
\t<title>YouTube - Broadcast Yourself.</title>,
[etc]
Warning message:
In readLines("http://uk.youtube.com") :
  incomplete final line found on 'http://uk.youtube.com'



Re: [R] RCurl unable to download a particular web page -- what is so special about this web page?

2009-01-27 Thread clair.crossup...@googlemail.com
Cheers Duncan, that worked great.

> getURL("http://uk.youtube.com", httpheader = c("User-Agent" = "R (2.8.1)"))
[1] "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\"
\"http://www.w3.org/TR/1999/REC-html401-19991224/loose.dtd\">\n\n
[etc]

May I ask if there was a specific manual you read to learn these
things, please? I do not think I could have worked that one out on my
own.

Thank you again for your time,
C.C


Re: [R] RCurl unable to download a particular web page -- what is so special about this web page?

2009-01-27 Thread Duncan Temple Lang





Unfortunately, other than reading the HTTP specification,
I don't think there is a comprehensive manual for saying
what should work and what might not.  Much of this is
subject to different levels of strictness and various
policy choices.

This particular issue, a missing User-Agent, is fairly common.
So experience is a big component, but
the libcurl documentation and the mailing
lists are good resources.

It is because of these variations (use of different protocols,
cookies, etc.) that RCurl is necessary when
url() and download.file() don't allow enough customization.
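
As an illustration, several such options can be combined in a single
call (a hedged sketch: the option names follow libcurl's, and the
cookie file name here is made up):

# follow redirects, identify the client, and reuse cookies from a file
getURL("http://www.nytimes.com/",
       followlocation = TRUE,
       useragent      = "R (2.8.1)",
       cookiefile     = "cookies.txt")   # hypothetical path, for illustration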

One of the useful tricks is to
find a call (be it in R or a command-line utility such as
wget or curl) that does work for a particular URL.
Then use something like the verbose/debug options,
or tools such as tcpdump or wireshark, to observe
the communication that succeeds, and then do the same
for the call that didn't.  Comparing the differences
is a general way to home in on the necessary invocation
elements.
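
In R, that comparison might look like the following (a sketch reusing
debugGatherer() from earlier in this thread; the second call adds the
User-Agent header that made the YouTube request succeed):

bad = debugGatherer()
good = debugGatherer()

getURL("http://uk.youtube.com", debugfunction = bad$update, verbose = TRUE)
getURL("http://uk.youtube.com", debugfunction = good$update, verbose = TRUE,
       httpheader = c("User-Agent" = "R (2.8.1)"))

# the difference in the outgoing request headers points at the missing piece
cat(bad$value()["headerOut"])
cat(good$value()["headerOut"])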

 D.




Re: [R] RCurl unable to download a particular web page -- what is so special about this web page?

2009-01-26 Thread Tony Breyal
Hi, I ran your getURL example and had the same problem with
downloading the file.

## R Start..
> library(RCurl)
> toString(getURL("http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2"))
[1] ""
## R end.

However, it is interesting that if you manually save the page to
your desktop, getURL works fine on it:

## R Start..
> library(RCurl)
> toString(getURL('file://PFO-SBS001//Redirected//tonyb//Desktop//webpage.html'))
[1] "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n<!DOCTYPE HTML PUBLIC \"-//W3C//DTD
HTML 4.01 Transitional//EN\" \"http://www.w3.org/TR/html4/loose.dtd\">
\n<html>\n<head>\n"
[etc...]
## R end.


Very strange indeed. I use RCurl for web crawling every now and again,
so I would be interested in knowing why this happens too :-)

Tony Breyal





Re: [R] RCurl unable to download a particular web page -- what is so special about this web page?

2009-01-26 Thread Duncan Temple Lang



clair.crossup...@googlemail.com wrote:

Dear R-help,

There seems to be a web page I am unable to download using RCurl. I
don't understand why it won't download:


> library(RCurl)
> my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2"
> getURL(my.url)
[1] ""




I like the irony that RCurl seems to have difficulties downloading an
article about R. Good thing it is just a matter of additional arguments
to getURL() or it would be bad news.


The followlocation parameter defaults to FALSE, so

  getURL(my.url, followlocation = TRUE)

gets what you want.

The way I found this is

 getURL(my.url, verbose = TRUE)

and take a look at the information being sent from R
and received by R from the server.

This gives

* About to connect() to www.nytimes.com port 80 (#0)
*   Trying 199.239.136.200... * connected
* Connected to www.nytimes.com (199.239.136.200) port 80 (#0)
> GET /2009/01/07/technology/business-computing/07program.html?_r=2 HTTP/1.1
> Host: www.nytimes.com
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Server: Sun-ONE-Web-Server/6.1
< Date: Mon, 26 Jan 2009 16:10:51 GMT
< Content-length: 0
< Content-type: text/html
< Location: 
http://www.nytimes.com/glogin?URI=http://www.nytimes.com/2009/01/07/technology/business-computing/07program.htmlOQ=_rQ3D3op=42fceb38q2fq5duarq5d3-z8q26--q24jq5djccq7bq5dcmq5dc1q5dq24...@-f-q2anq5dry8h@a88q3dz-dbyq...@q2aq5dc1bq26-q2aq26q5bddfq24df



And the 301 is the critical thing here.
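
One way to confirm the redirect is actually being followed is to collect
the response headers alongside the body (a sketch assuming RCurl's
basicHeaderGatherer(), whose update function can be passed as the
headerfunction option and whose value() includes the parsed status):

h = basicHeaderGatherer()
txt = getURL(my.url, followlocation = TRUE, headerfunction = h$update)

h$value()["status"]   # expect "200" once the 301 has been followed
nchar(txt)            # non-empty now, unlike the "" above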

 D.



Other web pages are ok to download, but this is the first time I have
been unable to download a web page using the very nice RCurl package.
While I can download the webpage using the RDCOMClient, I would like
to understand why it doesn't work as above, please?





> library(RDCOMClient)
> my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2"
> ie <- COMCreate("InternetExplorer.Application")
> txt <- list()
> ie$Navigate(my.url)
NULL
> while(ie[["Busy"]]) Sys.sleep(1)
> txt[[my.url]] <- ie[["document"]][["body"]][["innerText"]]
> txt
$`http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2`
[1] "Skip to article Try Electronic Edition Log ..."


Many thanks for your time,
C.C

Windows Vista, running with administrator privileges.

> sessionInfo()

R version 2.8.1 (2008-12-22)
i386-pc-mingw32

locale:
LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.
1252;LC_MONETARY=English_United Kingdom.
1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats graphics  grDevices utils datasets  methods
base

other attached packages:
[1] RDCOMClient_0.92-0 RCurl_0.94-0

loaded via a namespace (and not attached):
[1] tools_2.8.1



Re: [R] RCurl unable to download a particular web page -- what is so special about this web page?

2009-01-26 Thread Jeffrey Horner

Duncan Temple Lang wrote:



I like the irony that RCurl seems to have difficulties downloading an
article about R. Good thing it is just a matter of additional arguments
to getURL() or it would be bad news.
Don't forget the irony that https is supported in url() and 
download.file() on Windows but not UNIX...


http://tolstoy.newcastle.edu.au/R/e2/devel/07/01/1634.html

Jeff


