Re: [R] RCurl unable to download a particular web page -- what is so special about this web page?
Thank you Duncan. I remember seeing in your documentation that you have used this 'verbose = TRUE' argument in functions before when trying to see what is going on. This is good. However, I have not been able to get it to work for me. Does the output appear in R, or do you use some other external window (e.g. an MS-DOS window)?

library(RCurl)
my.url <- 'http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2'
getURL(my.url, verbose = TRUE)
[1] ""

I am having a problem with a new web page (http://uk.youtube.com/), but if I can get this verbose output to work, I think I will be able to google the right action to take based on the information it gives.

Many thanks for your time,
C.C.
Re: [R] RCurl unable to download a particular web page -- what is so special about this web page?
clair.crossup...@googlemail.com wrote:
> However, I have not been able to get it to work for me. Does the output
> appear in R, or do you use some other external window (e.g. an MS-DOS window)?

The libcurl code typically defaults to printing on the console, so in the Windows GUI this will not show up. Running R from a shell (an MS-DOS window or a Unix-like shell) should cause the output to be displayed.

A more general way, however, is to use the debugfunction option:

d = debugGatherer()
getURL("http://uk.youtube.com", debugfunction = d$update, verbose = TRUE)

When this completes, use

d$value()

and you have the entire contents that would be displayed on the console.

D.
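For reference, a minimal sketch of inspecting the gathered output afterwards. It assumes (as the output later in this thread suggests) that d$value() returns a named character vector whose names are the libcurl debug categories, and that d$reset() clears the buffer between calls:

library(RCurl)

d <- debugGatherer()
getURL("http://uk.youtube.com", debugfunction = d$update, verbose = TRUE)

info <- d$value()
names(info)             # e.g. "text", "headerIn", "headerOut", "dataIn", "dataOut"
cat(info["headerOut"])  # the request that R sent to the server
cat(info["headerIn"])   # the response headers the server sent back
d$reset()               # clear the gathered output before the next call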
Re: [R] RCurl unable to download a particular web page -- what is so special about this web page?
Thank you. The output I get from that example is below:

d = debugGatherer()
getURL("http://uk.youtube.com", debugfunction = d$update, verbose = TRUE)
[1] ""
d$value()
text
"About to connect() to uk.youtube.com port 80 (#0)\n  Trying 208.117.236.72... connected\nConnected to uk.youtube.com (208.117.236.72) port 80 (#0)\nConnection #0 to host uk.youtube.com left intact\n"
headerIn
"HTTP/1.1 400 Bad Request\r\nVia: 1.1 PFO-FIREWALL\r\nConnection: Keep-Alive\r\nProxy-Connection: Keep-Alive\r\nTransfer-Encoding: chunked\r\nExpires: Tue, 27 Apr 1971 19:44:06 EST\r\nDate: Tue, 27 Jan 2009 15:31:25 GMT\r\nContent-Type: text/plain\r\nServer: Apache\r\nX-Content-Type-Options: nosniff\r\nCache-Control: no-cache\r\nCneonction: close\r\n\r\n"
headerOut
"GET / HTTP/1.1\r\nHost: uk.youtube.com\r\nAccept: */*\r\n\r\n"
dataIn
"0\r\n\r\n"
dataOut

So the critical information from this is the '400 Bad Request'. A Google search defines this for me as: "The request could not be understood by the server due to malformed syntax. The client SHOULD NOT repeat the request without modifications."

Looking through both sort(listCurlOptions()) and http://curl.haxx.se/libcurl/c/curl_easy_setopt.htm doesn't really help me this time (unless I missed something). Any advice?

Thank you for your time,
C.C

P.S. I can get the download to work if I use:

toString(readLines("http://www.uk.youtube.com"))
[1] "<html>, \t<head>, \t\t<title>OpenDNS</title>, \t</head>, , \t<body id=\"mainbody\" onLoad=\"testforbanner();\" style=\"margin: 0px;\">, \t\t<script language=\"JavaScript\">, \t\t\tfunction testforbanner() {, \t\t\t\tvar width;, \t\t\t\tvar height;, \t\t\t\tvar x = 0;, \t\t\t\tvar isbanner = false;, \t\t\t\tvar bannersizes = new Array(16), \t\t\t\tbannersizes[0] =" [etc]
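As an aside, the list of options can be searched from within R rather than read end to end. A small sketch (listCurlOptions() is mentioned above; the grep pattern is only an illustration):

library(RCurl)

opts <- sort(listCurlOptions())
length(opts)                                   # all libcurl options RCurl knows about
grep("agent|header|user", opts, value = TRUE)  # candidates relating to request identity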
Re: [R] RCurl unable to download a particular web page -- what is so special about this web page?
Some Web servers are strict. In this case, the server won't accept a request without being told who is asking, i.e. the User-Agent. If you use

getURL("http://www.youtube.com", httpheader = c("User-Agent" = "R (2.9.0)"))

you should get the contents of the page as expected. (Or with the URL uk.youtube.com, etc.)

D.
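If several pages are to be fetched, the header can be attached once to a reusable curl handle rather than repeated in every call. A minimal sketch, assuming getCurlHandle() accepts libcurl options such as useragent and followlocation and that the handle is passed back via the curl argument:

library(RCurl)

# Set the identity (and redirect following) once on a reusable handle
h <- getCurlHandle(useragent = "R (2.8.1)", followlocation = TRUE)

youtube <- getURL("http://uk.youtube.com", curl = h)
nytimes <- getURL("http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2", curl = h)

nchar(youtube)   # non-zero now, rather than the empty string seen earlier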
Re: [R] RCurl unable to download a particular web page -- what is so special about this web page?
Oops, I meant:

toString(readLines("http://uk.youtube.com"))
[1] "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" \"http://www.w3.org/TR/1999/REC-html401-19991224/loose.dtd\">, , , \t<html lang=\"en\">, , <!-- machid: 302 -->, <head>, , \t, \t<title>YouTube - Broadcast Yourself.</title>," [etc]
Warning message:
In readLines("http://uk.youtube.com") :
  incomplete final line found on 'http://uk.youtube.com'
Re: [R] RCurl unable to download a particular web page -- what is so special about this web page?
Cheers Duncan, that worked great:

getURL("http://uk.youtube.com", httpheader = c("User-Agent" = "R (2.8.1)"))
[1] "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" \"http://www.w3.org/TR/1999/REC-html401-19991224/loose.dtd\">\n\n" [etc]

May I ask if there was a specific manual you read to learn these things, please? I do not think I could have worked that one out on my own.

Thank you again for your time,
C.C
Re: [R] RCurl unable to download a particular web page -- what is so special about this web page?
clair.crossup...@googlemail.com wrote:
> May I ask if there was a specific manual you read to learn these things,
> please? I do not think I could have worked that one out on my own.

Unfortunately, other than reading the HTTP specification, I don't think there is a comprehensive manual saying what should work and what might not. Much of this is subject to different levels of strictness and various policy choices. This particular one of no User-Agent is a somewhat common issue. So experience is a big component, but the libcurl documentation and the mailing lists are good resources.

It is because of these variations, use of different protocols, cookies, etc. that RCurl is necessary when url() and download.file() don't allow enough customization.

One of the useful tricks is to find a call (be it in R or a command-line utility such as wget or curl) that does work for a particular URL. Then use something like the verbose/debug options, or tcpdump/wireshark or several others, to observe the communication that succeeds, and then do the same for the call that didn't. Comparing the differences is a general way to home in on the necessary invocation elements.

D.
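Within R, that comparison can be done with the debugGatherer() approach used earlier in the thread. A minimal sketch (the helper function and its name are purely illustrative, and it assumes d$value() can be indexed by the "headerOut" category):

library(RCurl)

# Illustrative helper: capture the raw request a getURL() call sends
capture_request <- function(url, ...) {
  d <- debugGatherer()
  getURL(url, debugfunction = d$update, verbose = TRUE, ...)
  d$value()["headerOut"]
}

failing <- capture_request("http://uk.youtube.com")
working <- capture_request("http://uk.youtube.com",
                           httpheader = c("User-Agent" = "R (2.8.1)"))

# Compare the two requests by eye; the extra User-Agent line is the difference
cat(failing, "----", working, sep = "\n")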
Re: [R] RCurl unable to download a particular web page -- what is so special about this web page?
Hi, I ran your getURL example and had the same problem with downloading the file:

## R Start..
library(RCurl)
toString(getURL("http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2"))
[1] ""
## R end.

However, it is interesting that if you manually save the page to your desktop, getURL works fine on it:

## R Start..
library(RCurl)
toString(getURL('file:PFO-SBS001//Redirected//tonyb//Desktop//webpage.html'))
[1] "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" \"http://www.w3.org/TR/html4/loose.dtd\">\n<html>\n<head>\n" [etc...]
## R end.

Very strange indeed. I use RCurl for web crawling every now and again, so I would be interested in knowing why this happens too :-)

Tony Breyal

On 26 Jan, 13:58, clair.crossup...@googlemail.com wrote:
> There seems to be a web page I am unable to download using RCurl. I don't
> understand why it won't download. Other web pages are ok to download, but
> this is the first time I have been unable to download a web page using the
> very nice RCurl package.
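The difference becomes visible if you ask libcurl what status the live URL returned. A minimal sketch, assuming getCurlInfo() reports details of the most recent transfer on a reused handle:

library(RCurl)

h <- getCurlHandle()
txt <- getURL("http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2",
              curl = h)
getCurlInfo(h)$response.code   # 301: the server answers with a redirect,
                               # which getURL() does not follow by default
# A copy saved to disk involves no redirect, so the same call succeeds on it.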
Re: [R] RCurl unable to download a particular web page -- what is so special about this web page?
clair.crossup...@googlemail.com wrote:
> Dear R-help,
>
> There seems to be a web page I am unable to download using RCurl. I don't
> understand why it won't download:
>
> library(RCurl)
> my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2"
> getURL(my.url)
> [1] ""

I like the irony that RCurl seems to have difficulties downloading an article about R. Good thing it is just a matter of additional arguments to getURL() or it would be bad news.

The followlocation parameter defaults to FALSE, so

getURL(my.url, followlocation = TRUE)

gets what you want.

The way I found this is

getURL(my.url, verbose = TRUE)

and take a look at the information being sent from R and received by R from the server. This gives

* About to connect() to www.nytimes.com port 80 (#0)
*   Trying 199.239.136.200... * connected
* Connected to www.nytimes.com (199.239.136.200) port 80 (#0)
GET /2009/01/07/technology/business-computing/07program.html?_r=2 HTTP/1.1
Host: www.nytimes.com
Accept: */*
HTTP/1.1 301 Moved Permanently
Server: Sun-ONE-Web-Server/6.1
Date: Mon, 26 Jan 2009 16:10:51 GMT
Content-length: 0
Content-type: text/html
Location: http://www.nytimes.com/glogin?URI=http://www.nytimes.com/2009/01/07/technology/business-computing/07program.htmlOQ=_rQ3D3op=42fceb38q2fq5duarq5d3-z8q26--q24jq5djccq7bq5dcmq5dc1q5dq24...@-f-q2anq5dry8h@a88q3dz-dbyq...@q2aq5dc1bq26-q2aq26q5bddfq24df

And the 301 is the critical thing here.

D.

> Other web pages are ok to download, but this is the first time I have been
> unable to download a web page using the very nice RCurl package. While I can
> download the web page using the RDCOMClient, I would like to understand why
> it doesn't work as above, please?
>
> library(RDCOMClient)
> my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2"
> ie <- COMCreate("InternetExplorer.Application")
> txt <- list()
> ie$Navigate(my.url)
> NULL
> while(ie[["Busy"]]) Sys.sleep(1)
> txt[[my.url]] <- ie[["document"]][["body"]][["innerText"]]
> txt
> $`http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2`
> [1] "Skip to article Try Electronic Edition Log ..."
>
> Many thanks for your time,
> C.C
>
> Windows Vista, running with administrator privileges.
>
> sessionInfo()
> R version 2.8.1 (2008-12-22)
> i386-pc-mingw32
>
> locale:
> LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] RDCOMClient_0.92-0 RCurl_0.94-0
>
> loaded via a namespace (and not attached):
> [1] tools_2.8.1
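A small follow-up sketch showing the redirect being followed and where it ends up. It assumes getCurlInfo() exposes effective.url and response.code for the last transfer on a reused handle, and that libcurl's maxredirs option is available through RCurl under that name:

library(RCurl)

h <- getCurlHandle()
my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2"

page <- getURL(my.url, followlocation = TRUE, maxredirs = 10, curl = h)

info <- getCurlInfo(h)
info$response.code    # 200 once the redirect(s) have been followed
info$effective.url    # the glogin URL the server redirected to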
Re: [R] RCurl unable to download a particular web page -- what is so special about this web page?
Duncan Temple Lang wrote:
> I like the irony that RCurl seems to have difficulties downloading an
> article about R. Good thing it is just a matter of additional arguments
> to getURL() or it would be bad news.

Don't forget the irony that https is supported in url() and download.file() on Windows but not UNIX...

http://tolstoy.newcastle.edu.au/R/e2/devel/07/01/1634.html

Jeff