Re: [R] readHTMLTable() in XML package

2015-03-03 Thread Doran, Harold
Hadley

Thanks. I ran into the same roadblock when I used your code below and
increased i to loop over all pages. I think the problem is that the website
I'm scraping is getting hammered with users, so the error is just a timeout.

I have provisionally solved my problem by wrapping the appropriate calls in
try() statements and adding if/else conditions to skip steps when a timeout
occurs. Not sure if this is elegant, but my sledgehammer approach is working
now.
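
A minimal sketch of what such a try()-based wrapper might look like. This is not Harold's actual code: fetch_or_skip(), fake_fetch(), the retry count, and the placeholder page names are all hypothetical, and fake_fetch() stands in for the real download-and-parse step so the example runs without a network connection.

```r
# Hypothetical helper: call fetch(url), retrying on error, and return NULL
# if every attempt fails so the caller can skip that page.
fetch_or_skip <- function(url, fetch, attempts = 3, wait = 0) {
  for (k in seq_len(attempts)) {
    res <- try(fetch(url), silent = TRUE)
    if (!inherits(res, "try-error")) return(res)
    Sys.sleep(wait)  # pause before the next attempt
  }
  NULL
}

# Stand-in for the real download-and-parse step: fails on "bad" URLs.
fake_fetch <- function(url) {
  if (grepl("bad", url)) stop("timeout")
  data.frame(url = url, stringsAsFactors = FALSE)
}

urls <- c("page1", "bad-page", "page3")
pages <- lapply(urls, fetch_or_skip, fetch = fake_fetch)

ok <- !vapply(pages, is.null, logical(1))  # TRUE where a page was retrieved
dat <- do.call(rbind, pages[ok])           # combine only the successful pages
failed <- urls[!ok]                        # pages to retry later
```

Keeping the list of failed pages makes it possible to rerun just those once the site is under less load, rather than restarting the whole loop.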



-----Original Message-----
From: Hadley Wickham [mailto:h.wick...@gmail.com] 
Sent: Monday, March 02, 2015 2:05 PM
To: Doran, Harold
Cc: r-help@r-project.org
Subject: Re: [R] readHTMLTable() in XML package

This somewhat simpler rvest code does the trick for me:

library(rvest)
library(dplyr)

i <- 1:10
urls <- paste0('http://games.crossfit.com/scores/leaderboard.php?stage=5',
  '&sort=0&division=1&region=0&numberperpage=100&competition=0&frontpage=0',
  '&expanded=1&year=15&full=1&showtoggles=0&hidedropdowns=0&showathleteac=1',
  '&is_mobile=0&page=', i)

results_table <- function(url) {
  url %>% html %>% html_table(fill = TRUE) %>% .[[1]]
}

results <- lapply(urls, results_table)
out <- results %>% bind_rows()

Hadley

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] readHTMLTable() in XML package

2015-03-02 Thread Doran, Harold
I'm having trouble pulling down data from a website with my code below. I
keep encountering the same error, but it occurs on different pages.

My code below loops through a website and grabs data from the HTML table on
each page. The error appears on different pages at different times and I'm
not sure of the root cause.

Error in readHTMLTable(readLines(url), which = 1, header = TRUE) :
  error in evaluating the argument 'doc' in selecting a method for function 
'readHTMLTable': Error in readHTMLTable(readLines(url), which = 1, header = 
TRUE) :
  error in evaluating the argument 'doc' in selecting a method for function 
'readHTMLTable':

library(XML)

for (i in 1:1000) {
  # Build the URL for page i of the leaderboard
  url <- paste(paste('http://games.crossfit.com/scores/leaderboard.php?stage=5&sort=0&page=',
                     i, sep=''),
               '&division=1&region=0&numberperpage=100&competition=0&frontpage=0&expanded=1&year=15&full=1&showtoggles=0&hidedropdowns=0&showathleteac=1&is_mobile=0',
               sep='')
  tmp <- readHTMLTable(readLines(url), which=1, header=TRUE)

  # Strip newlines and spaces from the column names, and newlines from the cells
  names(tmp) <- gsub("\\n", "", names(tmp))
  names(tmp) <- gsub(" +", "", names(tmp))
  tmp[] <- lapply(tmp, function(x) gsub("\\n", "", x))

  # Append this page to the accumulated data
  if (i == 1) {
    dat <- tmp
  } else {
    dat <- rbind(dat, tmp)
  }
  cat('Grabbing data from page', i, '\n')
}

Thanks,
Harold
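
The error above says that evaluating the 'doc' argument failed, which means the failure is in readLines(url) rather than in the table parsing itself. A minimal sketch of running the download step separately so that the underlying message is visible; download_page(), timeout_reader(), and the example.invalid URL are hypothetical names used only for illustration, and the stand-in reader avoids a real network call.

```r
# Hypothetical helper: run the download step on its own and report why it failed.
# Returns the page lines on success, or NULL after printing the error message.
download_page <- function(url, reader = readLines) {
  tryCatch(
    reader(url),
    error = function(e) {
      message("Download failed for ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Stand-in reader that always fails, to show the flow without a network call.
timeout_reader <- function(url) stop("cannot open URL: timed out")

lines <- download_page("http://example.invalid/page=7", reader = timeout_reader)

# Only hand the page to the parser when the download actually succeeded.
page_ok <- !is.null(lines)
```

With the real readLines as the reader, a page that comes back NULL can be skipped or retried instead of stopping the whole loop.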



Re: [R] readHTMLTable() in XML package

2015-03-02 Thread Hadley Wickham
This somewhat simpler rvest code does the trick for me:

library(rvest)
library(dplyr)

i <- 1:10
urls <- paste0('http://games.crossfit.com/scores/leaderboard.php?stage=5',
  '&sort=0&division=1&region=0&numberperpage=100&competition=0&frontpage=0',
  '&expanded=1&year=15&full=1&showtoggles=0&hidedropdowns=0&showathleteac=1',
  '&is_mobile=0&page=', i)

results_table <- function(url) {
  url %>% html %>% html_table(fill = TRUE) %>% .[[1]]
}

results <- lapply(urls, results_table)
out <- results %>% bind_rows()

Hadley
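
The code above replaces the explicit for loop with a vector of URLs: paste0() recycles the fixed strings against the vector i, so one call yields one URL per page. A minimal sketch of that step alone, using a shortened query string (the abbreviated URL and the name base are assumptions for illustration only):

```r
# Shortened, illustrative base URL; the real query string has more parameters.
base <- "http://games.crossfit.com/scores/leaderboard.php?stage=5"

i <- 1:3
# paste0() is vectorised: the scalar strings are recycled across i.
urls <- paste0(base, "&page=", i)

n_urls <- length(urls)  # one URL per page number
```

Note that in later rvest releases html() was superseded by read_html(), so the pipeline would likely need that substitution to run against a current installation.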

-- 
http://had.co.nz/
