Re: [R] readHTMLTable() in XML package
Hadley,

Thanks. I ran into the same roadblock when I used your code below and increased i to loop over all pages. I think the problem is that the website I'm scraping is getting hammered with users, and the error is simply a timeout.

I have provisionally solved my problem by wrapping some try() statements in appropriate places, plus some conditional if/else statements to skip over steps if a timeout occurs. Not sure if this is elegant, but my sledgehammer approach is working now.

-----Original Message-----
From: Hadley Wickham [mailto:h.wick...@gmail.com]
Sent: Monday, March 02, 2015 2:05 PM
To: Doran, Harold
Cc: r-help@r-project.org
Subject: Re: [R] readHTMLTable() in XML package

This somewhat simpler rvest code does the trick for me:

    library(rvest)
    library(dplyr)

    i <- 1:10
    urls <- paste0('http://games.crossfit.com/scores/leaderboard.php?stage=5',
      '&sort=0&division=1&region=0&numberperpage=100&competition=0&frontpage=0',
      '&expanded=1&year=15&full=1&showtoggles=0&hidedropdowns=0&showathleteac=1',
      '&is_mobile=0&page=', i)

    results_table <- function(url) {
      url %>% html %>% html_table(fill = TRUE) %>% .[[1]]
    }

    results <- lapply(urls, results_table)
    out <- results %>% bind_rows()

Hadley

--
http://had.co.nz/

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
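The try()-and-skip approach described in the reply above could look roughly like the sketch below. This is not the code used in the thread: fetch_or_null() and fake_fetch() are hypothetical names, the example.com URLs are placeholders, and the retry count is an added assumption. The page reader is passed in as an argument so the sketch runs offline; in practice it would be whatever function downloads and parses one page.

```r
# Hypothetical helper: call fetch(url) and return NULL instead of stopping
# when the call signals an error (for example, a timeout).
fetch_or_null <- function(url, fetch, max_tries = 3, wait = 0) {
  for (attempt in seq_len(max_tries)) {
    result <- try(fetch(url), silent = TRUE)
    if (!inherits(result, "try-error")) {
      return(result)
    }
    message("Attempt ", attempt, " failed for ", url)
    # Pause before the next attempt (no pause after the last one).
    if (attempt < max_tries) Sys.sleep(wait)
  }
  NULL
}

# Stand-in for a real page reader: it always fails for page 2.
fake_fetch <- function(url) {
  if (grepl("page=2$", url)) stop("timeout")
  data.frame(page = url, stringsAsFactors = FALSE)
}

urls <- paste0("http://example.com/leaderboard?page=", 1:3)
pages <- lapply(urls, fetch_or_null, fetch = fake_fetch)

# Skip the pages that never loaded, then combine the rest.
ok <- !vapply(pages, is.null, logical(1))
dat <- do.call(rbind, pages[ok])
failed <- urls[!ok]
```

Keeping the failed URLs in a vector makes it possible to rerun only those pages later, once the site is under less load.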
[R] readHTMLTable() in XML package
I'm having trouble pulling down data from a website with my code below. I keep encountering the same error, but it occurs on different pages. My code loops through the website and grabs data from the HTML table on each page. The error appears on different pages at different times, and I'm not sure of the root cause.

Error in readHTMLTable(readLines(url), which = 1, header = TRUE) :
  error in evaluating the argument 'doc' in selecting a method for function 'readHTMLTable':
Error in readHTMLTable(readLines(url), which = 1, header = TRUE) :
  error in evaluating the argument 'doc' in selecting a method for function 'readHTMLTable':

    library(XML)

    for(i in 1:1000){
        # Build the URL for page i of the leaderboard
        url <- paste(paste('http://games.crossfit.com/scores/leaderboard.php?stage=5&sort=0&page=', i, sep=''),
            '&division=1&region=0&numberperpage=100&competition=0&frontpage=0&expanded=1&year=15&full=1&showtoggles=0&hidedropdowns=0&showathleteac=1&is_mobile=0',
            sep='')
        tmp <- readHTMLTable(readLines(url), which=1, header=TRUE)
        # Strip newlines and spaces from the column names and newlines from the cells
        names(tmp) <- gsub("\\n", "", names(tmp))
        names(tmp) <- gsub(" +", "", names(tmp))
        tmp[] <- lapply(tmp, function(x) gsub("\\n", "", x))
        if(i == 1){
            dat <- tmp
        } else {
            dat <- rbind(dat, tmp)
        }
        cat('Grabbing data from page', i, '\n')
    }

Thanks,
Harold
Re: [R] readHTMLTable() in XML package
This somewhat simpler rvest code does the trick for me:

    library(rvest)
    library(dplyr)

    i <- 1:10
    urls <- paste0('http://games.crossfit.com/scores/leaderboard.php?stage=5',
      '&sort=0&division=1&region=0&numberperpage=100&competition=0&frontpage=0',
      '&expanded=1&year=15&full=1&showtoggles=0&hidedropdowns=0&showathleteac=1',
      '&is_mobile=0&page=', i)

    results_table <- function(url) {
      url %>% html %>% html_table(fill = TRUE) %>% .[[1]]
    }

    results <- lapply(urls, results_table)
    out <- results %>% bind_rows()

Hadley

On Mon, Mar 2, 2015 at 10:00 AM, Doran, Harold hdo...@air.org wrote:
> I'm having trouble pulling down data from a website with my code below.
> I keep encountering the same error, but it occurs on different pages.
> My code loops through the website and grabs data from the HTML table on
> each page. The error appears on different pages at different times, and
> I'm not sure of the root cause.

--
http://had.co.nz/
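One difference between the two versions in this thread is how pages are combined: the question grows a data frame with rbind() inside the loop, while the reply collects every page in a list and binds once at the end. The sketch below illustrates both shapes with a hypothetical read_page() stand-in instead of a real download, and it uses base R's do.call(rbind, ...) in place of dplyr's bind_rows() so that it runs without extra packages.

```r
# Hypothetical stand-in for a page reader: two rows per page.
read_page <- function(i) {
  data.frame(page = i, rank = 1:2 + 2 * (i - 1))
}

# Growing the result inside the loop, as in the original question.
dat <- NULL
for (i in 1:3) {
  dat <- rbind(dat, read_page(i))
}

# Collecting all pages first and binding once, the shape used in the reply.
pages <- lapply(1:3, read_page)
out <- do.call(rbind, pages)

# Both routes hold the same rows in the same order.
stopifnot(nrow(dat) == nrow(out))
```

Binding once avoids copying the accumulated data frame on every iteration, which may matter when looping over many pages.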