Re: [R] web scraping tables generated in multiple server pages

2016-05-11 Thread boB Rudis
I upgraded ffox to the 46-series and intermittently received the same
error. But by adding a `Sys.sleep(1)` to the final `if`:

  if ((i %% 10) == 0) {
    ref <- remDr$findElements("xpath", ".//a[.='...']")
    ref[[length(ref)]]$clickElement()
    Sys.sleep(1)
  }

I was able to reproduce my original, successful outcome. I think it
has something to do with the page not being fully loaded when the
driver tries to get the page content. Go multithreading! My choice of
1s was arbitrary; a longer pause gives a better chance of it working.

This would probably also be better (waiting for a full page load signal),
but I try not to use [R]Selenium at all if it can be helped.
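
Something along these lines (an untested sketch; the helper name and the
assumption that at least three <table> elements signal a loaded page are
mine) could replace the fixed sleep with a poll for the content:

  wait_for_tables <- function(remDr, n = 3, timeout = 10) {
    # poll every 100 ms until at least `n` <table> elements exist, or give up
    for (k in seq_len(timeout * 10)) {
      if (length(remDr$findElements("css selector", "table")) >= n)
        return(invisible(TRUE))
      Sys.sleep(0.1)
    }
    warning("page did not appear to finish loading within ", timeout, "s")
    invisible(FALSE)
  }

Calling wait_for_tables(remDr) right after each clickElement() would then
stand in for the Sys.sleep(1).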

-Bob



On Wed, May 11, 2016 at 2:00 PM, boB Rudis  wrote:
> Hey David,
>
> I'm on a Mac as well but have never had to tweak anything to get
> [R]Selenium to work (but this is one reason I try to avoid solutions
> involving RSelenium as they are pretty fragile IMO).
>
> The site itself has "Página 1 de 69" at the top, which is where I got
> the "69" from and I just re-ran the code in a 100% clean env (on a
> completely different Mac) and it worked fine.
>
> I did neglect to put my session info up before (apologies):
>
> Session info
> 
>  setting  value
>  version  R version 3.3.0 RC (2016-05-01 r70572)
>  system   x86_64, darwin13.4.0
>  ui   RStudio (0.99.1172)
>  language (EN)
>  collate  en_US.UTF-8
>  tz   America/New_York
>  date 2016-05-11
>
> Packages 
> 
>  package    * version  date       source
>  assertthat   0.1      2013-12-06 CRAN (R 3.3.0)
>  bitops     * 1.0-6    2013-08-17 CRAN (R 3.3.0)
>  caTools      1.17.1   2014-09-10 CRAN (R 3.3.0)
>  DBI          0.4      2016-05-02 CRAN (R 3.3.0)
>  devtools   * 1.11.1   2016-04-21 CRAN (R 3.3.0)
>  digest       0.6.9    2016-01-08 CRAN (R 3.3.0)
>  dplyr      * 0.4.3    2015-09-01 CRAN (R 3.3.0)
>  httr         1.1.0    2016-01-28 CRAN (R 3.3.0)
>  magrittr     1.5      2014-11-22 CRAN (R 3.3.0)
>  memoise      1.0.0    2016-01-29 CRAN (R 3.3.0)
>  pbapply    * 1.2-1    2016-04-19 CRAN (R 3.3.0)
>  R6           2.1.2    2016-01-26 CRAN (R 3.3.0)
>  Rcpp         0.12.4   2016-03-26 CRAN (R 3.3.0)
>  RCurl      * 1.95-4.8 2016-03-01 CRAN (R 3.3.0)
>  RJSONIO    * 1.3-0    2014-07-28 CRAN (R 3.3.0)
>  RSelenium  * 1.3.5    2014-10-26 CRAN (R 3.3.0)
>  rvest      * 0.3.1    2015-11-11 CRAN (R 3.3.0)
>  selectr      0.2-3    2014-12-24 CRAN (R 3.3.0)
>  stringi      1.0-1    2015-10-22 CRAN (R 3.3.0)
>  stringr      1.0.0    2015-04-30 CRAN (R 3.3.0)
>  withr        1.0.1    2016-02-04 CRAN (R 3.3.0)
>  XML        * 3.98-1.4 2016-03-01 CRAN (R 3.3.0)
>  xml2       * 0.1.2    2015-09-01 CRAN (R 3.3.0)
>
> (and, wow, does that tiny snippet of code end up using a lot of pkgs)
>
> I had actually started with smaller snippets to test. The code got
> uglier due to the way the site paginates (it loads 10-entries worth of
> data on to a single page but requires a server call for the next 10).
>
> I also keep firefox scarily out-of-date (back in the 33's rev) b/c I
> only use it with RSelenium (not a big fan of the browser). Let me
> update to the 46-series and see if I can replicate.
>
> -Bob
>
> On Wed, May 11, 2016 at 1:48 PM, David Winsemius  
> wrote:
>>
>>> On May 10, 2016, at 1:11 PM, boB Rudis  wrote:
>>>
>>> Unfortunately, it's a wretched, vile, SharePoint-based site. That
>>> means it doesn't use traditional encoding methods to do the pagination
>>> and one of the only ways to do this effectively is going to be to use
>>> RSelenium:
>>>
>>>library(RSelenium)
>>>library(rvest)
>>>library(dplyr)
>>>library(pbapply)
>>>
>>>URL <- "http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx"
>>>
>>>checkForServer()
>>>startServer()
>>>remDr <- remoteDriver$new()
>>>remDr$open()
>>
>> Thanks Bob/hrbrmstr;
>>
>> At this point I got an error:
>>
>>>startServer()
>>>remDr <- remoteDriver$new()
>>>remDr$open()
>> [1] "Connecting to remote server"
>> Undefined error in RCurl call.Error in queryRD(paste0(serverURL, 
>> "/session"), "POST", qdata = toJSON(serverOpts)) :
>>
>> Running R 3.0.0 on a Mac (El Cap) in the R.app GUI.
>> $ java -version
>> java version "1.8.0_65"
>> Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
>> Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)
>>
>> I asked myself: What additional information is needed to debug this? But 
>> then I thought I had a responsibility to search for earlier reports of this 
>> 

Re: [R] web scraping tables generated in multiple server pages

2016-05-11 Thread boB Rudis
Hey David,

I'm on a Mac as well but have never had to tweak anything to get
[R]Selenium to work (but this is one reason I try to avoid solutions
involving RSelenium as they are pretty fragile IMO).

The site itself has "Página 1 de 69" at the top, which is where I got
the "69" from and I just re-ran the code in a 100% clean env (on a
completely different Mac) and it worked fine.
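
If you'd rather not hard-code the 69, a quick (untested) sketch of pulling
it from that banner right after remDr$navigate(URL); the variable names and
the regex are just illustrative:

  pg0     <- read_html(remDr$getPageSource()[[1]])
  banner  <- html_text(pg0)
  n_pages <- as.integer(sub(".*de ", "",
               regmatches(banner, regexpr("Página 1 de [0-9]+", banner))))
  # then iterate over seq_len(n_pages) instead of 1:69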

I did neglect to put my session info up before (apologies):

Session info

 setting  value
 version  R version 3.3.0 RC (2016-05-01 r70572)
 system   x86_64, darwin13.4.0
 ui   RStudio (0.99.1172)
 language (EN)
 collate  en_US.UTF-8
 tz   America/New_York
 date 2016-05-11

Packages 

 package    * version  date       source
 assertthat   0.1      2013-12-06 CRAN (R 3.3.0)
 bitops     * 1.0-6    2013-08-17 CRAN (R 3.3.0)
 caTools      1.17.1   2014-09-10 CRAN (R 3.3.0)
 DBI          0.4      2016-05-02 CRAN (R 3.3.0)
 devtools   * 1.11.1   2016-04-21 CRAN (R 3.3.0)
 digest       0.6.9    2016-01-08 CRAN (R 3.3.0)
 dplyr      * 0.4.3    2015-09-01 CRAN (R 3.3.0)
 httr         1.1.0    2016-01-28 CRAN (R 3.3.0)
 magrittr     1.5      2014-11-22 CRAN (R 3.3.0)
 memoise      1.0.0    2016-01-29 CRAN (R 3.3.0)
 pbapply    * 1.2-1    2016-04-19 CRAN (R 3.3.0)
 R6           2.1.2    2016-01-26 CRAN (R 3.3.0)
 Rcpp         0.12.4   2016-03-26 CRAN (R 3.3.0)
 RCurl      * 1.95-4.8 2016-03-01 CRAN (R 3.3.0)
 RJSONIO    * 1.3-0    2014-07-28 CRAN (R 3.3.0)
 RSelenium  * 1.3.5    2014-10-26 CRAN (R 3.3.0)
 rvest      * 0.3.1    2015-11-11 CRAN (R 3.3.0)
 selectr      0.2-3    2014-12-24 CRAN (R 3.3.0)
 stringi      1.0-1    2015-10-22 CRAN (R 3.3.0)
 stringr      1.0.0    2015-04-30 CRAN (R 3.3.0)
 withr        1.0.1    2016-02-04 CRAN (R 3.3.0)
 XML        * 3.98-1.4 2016-03-01 CRAN (R 3.3.0)
 xml2       * 0.1.2    2015-09-01 CRAN (R 3.3.0)

(and, wow, does that tiny snippet of code end up using a lot of pkgs)

I had actually started with smaller snippets to test. The code got
uglier due to the way the site paginates (it loads 10-entries worth of
data on to a single page but requires a server call for the next 10).

I also keep firefox scarily out-of-date (back in the 33's rev) b/c I
only use it with RSelenium (not a big fan of the browser). Let me
update to the 46-series and see if I can replicate.

-Bob

On Wed, May 11, 2016 at 1:48 PM, David Winsemius  wrote:
>
>> On May 10, 2016, at 1:11 PM, boB Rudis  wrote:
>>
>> Unfortunately, it's a wretched, vile, SharePoint-based site. That
>> means it doesn't use traditional encoding methods to do the pagination
>> and one of the only ways to do this effectively is going to be to use
>> RSelenium:
>>
>>library(RSelenium)
>>library(rvest)
>>library(dplyr)
>>library(pbapply)
>>
>>URL <- "http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx"
>>
>>checkForServer()
>>startServer()
>>remDr <- remoteDriver$new()
>>remDr$open()
>
> Thanks Bob/hrbrmstr;
>
> At this point I got an error:
>
>>startServer()
>>remDr <- remoteDriver$new()
>>remDr$open()
> [1] "Connecting to remote server"
> Undefined error in RCurl call.Error in queryRD(paste0(serverURL, "/session"), 
> "POST", qdata = toJSON(serverOpts)) :
>
> Running R 3.0.0 on a Mac (El Cap) in the R.app GUI.
> $ java -version
> java version "1.8.0_65"
> Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)
>
> I asked myself: What additional information is needed to debug this? But then 
> I thought I had a responsibility to search for earlier reports of this error 
> on a Mac, and there were many. After reading this thread: 
> https://github.com/ropensci/RSelenium/issues/54  I decided to try creating an 
> "alias", mac-speak for a symlink, and put that symlink in my working 
> directory (with no further chmod security efforts). I restarted R and re-ran 
> the code which opened a Firefox browser window and then proceeded to page 
> through many pages. Eventually, however it errors out with this message:
>
>>pblapply(1:69, function(i) {
> +
> +  if (i %in% seq(1, 69, 10)) {
> +pg <- read_html(remDr$getPageSource()[[1]])
> +ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)
> +
> +  } else {
> +ref <- remDr$findElements("xpath",
> + sprintf(".//a[contains(@href, 'javascript:__doPostBack') and .='%s']",
> + i))
> +ref[[1]]$clickElement()
> +pg <- read_html(remDr$getPageSource()[[1]])
> +ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)
> +
> +  }
> +  if ((i %% 10) == 0) {

Re: [R] web scraping tables generated in multiple server pages

2016-05-11 Thread David Winsemius

> On May 10, 2016, at 1:11 PM, boB Rudis  wrote:
> 
> Unfortunately, it's a wretched, vile, SharePoint-based site. That
> means it doesn't use traditional encoding methods to do the pagination
> and one of the only ways to do this effectively is going to be to use
> RSelenium:
> 
>library(RSelenium)
>library(rvest)
>library(dplyr)
>library(pbapply)
> 
>URL <- "http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx"
> 
>checkForServer()
>startServer()
>remDr <- remoteDriver$new()
>remDr$open()

Thanks Bob/hrbrmstr;

At this point I got an error:

>startServer()
>remDr <- remoteDriver$new()
>remDr$open()
[1] "Connecting to remote server"
Undefined error in RCurl call.Error in queryRD(paste0(serverURL, "/session"), 
"POST", qdata = toJSON(serverOpts)) : 

Running R 3.0.0 on a Mac (El Cap) in the R.app GUI. 
$ java -version
java version "1.8.0_65"
Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)

I asked myself: What additional information is needed to debug this? But then I 
thought I had a responsibility to search for earlier reports of this error on a 
Mac, and there were many. After reading this thread: 
https://github.com/ropensci/RSelenium/issues/54  I decided to try creating an 
"alias", mac-speak for a symlink, and put that symlink in my working directory 
(with no further chmod security efforts). I restarted R and re-ran the code 
which opened a Firefox browser window and then proceeded to page through many 
pages. Eventually, however it errors out with this message:

>pblapply(1:69, function(i) {
+ 
+  if (i %in% seq(1, 69, 10)) {
+pg <- read_html(remDr$getPageSource()[[1]])
+ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)
+ 
+  } else {
+ref <- remDr$findElements("xpath",
+ sprintf(".//a[contains(@href, 'javascript:__doPostBack') and .='%s']",
+ i))
+ref[[1]]$clickElement()
+pg <- read_html(remDr$getPageSource()[[1]])
+ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)
+ 
+  }
+  if ((i %% 10) == 0) {
+ref <- remDr$findElements("xpath", ".//a[.='...']")
+ref[[length(ref)]]$clickElement()
+  }
+ 
+  ret
+ 
+}) -> tabs
  |+++   | 22% ~54s
Error in html_nodes(pg, "table")[[3]] : subscript out of bounds
> 
>final_dat <- bind_rows(tabs)
Error in bind_rows(tabs) : object 'tabs' not found


There doesn't seem to be any trace of objects from all the downloading efforts
that I could find. When I changed both instances of '69' to '30' it no longer
errors out. Is there supposed to be an initial step of finding out how many
pages are actually there before setting the two iteration limits? I'm also
wondering whether the code could be modified to return some intermediate
values that would be amenable to further assembly in the event of errors.
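
(Perhaps something along these lines, wrapping each iteration in tryCatch()
so that a page which errors out just yields NULL and the successful pages can
still be assembled afterwards? Untested, and merely rearranging the
navigation logic quoted below:)

tabs <- pblapply(1:69, function(i) {
  tryCatch({
    if (!i %in% seq(1, 69, 10)) {
      ref <- remDr$findElements("xpath",
        sprintf(".//a[contains(@href, 'javascript:__doPostBack') and .='%s']", i))
      ref[[1]]$clickElement()
    }
    pg  <- read_html(remDr$getPageSource()[[1]])
    ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)
    if ((i %% 10) == 0) {
      ref <- remDr$findElements("xpath", ".//a[.='...']")
      ref[[length(ref)]]$clickElement()
    }
    ret
  }, error = function(e) NULL)  # a failed page becomes NULL instead of stopping
})
final_dat <- bind_rows(Filter(Negate(is.null), tabs))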

Sincerely;
David.


>remDr$navigate(URL)
> 
>pblapply(1:69, function(i) {
> 
>  if (i %in% seq(1, 69, 10)) {
> 
># the first item on the page is not a link but we can just grab the 
> page
> 
>pg <- read_html(remDr$getPageSource()[[1]])
>ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)
> 
>  } else {
> 
># we can get the rest of them by the link text directly
> 
>ref <- remDr$findElements("xpath",
> sprintf(".//a[contains(@href, 'javascript:__doPostBack') and .='%s']",
> i))
>ref[[1]]$clickElement()
>pg <- read_html(remDr$getPageSource()[[1]])
>ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)
> 
>  }
> 
>  # we have to move to the next actual page of data after every 10 links
> 
>  if ((i %% 10) == 0) {
>ref <- remDr$findElements("xpath", ".//a[.='...']")
>ref[[length(ref)]]$clickElement()
>  }
> 
>  ret
> 
>}) -> tabs
> 
>final_dat <- bind_rows(tabs)
>final_dat <- final_dat[, c(1, 2, 5, 7, 8, 13, 14)] # the cols you want
>final_dat <- final_dat[complete.cases(final_dat),] # take care of NAs
> 
>remDr$quit()
> 
> 
> Prbly good ref code to have around, but you can grab the data & code
> here: https://gist.github.com/hrbrmstr/ec35ebb32c3cf0aba95f7bad28df1e98
> 
> (anything to help a fellow parent out :-)
> 
> -Bob
> 
> On Tue, May 10, 2016 at 2:45 PM, Michael Friendly  wrote:
>> This is my first attempt to try R web scraping tools, for a project my
>> daughter is working on.  It concerns a data base of projects in Sao
>> Paulo, Brazil, listed at
>> http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx,
>> but spread out over 69 pages accessed through a javascript menu at the
>> bottom of the page.
>> 
>> Each web page contains 3 HTML tables, of which only the last contains
>> the relevant data.  In this, only a subset of 

Re: [R] web scraping tables generated in multiple server pages / Best of R-help

2016-05-11 Thread Michael Friendly
On 5/10/2016 4:11 PM, boB Rudis wrote:
> Unfortunately, it's a wretched, vile, SharePoint-based site. That
> means it doesn't use traditional encoding methods to do the pagination
> and one of the only ways to do this effectively is going to be to use
> RSelenium:
>
R-help is not Stack Exchange, where people get "reputation" points for good
answers, and R-help often sees a lot of unhelpful and sometimes unkind
answers. So, when someone is exceptionally helpful, it is worth acknowledging
it in public, as I do now, with my "Best of R-help" award to Bob Rudis.

Not only did he point me to RSelenium, but he wrote a complete solution
to the problem, and gave me the generated data on a github link.
It was slick, and I learned a lot from it.

best,
-Michael

-- 
Michael Friendly Email: friendly AT yorku DOT ca
Professor, Psychology Dept. & Chair, Quantitative Methods
York University  Voice: 416 736-2100 x66249 Fax: 416 736-5814
4700 Keele StreetWeb:http://www.datavis.ca
Toronto, ONT  M3J 1P3 CANADA




Re: [R] web scraping tables generated in multiple server pages

2016-05-10 Thread boB Rudis
Unfortunately, it's a wretched, vile, SharePoint-based site. That
means it doesn't use traditional encoding methods to do the pagination
and one of the only ways to do this effectively is going to be to use
RSelenium:

library(RSelenium)
library(rvest)
library(dplyr)
library(pbapply)

URL <- "http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx"

checkForServer()
startServer()
remDr <- remoteDriver$new()
remDr$open()

remDr$navigate(URL)

pblapply(1:69, function(i) {

  if (i %in% seq(1, 69, 10)) {

    # the first item on the page is not a link but we can just grab the page

    pg  <- read_html(remDr$getPageSource()[[1]])
    ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)

  } else {

    # we can get the rest of them by the link text directly

    ref <- remDr$findElements("xpath",
      sprintf(".//a[contains(@href, 'javascript:__doPostBack') and .='%s']", i))
    ref[[1]]$clickElement()
    pg  <- read_html(remDr$getPageSource()[[1]])
    ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)

  }

  # we have to move to the next actual page of data after every 10 links

  if ((i %% 10) == 0) {
    ref <- remDr$findElements("xpath", ".//a[.='...']")
    ref[[length(ref)]]$clickElement()
  }

  ret

}) -> tabs

final_dat <- bind_rows(tabs)
final_dat <- final_dat[, c(1, 2, 5, 7, 8, 13, 14)] # the cols you want
final_dat <- final_dat[complete.cases(final_dat),] # take care of NAs

remDr$quit()


Probably good ref code to have around, but you can grab the data & code
here: https://gist.github.com/hrbrmstr/ec35ebb32c3cf0aba95f7bad28df1e98

(anything to help a fellow parent out :-)

-Bob

On Tue, May 10, 2016 at 2:45 PM, Michael Friendly  wrote:
> This is my first attempt to try R web scraping tools, for a project my
> daughter is working on.  It concerns a data base of projects in Sao
> Paulo, Brazil, listed at
> http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx,
> but spread out over 69 pages accessed through a javascript menu at the
> bottom of the page.
>
> Each web page contains 3 HTML tables, of which only the last contains
> the relevant data.  In this, only a subset of columns are of interest.
> I tried using the XML package as illustrated on several tutorial pages,
> as shown below.  I have no idea how to automate this to extract these
> tables from multiple web pages.  Is there some other package better
> suited to this task?  Can someone help me solve this and other issues?
>
> # Goal: read the data tables contained on 69 pages generated by the link
> below, where
> # each page is generated by a javascript link in the menu of the bottom
> of the page.
> #
> # Each "page" contains 3 html tables, with names "Table 1", "Table 2",
> and the only one
> # of interest with the data, "grdRelSitGeralProcessos"
> #
> # From each such table, extract the following columns:
> #- Processo
> #- Endereço
> #- Distrito
> #- Area terreno (m2)
> #- Valor contrapartida ($)
> #- Area excedente (m2)
>
> # NB: All of the numeric fields use "." as the thousands separator and
> #   "," as the decimal separator, but because of this they are read in
> #   as character
>
>
> library(XML)
> link <- "http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx"
>
> saopaulo <- htmlParse(link)
> saopaulo.tables <- readHTMLTable(saopaulo, stringsAsFactors = FALSE)
> length(saopaulo.tables)
>
> # it's the third table on this page we want
> sp.tab <- saopaulo.tables[[3]]
>
> # columns wanted
> wanted <- c(1, 2, 5, 7, 8, 13, 14)
> head(sp.tab[, wanted])
>
>  > head(sp.tab[, wanted])
>Proposta Processo EndereçoDistrito
> 11 2002-0.148.242-4 R. DOMINGOS LOPES DA SILVA X R. CORNÉLIO
> VAN CLEVEVILA ANDRADE
> 22 2003-0.129.667-3  AV. DR. JOSÉ HIGINO,
> 200 E 216   AGUA RASA
> 33 2003-0.065.011-2   R. ALIANÇA LIBERAL,
> 980 E 990 VILA LEOPOLDINA
> 44 2003-0.165.806-0   R. ALIANÇA LIBERAL,
> 880 E 886 VILA LEOPOLDINA
> 55 2003-0.139.053-0R. DR. JOSÉ DE ANDRADE
> FIGUEIRA, 111VILA ANDRADE
> 66 2003-0.200.692-0R. JOSÉ DE
> JESUS, 66  VILA SONIA
>Área Terreno (m2) Área Excedente (m2) Valor Contrapartida (R$)
> 1   0,00 1.551,14 127.875,98
> 2   0,00 3.552,13 267.075,77
> 3   0,00   624,99 70.212,93
> 4   0,00   395,64 44.447,18
> 5   0,00   719,68 41.764,46
> 6   0,00   446,52 85.152,92
>
> thanks,
>
>
> --
> Michael Friendly Email: friendly AT yorku DOT ca
> Professor, Psychology Dept. & Chair, Quantitative Methods
> York University  Voice: 416 736-2100 x66249 

Re: [R] web scraping tables generated in multiple server pages

2016-05-10 Thread Marco Silva
Excerpts from Michael Friendly's message of 2016-05-10 14:45:28 -0400:
> This is my first attempt to try R web scraping tools, for a project my 
> daughter is working on.  It concerns a data base of projects in Sao 
> Paulo, Brazil, listed at 
> http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx,
>  
> but spread out over 69 pages accessed through a javascript menu at the 
> bottom of the page.
> 
> Each web page contains 3 HTML tables, of which only the last contains 
> the relevant data.  In this, only a subset of columns are of interest.  
> I tried using the XML package as illustrated on several tutorial pages, 
> as shown below.  I have no idea how to automate this to extract these 
> tables from multiple web pages.  Is there some other package better 
> suited to this task?  Can someone help me solve this and other issues?
> 
> # Goal: read the data tables contained on 69 pages generated by the link 
> below, where
> # each page is generated by a javascript link in the menu of the bottom 
> of the page.
> #
> # Each "page" contains 3 html tables, with names "Table 1", "Table 2", 
> and the only one
> # of interest with the data, "grdRelSitGeralProcessos"
> #
> # From each such table, extract the following columns:
> #- Processo
> #- Endereço
> #- Distrito
> #- Area terreno (m2)
> #- Valor contrapartida ($)
> #- Area excedente (m2)
> 
> # NB: All of the numeric fields use "." as the thousands separator and
> #   "," as the decimal separator, but because of this they are read in
> #   as character
> 
> 
> library(XML)
> link <- "http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx"
> 
> saopaulo <- htmlParse(link)
> saopaulo.tables <- readHTMLTable(saopaulo, stringsAsFactors = FALSE)
> length(saopaulo.tables)
> 
> # it's the third table on this page we want
> sp.tab <- saopaulo.tables[[3]]
> 
> # columns wanted
> wanted <- c(1, 2, 5, 7, 8, 13, 14)
> head(sp.tab[, wanted])
> 
>  > head(sp.tab[, wanted])
>Proposta Processo EndereçoDistrito
> 11 2002-0.148.242-4 R. DOMINGOS LOPES DA SILVA X R. CORNÉLIO 
> VAN CLEVEVILA ANDRADE
> 22 2003-0.129.667-3  AV. DR. JOSÉ HIGINO, 
> 200 E 216   AGUA RASA
> 33 2003-0.065.011-2   R. ALIANÇA LIBERAL, 
> 980 E 990 VILA LEOPOLDINA
> 44 2003-0.165.806-0   R. ALIANÇA LIBERAL, 
> 880 E 886 VILA LEOPOLDINA
> 55 2003-0.139.053-0R. DR. JOSÉ DE ANDRADE 
> FIGUEIRA, 111VILA ANDRADE
> 66 2003-0.200.692-0R. JOSÉ DE 
> JESUS, 66  VILA SONIA
>Área Terreno (m2) Área Excedente (m2) Valor Contrapartida (R$)
> 1   0,00 1.551,14 127.875,98
> 2   0,00 3.552,13 267.075,77
> 3   0,00   624,99 70.212,93
> 4   0,00   395,64 44.447,18
> 5   0,00   719,68 41.764,46
> 6   0,00   446,52 85.152,92
> 
> thanks,
> 
> 
> -- 
> Michael Friendly Email: friendly AT yorku DOT ca
> Professor, Psychology Dept. & Chair, Quantitative Methods
> York University  Voice: 416 736-2100 x66249 Fax: 416 736-5814
> 4700 Keele StreetWeb:http://www.datavis.ca
> Toronto, ONT  M3J 1P3 CANADA
> 
> 
# what you are missing:
?gsub

# alias the columns of interest
df <- sp.tab[, wanted]

# convert one Brazilian-formatted column to double
as.double(                 # convert to double
  gsub(',', '.',           # make ',' the decimal point
    gsub('\\.', '', df$"Área Excedente (m2)")))  # drop the thousands '.'

You can easily put the names of the columns in a vector and use lapply() on
them to convert all of them in the same manner; that is left as an exercise.
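
For completeness, one (untested) way to do that exercise, assuming the column
names match the table shown above:

num_cols <- c('Área Terreno (m2)', 'Área Excedente (m2)',
              'Valor Contrapartida (R$)')
df[num_cols] <- lapply(df[num_cols], function(x)
  as.double(gsub(',', '.', gsub('\\.', '', x))))  # drop '.', then ',' -> '.'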


-- 
Marco Arthur @ (M)arco Creatives


[R] web scraping tables generated in multiple server pages

2016-05-10 Thread Michael Friendly
This is my first attempt to try R web scraping tools, for a project my 
daughter is working on.  It concerns a data base of projects in Sao 
Paulo, Brazil, listed at 
http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx,
 
but spread out over 69 pages accessed through a javascript menu at the 
bottom of the page.

Each web page contains 3 HTML tables, of which only the last contains 
the relevant data.  In this, only a subset of columns are of interest.  
I tried using the XML package as illustrated on several tutorial pages, 
as shown below.  I have no idea how to automate this to extract these 
tables from multiple web pages.  Is there some other package better 
suited to this task?  Can someone help me solve this and other issues?

# Goal: read the data tables contained on 69 pages generated by the link 
below, where
# each page is generated by a javascript link in the menu of the bottom 
of the page.
#
# Each "page" contains 3 html tables, with names "Table 1", "Table 2", 
and the only one
# of interest with the data, "grdRelSitGeralProcessos"
#
# From each such table, extract the following columns:
#- Processo
#- Endereço
#- Distrito
#- Area terreno (m2)
#- Valor contrapartida ($)
#- Area excedente (m2)

# NB: All of the numeric fields use "." as the thousands separator and ","
#   as the decimal separator, but because of this they are read in as
#   character


library(XML)
link <- "http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx"

saopaulo <- htmlParse(link)
saopaulo.tables <- readHTMLTable(saopaulo, stringsAsFactors = FALSE)
length(saopaulo.tables)

# it's the third table on this page we want
sp.tab <- saopaulo.tables[[3]]

# columns wanted
wanted <- c(1, 2, 5, 7, 8, 13, 14)
head(sp.tab[, wanted])

 > head(sp.tab[, wanted])
   Proposta Processo EndereçoDistrito
11 2002-0.148.242-4 R. DOMINGOS LOPES DA SILVA X R. CORNÉLIO 
VAN CLEVEVILA ANDRADE
22 2003-0.129.667-3  AV. DR. JOSÉ HIGINO, 
200 E 216   AGUA RASA
33 2003-0.065.011-2   R. ALIANÇA LIBERAL, 
980 E 990 VILA LEOPOLDINA
44 2003-0.165.806-0   R. ALIANÇA LIBERAL, 
880 E 886 VILA LEOPOLDINA
55 2003-0.139.053-0R. DR. JOSÉ DE ANDRADE 
FIGUEIRA, 111VILA ANDRADE
66 2003-0.200.692-0R. JOSÉ DE 
JESUS, 66  VILA SONIA
   Área Terreno (m2) Área Excedente (m2) Valor Contrapartida (R$)
1   0,00 1.551,14 127.875,98
2   0,00 3.552,13 267.075,77
3   0,00   624,99 70.212,93
4   0,00   395,64 44.447,18
5   0,00   719,68 41.764,46
6   0,00   446,52 85.152,92

thanks,


-- 
Michael Friendly Email: friendly AT yorku DOT ca
Professor, Psychology Dept. & Chair, Quantitative Methods
York University  Voice: 416 736-2100 x66249 Fax: 416 736-5814
4700 Keele StreetWeb:http://www.datavis.ca
Toronto, ONT  M3J 1P3 CANADA

