Re: [R] Web-scraping newbie - dynamic table into R?

2020-04-21 Thread Ivan Krylov
On Sun, 19 Apr 2020 at 22:34, Julio Farach  wrote:

> But, I'm seeking the last 10 draws shown on the "Winning Numbers," or
> 4th tab.

The "Network" tab in browser developer tools (usually accessible by
pressing F12) demonstrates that the "Winning Numbers" are fetched in
JSON format by means of an XHR from
.

The server checks the User-Agent: header and returns a 403 error to
clients that don't look like browsers, which probably means that the
website's ToS forbids programmatic access.
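
For the record, a minimal sketch (untested, and only worth doing if the
site's terms of service actually allow it) of how such a JSON endpoint
could be fetched with httr plus jsonlite; the URL below is a placeholder
for whatever XHR address the Network tab shows, not the real one:

library(httr)
library(jsonlite)

# placeholder endpoint -- substitute the XHR URL seen in the Network tab
endpoint <- "https://example.invalid/winning-numbers.json"

resp <- GET(endpoint,
            user_agent("Mozilla/5.0 (X11; Linux x86_64; rv:75.0) Firefox/75.0"))
stop_for_status(resp)

draws <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
str(draws)  # inspect the structure to locate the last 10 draws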

-- 
Best regards,
Ivan

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Web-scraping newbie - dynamic table into R?

2020-04-21 Thread John Kane
Hi Julio,

I am just working on my first cup of tea of the morning so I am not
functioning all that well but I finally noticed that we have dropped the
R-help list.  I have put it back as a recipient as there are a lot of
people that know about 99%+ more than I do about the topic.

I'll keep poking around and see what I can find.

On Sun, 19 Apr 2020 at 22:34, Julio Farach  wrote:

> John,
>
> I again thank you for the reply and continued support.  After a few hours,
> I arrived at the point you describe below; namely extracting elements, but
> from a different tab than the Last 10 Draws, or Winning Numbers tab.
>
> On the website, there are 5 tabs.  The elements you describe below are
> from the 3rd tab, "Odds & Prizes."  Instead of results, that tab describes
> the general odds of the Keno game.  But, I'm seeking the last 10 draws
> shown on the "Winning Numbers," or 4th tab.  I've played around with a CSS
> Selector tool, but I'm unable to extract any details (e.g., a draw number
> or Keno number) from the 4th tab.  I could extract elements of other tabs,
> like you did below, from the 3rd tab.
>
> Please let me know if you learn more or if you have other ideas for me to
> consider.
>
> Regards,
> Julio
>
> On Sun, Apr 19, 2020 at 7:00 PM John Kane  wrote:
>
>> I am a complete newbie too, but try this
>> library(rvest)
>>Kenopage <- "
>> https://www.galottery.com/en-us/games/draw-games/keno.html#tab-winningNumbers
>> "
>>
>> Keno <- read_html(Kenopage)
>>
>> tt  <-  html_table(Keno, fill= TRUE)
>>
>> This should give you a list with 10 elements, each of which should be a
>> data.frame
>> Example
>>
>> ken1  <-  tt[[1]]
>> str(ken1)
>>
>> > str(ken1)
>> 'data.frame': 12 obs. of  4 variables:
>>  $ Numbers Matched : chr  "10" "9" "8" "7" ...
>>  $ Base Keno! Prize: chr  "$100,000*" "$5,000" "$500" "$50" ...
>>  $ + Bulls-Eye Prize   : chr  "$200,000*" "$20,000" "$1,500" "$100"
>> ...
>>  $ Keno! w/ Bulls-Eye Prize: chr  "$300,000" "$25,000" "$2,000" "$150" ...
>> >
>>
>> I figured this out a little while ago and just manually stepped through
>> the data.frames to get what I wanted. Brute force and stupidity, but it
>> worked.
>>
>> Someday I may figure out how to use things like SelectorGadget!
>>
>>
>>
>>
>> On Sun, 19 Apr 2020 at 17:46, Julio Farach  wrote:
>>
>>> John - I corrected my email below for typos.
>>>
>>> On Sun, Apr 19, 2020 at 5:42 PM Julio Farach  wrote:
>>>
 John,

 Yes, while I can execute the line of code that I provided, I am still
 unable to capture the table shown in the browser.  The last 10 draws are
 shown in a table if you view the page:

 https://www.galottery.com/en-us/games/draw-games/keno.html#tab-winningNumbers


 But, despite using CSS and XPath combinations of
 >html_nodes(x, CSS or XPath)
 I am unable to copy that table into R.

 One commenter on another forum received an error and suggested that
 perhaps bots lack permission to access the page.  But, I've used the
 Robotstxt package to ensure that bots are indeed permitted.

 Any thoughts?

 Regards,
 Julio

 On Sun, Apr 19, 2020 at 4:38 PM John Kane  wrote:

> Keno <- read_html(Kenopage) ?
>
> Or Am I misunderstanding the problem?
>
> On Sun, 19 Apr 2020 at 15:10, Julio Farach  wrote:
>
>> How do I scrape the last 10 Keno draws from the Georgia lottery into
>> R?
>>
>>
>> I'm trying to pull the last 10 draws of a Keno lottery game into R.
>> I've
>> read several tutorials on how to scrape websites using the rvest
>> package,
>> Chrome's Inspect Element, and CSS or XPath, but I'm likely stuck
>> because
>> the table I seek is dynamically generated using Javascript.
>>
>>
>>
>> I started with:
>>
>> >install.packages("rvest")
>>
>> >   library(rvest)
>>
>> >Kenopage <- "
>>
>> https://www.galottery.com/en-us/games/draw-games/keno.html#tab-winningNumbers
>> "
>>
>> > Keno <- Read.hmtl(Kenopage)
>>
>> From there, I've been unable to progress, despite hours spent on
>> combinations of CSS and XPath calls with "html_nodes."
>>
>> Failed example: DrawNumber <- Keno %>% rvest::html_nodes("body") %>%
>> xml2::xml_find_all("//span[contains(@class,'Draw Number')]") %>%
>> rvest::html_text()
>>
>>
>>
>> Someone mentioned using the V8 package in R, but it's new to me.
>>
>> How do I get started?
>>
>> --
>>
>> Julio Farach
>> https://www.linkedin.com/in/farach
>> cell phone:  804/363-2161
>> email:  jfar...@gmail.com
>>
>> [[alternative HTML version deleted]]
>>
>> __
>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide

Re: [R] Web-scraping newbie - dynamic table into R?

2020-04-19 Thread John Kane
Keno <- read_html(Kenopage) ?

Or Am I misunderstanding the problem?

On Sun, 19 Apr 2020 at 15:10, Julio Farach  wrote:

> How do I scrape the last 10 Keno draws from the Georgia lottery into R?
>
>
> I'm trying to pull the last 10 draws of a Keno lottery game into R.  I've
> read several tutorials on how to scrape websites using the rvest package,
> Chrome's Inspect Element, and CSS or XPath, but I'm likely stuck because
> the table I seek is dynamically generated using Javascript.
>
>
>
> I started with:
>
> >install.packages("rvest")
>
> >   library(rvest)
>
> >Kenopage <- "
>
> https://www.galottery.com/en-us/games/draw-games/keno.html#tab-winningNumbers
> "
>
> > Keno <- Read.hmtl(Kenopage)
>
> From there, I've been unable to progress, despite hours spent on
> combinations of CSS and XPath calls with "html_nodes."
>
> Failed example: DrawNumber <- Keno %>% rvest::html_nodes("body") %>%
> xml2::xml_find_all("//span[contains(@class,'Draw Number')]") %>%
> rvest::html_text()
>
>
>
> Someone mentioned using the V8 package in R, but it's new to me.
>
> How do I get started?
>
> --
>
> Julio Farach
> https://www.linkedin.com/in/farach
> cell phone:  804/363-2161
> email:  jfar...@gmail.com
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


-- 
John Kane
Kingston ON Canada

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Web-scraping newbie - dynamic table into R?

2020-04-19 Thread Jeff Newmiller
Web-scraping is not a common topic here, but one point that does come up is to 
be sure you are conforming with the website terms of use before getting in too 
deep.

Another bit of advice is to look for the underlying API... that is usually more 
performant than scraping anyway. Try using the developer tools in Chrome to 
find out how they are populating the page for clues, or just Google it.

Finally, you might try the RSelenium package. I don't have first hand 
experience with it but it is reputed to be designed to scrape dynamic web pages.
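
A minimal RSelenium skeleton along those lines (a sketch, untested here;
it assumes a working local browser driver, and which of the rendered
page's tables actually holds the draws is something you would still have
to inspect):

library(RSelenium)
library(rvest)

url <- "https://www.galottery.com/en-us/games/draw-games/keno.html#tab-winningNumbers"

drv <- rsDriver(browser = "firefox", verbose = FALSE)  # starts a local Selenium server
remDr <- drv$client
remDr$navigate(url)
Sys.sleep(5)                                 # give the Javascript time to render the tab

pg <- read_html(remDr$getPageSource()[[1]])  # the rendered DOM, not the raw HTML
tabs <- html_table(pg, fill = TRUE)          # then hunt through tabs for the draws table

remDr$close()
drv$server$stop()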

On April 18, 2020 1:50:02 PM PDT, Julio Farach  wrote:
>How do I scrape the last 10 Keno draws from the Georgia lottery into R?
>
>
>I'm trying to pull the last 10 draws of a Keno lottery game into R. 
>I've
>read several tutorials on how to scrape websites using the rvest
>package,
>Chrome's Inspect Element, and CSS or XPath, but I'm likely stuck
>because
>the table I seek is dynamically generated using Javascript.
>
>
>
>I started with:
>
>>install.packages("rvest")
>
>>   library(rvest)
>
>>Kenopage <- "
>https://www.galottery.com/en-us/games/draw-games/keno.html#tab-winningNumbers
>"
>
>> Keno <- Read.hmtl(Kenopage)
>
>From there, I've been unable to progress, despite hours spent on
>combinations of CSS and XPath calls with "html_nodes."
>
>Failed example: DrawNumber <- Keno %>% rvest::html_nodes("body") %>%
>xml2::xml_find_all("//span[contains(@class,'Draw Number')]") %>%
>rvest::html_text()
>
>
>
>Someone mentioned using the V8 package in R, but it's new to me.
>
>How do I get started?

-- 
Sent from my phone. Please excuse my brevity.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Web-scraping newbie - dynamic table into R?

2020-04-19 Thread Julio Farach
How do I scrape the last 10 Keno draws from the Georgia lottery into R?


I'm trying to pull the last 10 draws of a Keno lottery game into R.  I've
read several tutorials on how to scrape websites using the rvest package,
Chrome's Inspect Element, and CSS or XPath, but I'm likely stuck because
the table I seek is dynamically generated using Javascript.



I started with:

>install.packages("rvest")

>   library(rvest)

>Kenopage <- "
https://www.galottery.com/en-us/games/draw-games/keno.html#tab-winningNumbers
"

> Keno <- Read.hmtl(Kenopage)

From there, I've been unable to progress, despite hours spent on
combinations of CSS and XPath calls with "html_nodes."

Failed example: DrawNumber <- Keno %>% rvest::html_nodes("body") %>%
xml2::xml_find_all("//span[contains(@class,'Draw Number')]") %>%
rvest::html_text()



Someone mentioned using the V8 package in R, but it's new to me.

How do I get started?

-- 

Julio Farach
https://www.linkedin.com/in/farach
cell phone:  804/363-2161
email:  jfar...@gmail.com

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R web-scraping a multiple-level page

2019-04-10 Thread Chris Evans



- Original Message -
> From: "Boris Steipe" 
> To: "Ilio Fornasero" 
> Cc: r-help@r-project.org
> Sent: Wednesday, 10 April, 2019 12:34:15
> Subject: Re: [R] R web-scraping a multiple-level page

[snip]
 
> (2) Restrict the condition with a maximum number of cycles. More often than 
> not
> assumptions about the world turn out to be overly rational.

Brilliant!! Fortune nomination?

And the advice was useful to me too though I'm not the OQ.

Thanks,

Chris

-- 
Chris Evans  Skype: chris-psyctc
Visiting Professor, University of Sheffield 
I do some consultation work for the University of Roehampton 
 and other places but this  
remains my main Email address.
I have "semigrated" to France, see: 
https://www.psyctc.org/pelerinage2016/semigrating-to-france/ if you want to 
book to talk, I am trying to keep that to Thursdays and my diary is now 
available at: https://www.psyctc.org/pelerinage2016/ecwd_calendar/calendar/
Beware: French time, generally an hour ahead of UK.  That page will also take 
you to my blog which started with earlier joys in France and Spain!

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R web-scraping a multiple-level page

2019-04-10 Thread Boris Steipe
For similar tasks I usually write a while loop operating on a queue. 
Conceptually:

initialize queue with first page
add first url to harvested urls

while queue not empty (2)
  unshift url from queue
  collect valid child pages that are not already in harvested list (1)
  add to harvested list
  add to queue

process all harvested pages



(1) - grep for the base url so you don't leave the site
- use %in% to ensure you are not caught in a cycle

(2) Restrict the condition with a maximum number of cycles. More often than not 
assumptions about the world turn out to be overly rational.
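
A rough R translation of that outline, in case it helps (untested; the
link handling and the cycle limit are only illustrative):

library(xml2)
library(rvest)

base_url   <- "http://www.fao.org"
queue      <- paste0(base_url, "/countryprofiles/en/")  # initialize queue with first page
harvested  <- queue                                     # add first url to harvested urls
max_cycles <- 500                                       # (2) hard stop on the number of cycles
cycle <- 0

while (length(queue) > 0 && cycle < max_cycles) {
  cycle <- cycle + 1
  url   <- queue[1]                                     # unshift url from queue
  queue <- queue[-1]

  pg <- tryCatch(read_html(url), error = function(e) NULL)
  if (is.null(pg)) next

  links <- html_attr(html_nodes(pg, "a"), "href")
  links <- links[!is.na(links)]
  links <- ifelse(startsWith(links, "/"), paste0(base_url, links), links)  # absolutise relative hrefs
  links <- links[startsWith(links, base_url)]           # (1) stay on the site
  new   <- setdiff(links, harvested)                    # (1) skip anything already harvested

  harvested <- c(harvested, new)                        # add to harvested list
  queue     <- c(queue, new)                            # add to queue
}

# then process all harvested pages, e.g. pull the "News" block out of each one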

Hope this helps,
B.




> On 2019-04-10, at 04:35, Ilio Fornasero  wrote:
> 
> Hello.
> 
> I am trying to scrape a FAO webpage including multiple links from any of 
> which I would like to collect the "News" part.
> 
> Yet, I have done this:
> 
> fao_base = 'http://www.fao.org'
> fao_second_level = paste0(fao_base, '/countryprofiles/en/')
> 
> all_children = read_html(fao_second_level) %>%
>  html_nodes(xpath = '//a[contains(@href, "?iso3=")]/@href') %>%
>  html_text %>% paste0(fao_base, .)
> 
> Any suggestion on how to go on? I guess with a loop but I didn't have any 
> success, yet.
> Thanks
> 
>   [[alternative HTML version deleted]]
> 
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Web scraping different levels of a website

2018-01-22 Thread Ilio Fornasero
Thanks again, David.

I am trying to figure out a way to convert the lists into a data.frame.

Any hint?

The usual ways (do.call, etc) do not seem to work...
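
(For reference, one pattern that often works here, assuming the scraped
pieces are plain character vectors -- the object x is the one from the
quoted code below, and the names are only illustrative:)

urls   <- x %>% html_nodes(".survey-row") %>% html_attr("data-url")
titles <- x %>% html_nodes("h2 a") %>% html_text()

studies <- data.frame(title = titles, url = urls, stringsAsFactors = FALSE)

# or, for a list of per-page character vectors:
# studies <- do.call(rbind, lapply(lst, function(v)
#              data.frame(url = v, stringsAsFactors = FALSE)))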

Thanks

Ilio


From: David Jankoski <david.janko...@hellotrip.nl>
Sent: Friday, 19 January 2018 15:58
To: iliofornas...@hotmail.com; r-help@r-project.org
Subject: Re: [R] Web scraping different levels of a website

Hey Ilio,

I revisited the previous code i posted to you and fixed some things.
This should let you collect as many studies as you like, controlled by
the num_studies arg.

If you try the below url in your browser you can see that it returns a
"simpler" version of the link you posted. To get to this you need to
hit F12 to open Developer Tools --> go to Network tab and click on the
first entry in the list --> in the right pane you should see under the
Headers tab the Request URL.

I'm not very knowledgable in sessions/cookies and what nots - but it
might be that you face some further problems. In which case you could
try to do the above on your side and then copy paste that url that you
find there in the below code. I broke the url in smaller chunks for
readability and because its easier to substitute some query
paramaters.

# load libs
library("rvest")
library("httr")
library("glue")
library("magrittr")

# number of studies to pull from catalogue
num_studies <- 42
year_from <- 1890
year_to <- 2017

# build up the url
url <-
  glue(
"http://catalog.ihsn.org/index.php/catalog/;,
"search?view=s&",
"ps={num_studies}&",
"page=1=_ref==&_r===&",
"from={year_from}&",
"to={year_to}&",
"sort_order=_by=nation&_=1516371984886")

# read in the html
x <-
  url %>%
  GET() %>%
  content()

# option 1 (div with class "survey-row" --> data-url attribute)
x %>%
  html_nodes(".survey-row") %>%
  html_attr("data-url")

# option 2 (studies titles are <a> within <h2> elems)
# note that this give you some more information like the title ...
x %>%
  html_nodes("h2 a")


greetings,
david

On 18 January 2018 at 12:58, David Jankoski <david.janko...@hellotrip.nl> wrote:
>
> Hey Ilio,
>
> On the main website (the first link that you provided) if you
> right-click on the title of any entry and select Inspect Element from
> the menu, you will notice in the Developer Tools view that opens up
> that the corresponding html looks like this
>
> (example for the same link that you provided)
>
>  data-url="http://catalog.ihsn.org/index.php/catalog/7118; title="View
Afghanistan - Demographic and Health Survey 
2015<http://catalog.ihsn.org/index.php/catalog/7118>
catalog.ihsn.org
Author(s) Central Statistics Organization, Ansari Watt, Kabul, Afghanistan 
Ministry of Public Health, Wazir Akbar Khan, Kabul, Afghanistan The DHS 
Program, ICF ...


> study">
> 
> 
> http://catalog.ihsn.org/index.php/catalog/7118;
Afghanistan - Demographic and Health Survey 
2015<http://catalog.ihsn.org/index.php/catalog/7118>
catalog.ihsn.org
Author(s) Central Statistics Organization, Ansari Watt, Kabul, Afghanistan 
Ministry of Public Health, Wazir Akbar Khan, Kabul, Afghanistan The DHS 
Program, ICF ...


> title="Demographic and Health Survey 2015">
>   Demographic and Health Survey 2015
> 
>   
>
> Notice how the number you are after is contained within the
> "survey-row" div element, in the data-url attribute. Or alternatively
> within the <a> elem, in the href attribute. It's up to you which
> one you want to grab but the idea would be the same i.e.
>
> 1. read in the html
> 2. select all list-elements by css / xpath
> 3. grab the fwd link
>
> Here is an example using the first option.
>
> url <- 
> "http://catalog.ihsn.org/index.php/catalog#_r=1890=1=100==_by=nation_order==2017==s=;
IHSN Survey 
Catalog<http://catalog.ihsn.org/index.php/catalog#_r=1890=1=100==_by=nation_order==2017==s=>
catalog.ihsn.org
By: Central Statistics Organization - Government of the Islamic Republic of 
Afghanistan, United Nations Children�s Fund


>
> x <-
>   url %>%
>   GET() %>%
>   content()
>
> x %>%
>   html_nodes(".survey-row") %>%
>   html_attr("data-url")
>
> hth.
> david




--

David Jankoski

Teerketelsteeg 1
1012TB Amsterdam
www.hellotrip.com

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Web scraping different levels of a website

2018-01-19 Thread David Jankoski
Hey Ilio,

I revisited the previous code i posted to you and fixed some things.
This should let you collect as many studies as you like, controlled by
the num_studies arg.

If you try the below url in your browser you can see that it returns a
"simpler" version of the link you posted. To get to this you need to
hit F12 to open Developer Tools --> go to Network tab and click on the
first entry in the list --> in the right pane you should see under the
Headers tab the Request URL.

I'm not very knowledgable in sessions/cookies and what nots - but it
might be that you face some further problems. In which case you could
try to do the above on your side and then copy paste that url that you
find there in the below code. I broke the url in smaller chunks for
readability and because its easier to substitute some query
paramaters.

# load libs
library("rvest")
library("httr")
library("glue")
library("magrittr")

# number of studies to pull from catalogue
num_studies <- 42
year_from <- 1890
year_to <- 2017

# build up the url
url <-
  glue(
"http://catalog.ihsn.org/index.php/catalog/;,
"search?view=s&",
"ps={num_studies}&",
"page=1=_ref==&_r===&",
"from={year_from}&",
"to={year_to}&",
"sort_order=_by=nation&_=1516371984886")

# read in the html
x <-
  url %>%
  GET() %>%
  content()

# option 1 (div with class "survey-row" --> data-url attribute)
x %>%
  html_nodes(".survey-row") %>%
  html_attr("data-url")

# option 2 (studies titles are <a> within <h2> elems)
# note that this give you some more information like the title ...
x %>%
  html_nodes("h2 a")


greetings,
david

On 18 January 2018 at 12:58, David Jankoski  wrote:
>
> Hey Ilio,
>
> On the main website (the first link that you provided) if you
> right-click on the title of any entry and select Inspect Element from
> the menu, you will notice in the Developer Tools view that opens up
> that the corresponding html looks like this
>
> (example for the same link that you provided)
>
>  data-url="http://catalog.ihsn.org/index.php/catalog/7118; title="View
> study">
> 
> 
> http://catalog.ihsn.org/index.php/catalog/7118;
> title="Demographic and Health Survey 2015">
>   Demographic and Health Survey 2015
> 
>   
>
> Notice how the number you are after is contained within the
> "survey-row" div element, in the data-url attribute. Or alternatively
> within the <a> elem, in the href attribute. It's up to you which
> one you want to grab but the idea would be the same i.e.
>
> 1. read in the html
> 2. select all list-elements by css / xpath
> 3. grab the fwd link
>
> Here is an example using the first option.
>
> url <- 
> "http://catalog.ihsn.org/index.php/catalog#_r=1890=1=100==_by=nation_order==2017==s=;
>
> x <-
>   url %>%
>   GET() %>%
>   content()
>
> x %>%
>   html_nodes(".survey-row") %>%
>   html_attr("data-url")
>
> hth.
> david




-- 

David Jankoski

Teerketelsteeg 1
1012TB Amsterdam
www.hellotrip.com

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Web scraping different levels of a website

2018-01-18 Thread David Jankoski
Hey Ilio,

On the main website (the first link that you provided) if you
right-click on the title of any entry and select Inspect Element from
the menu, you will notice in the Developer Tools view that opens up
that the corresponding html looks like this

(example for the same link that you provided)

http://catalog.ihsn.org/index.php/catalog/7118; title="View
study">


http://catalog.ihsn.org/index.php/catalog/7118;
title="Demographic and Health Survey 2015">
  Demographic and Health Survey 2015

  

Notice how the number you are after is contained within the
"survey-row" div element, in the data-url attribute. Or alternatively
within the <a> elem, in the href attribute. It's up to you which
one you want to grab but the idea would be the same i.e.

1. read in the html
2. select all list-elements by css / xpath
3. grab the fwd link

Here is an example using the first option.

url <- 
"http://catalog.ihsn.org/index.php/catalog#_r=1890=1=100==_by=nation_order==2017==s=;

x <-
  url %>%
  GET() %>%
  content()

x %>%
  html_nodes(".survey-row") %>%
  html_attr("data-url")

hth.
david

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Web scraping different levels of a website

2018-01-18 Thread Ilio Fornasero
I am web scraping a page at

http://catalog.ihsn.org/index.php/catalog#_r=1890=1=100==_by=nation_order==2017==s=

From this url, I have built up a dataframe through the following code:

dflist <- map(.x = 1:417, .f = function(x) {
 Sys.sleep(5)
 url <- 
("http://catalog.ihsn.org/index.php/catalog#_r=1890=1=100==_by=nation_order==2017==s=;)
read_html(url) %>%
html_nodes(".title a") %>%
html_text() %>%
as.data.frame()
}) %>% do.call(rbind, .)

I have repeated the same code in order to get all the data I was interested in,
and it seems to work perfectly, although it is of course a little slow due to
the Sys.sleep() calls.

My issue arose once I tried to scrape the individual project descriptions that
should be included in the dataframe.

For instance, the first project description is at

http://catalog.ihsn.org/index.php/catalog/7118/study-description

the second project description is at

http://catalog.ihsn.org/index.php/catalog/6606/study-description

and so forth.

My problem is that I can't find a dynamic way to scrape all the projects' pages
and insert them in the data frame, since the number in the URLs is neither
progressive nor at the end of the link.

To make things clearer, this is the structure of the website I am scraping:

1.http://catalog.ihsn.org/index.php/catalog#_r=1890=1=100==_by=nation_order==2017==s=
   1.1.   http://catalog.ihsn.org/index.php/catalog/7118
1.1.a http://catalog.ihsn.org/index.php/catalog/7118/related_materials
1.1.b http://catalog.ihsn.org/index.php/catalog/7118/study-description
1.1.c. http://catalog.ihsn.org/index.php/catalog/7118/data_dictionary

I have successfully scraped level 1 but not level 1.1.b (study-description),
the one I am interested in, since the dynamic element of the URL (in this
case: 7118) is not consistent across the website's more than 6000 pages at
that level.
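
For reference, a minimal sketch of how the data-url idea from the replies
above can be turned into dynamic study-description URLs (untested; the
search URL follows the replies and the description selector is only a
guess):

library(rvest)

search_pg  <- read_html("http://catalog.ihsn.org/index.php/catalog/search?view=s&ps=100")
study_urls <- html_attr(html_nodes(search_pg, ".survey-row"), "data-url")

desc_urls <- paste0(study_urls, "/study-description")  # e.g. .../catalog/7118/study-description

descriptions <- lapply(desc_urls, function(u) {
  Sys.sleep(2)                                # be polite to the server
  pg <- read_html(u)
  html_text(html_nodes(pg, ".xsl-block"))     # selector is a guess -- inspect one page first
})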


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Web scraping - Having trouble figuring out how to approach this problem

2017-02-23 Thread Jeff Newmiller
The answer is yes, and does not seem like a big step from where you are now, so 
seeing what you already know how to do (reproducible example, or RE) would help 
focus the assistance. There are quite a few ways to do this kind of thing, and 
what you already know would be clarified with a RE.
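
To illustrate the general shape of it, a rough, untested sketch -- the CSS
selectors are assumptions and the infobox parsing in particular will need
adjusting against the real pages:

library(rvest)

list_url <- "https://simple.wikipedia.org/wiki/List_of_countries_by_continents"
pg <- read_html(list_url)

# country links on the list page (selector is an assumption)
links <- html_nodes(pg, "li a")
countries <- data.frame(
  ID  = html_text(links),
  url = paste0("https://simple.wikipedia.org", html_attr(links, "href")),
  stringsAsFactors = FALSE)

get_country <- function(u) {
  cp   <- read_html(u)
  info <- html_text(html_nodes(cp, ".infobox tr"))  # infobox rows, if present
  data.frame(
    Language   = grep("Official language", info, value = TRUE)[1],
    Population = grep("Population", info, value = TRUE)[1],
    stringsAsFactors = FALSE)
}

# details <- do.call(rbind, lapply(head(countries$url, 5), get_country))
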
-- 
Sent from my phone. Please excuse my brevity.

On February 22, 2017 2:52:55 PM PST, henrique monte 
 wrote:
>Sometimes I need to get some data from the web, organizing it into a
>dataframe, and I waste a lot of time doing it manually. I've been trying
>to figure out how to optimize this process, and I've tried some R
>scraping approaches, but couldn't get them to work right and thought
>there could be an easier way to do this. Can anyone help me out with
>this?
>
>Fictional example:
>
>Here's a webpage with countries listed by continents:
>https://simple.wikipedia.org/wiki/List_of_countries_by_continents
>
>Each country name is also a link that leads to another webpage
>(specific of
>each country, e.g. https://simple.wikipedia.org/wiki/Angola).
>
>I would like as a final result to get a data frame with number of
>observations (rows) = number of countries listed and 4 variables
>(colums)
>as ID=Country Name, Continent=Continent it belongs to,
>Language=Official
>language (from the specific webpage of the Countries) and Population =
>most
>recent population count (from the specific webpage of the Countries).
>
>...
>
>The main issue I'm trying to figure out is handling several webpages:
>would it be possible to scrape, from the first link of the problem, the
>countries as a list together with the links to each country's webpage,
>and then create and run a function that runs a scraping command on each
>of those links to get the specific data I'm looking for?
>
>   [[alternative HTML version deleted]]
>
>__
>R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide
>http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Web scraping - Having trouble figuring out how to approach this problem

2017-02-23 Thread henrique monte
Sometimes I need to get some data from the web, organizing it into a
dataframe, and I waste a lot of time doing it manually. I've been trying to
figure out how to optimize this process, and I've tried some R scraping
approaches, but couldn't get them to work right and thought there could be
an easier way to do this. Can anyone help me out with this?

Fictional example:

Here's a webpage with countries listed by continents:
https://simple.wikipedia.org/wiki/List_of_countries_by_continents

Each country name is also a link that leads to another webpage (specific of
each country, e.g. https://simple.wikipedia.org/wiki/Angola).

I would like as a final result to get a data frame with number of
observations (rows) = number of countries listed and 4 variables (colums)
as ID=Country Name, Continent=Continent it belongs to, Language=Official
language (from the specific webpage of the Countries) and Population = most
recent population count (from the specific webpage of the Countries).

...

The main issue I'm trying to figure out is handling several webpages: would
it be possible to scrape, from the first link of the problem, the countries
as a list together with the links to each country's webpage, and then create
and run a function that runs a scraping command on each of those links to
get the specific data I'm looking for?

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] web scraping tables generated in multiple server pages

2016-05-11 Thread boB Rudis
to remote server"
>> Undefined error in RCurl call.Error in queryRD(paste0(serverURL, 
>> "/session"), "POST", qdata = toJSON(serverOpts)) :
>>
>> Running R 3.0.0 on a Mac (El Cap) in the R.app GUI.
>> $ java -version
>> java version "1.8.0_65"
>> Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
>> Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)
>>
>> I asked myself: What additional information is needed to debug this? But 
>> then I thought I had a responsibility to search for earlier reports of this 
>> error on a Mac, and there were many. After reading this thread: 
>> https://github.com/ropensci/RSelenium/issues/54  I decided to try creating 
>> an "alias", mac-speak for a symlink, and put that symlink in my working 
>> directory (with no further chmod security efforts). I restarted R and re-ran 
>> the code which opened a Firefox browser window and then proceeded to page 
>> through many pages. Eventually, however it errors out with this message:
>>
>>>pblapply(1:69, function(i) {
>> +
>> +  if (i %in% seq(1, 69, 10)) {
>> +pg <- read_html(remDr$getPageSource()[[1]])
>> +ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)
>> +
>> +  } else {
>> +ref <- remDr$findElements("xpath",
>> + sprintf(".//a[contains(@href, 'javascript:__doPostBack') and .='%s']",
>> + i))
>> +ref[[1]]$clickElement()
>> +pg <- read_html(remDr$getPageSource()[[1]])
>> +ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)
>> +
>> +  }
>> +  if ((i %% 10) == 0) {
>> +ref <- remDr$findElements("xpath", ".//a[.='...']")
>> +ref[[length(ref)]]$clickElement()
>> +  }
>> +
>> +  ret
>> +
>> +}) -> tabs
>>|+++   | 22% ~54s  
>> Error in html_nodes(pg, "table")[[3]] : subscript out of bounds
>>>
>>>final_dat <- bind_rows(tabs)
>> Error in bind_rows(tabs) : object 'tabs' not found
>>
>>
>> There doesn't seem to be any trace of objects from all the downloading 
>> efforts that I could find. When I changed both instances of '69' to '30' it 
>> no longer errors out. Is there supposed to be an initial step of finding out 
>> how many pages are actually there befor setting the two iteration limits? 
>> I'm wondering if that code could be modified to return some intermediate 
>> values that would be amenable to further assembly efforts in the event of 
>> errors?
>>
>> Sincerely;
>> David.
>>
>>
>>>remDr$navigate(URL)
>>>
>>>pblapply(1:69, function(i) {
>>>
>>>  if (i %in% seq(1, 69, 10)) {
>>>
>>># the first item on the page is not a link but we can just grab the 
>>> page
>>>
>>>pg <- read_html(remDr$getPageSource()[[1]])
>>>ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)
>>>
>>>  } else {
>>>
>>># we can get the rest of them by the link text directly
>>>
>>>ref <- remDr$findElements("xpath",
>>> sprintf(".//a[contains(@href, 'javascript:__doPostBack') and .='%s']",
>>> i))
>>>ref[[1]]$clickElement()
>>>pg <- read_html(remDr$getPageSource()[[1]])
>>>ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)
>>>
>>>  }
>>>
>>>  # we have to move to the next actual page of data after every 10 links
>>>
>>>  if ((i %% 10) == 0) {
>>>ref <- remDr$findElements("xpath", ".//a[.='...']")
>>>ref[[length(ref)]]$clickElement()
>>>  }
>>>
>>>  ret
>>>
>>>}) -> tabs
>>>
>>>final_dat <- bind_rows(tabs)
>>>final_dat <- final_dat[, c(1, 2, 5, 7, 8, 13, 14)] # the cols you want
>>>final_dat <- final_dat[complete.cases(final_dat),] # take care of NAs
>>>
>>>remDr$quit()
>>>
>>>
>>> Prbly good ref code to have around, but you can grab the data & code
>>> here: https://gist.github.com/hrbrmstr/ec35ebb32c3cf0aba95f7bad28df1e98
>>>
>>> (anything to help a fellow parent out :-)
>>>
>>> -Bob
>>>
>>&g

Re: [R] web scraping tables generated in multiple server pages

2016-05-11 Thread boB Rudis
 } else {
> +ref <- remDr$findElements("xpath",
> + sprintf(".//a[contains(@href, 'javascript:__doPostBack') and .='%s']",
> + i))
> +ref[[1]]$clickElement()
> +pg <- read_html(remDr$getPageSource()[[1]])
> +ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)
> +
> +  }
> +  if ((i %% 10) == 0) {
> +ref <- remDr$findElements("xpath", ".//a[.='...']")
> +ref[[length(ref)]]$clickElement()
> +  }
> +
> +  ret
> +
> +}) -> tabs
>|+++   | 22% ~54s  
> Error in html_nodes(pg, "table")[[3]] : subscript out of bounds
>>
>>final_dat <- bind_rows(tabs)
> Error in bind_rows(tabs) : object 'tabs' not found
>
>
> There doesn't seem to be any trace of objects from all the downloading 
> efforts that I could find. When I changed both instances of '69' to '30' it 
> no longer errors out. Is there supposed to be an initial step of finding out 
> how many pages are actually there befor setting the two iteration limits? I'm 
> wondering if that code could be modified to return some intermediate values 
> that would be amenable to further assembly efforts in the event of errors?
>
> Sincerely;
> David.
>
>
>>remDr$navigate(URL)
>>
>>pblapply(1:69, function(i) {
>>
>>  if (i %in% seq(1, 69, 10)) {
>>
>># the first item on the page is not a link but we can just grab the 
>> page
>>
>>pg <- read_html(remDr$getPageSource()[[1]])
>>ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)
>>
>>  } else {
>>
>># we can get the rest of them by the link text directly
>>
>>ref <- remDr$findElements("xpath",
>> sprintf(".//a[contains(@href, 'javascript:__doPostBack') and .='%s']",
>> i))
>>ref[[1]]$clickElement()
>>pg <- read_html(remDr$getPageSource()[[1]])
>>ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)
>>
>>  }
>>
>>  # we have to move to the next actual page of data after every 10 links
>>
>>  if ((i %% 10) == 0) {
>>ref <- remDr$findElements("xpath", ".//a[.='...']")
>>ref[[length(ref)]]$clickElement()
>>  }
>>
>>  ret
>>
>>}) -> tabs
>>
>>final_dat <- bind_rows(tabs)
>>final_dat <- final_dat[, c(1, 2, 5, 7, 8, 13, 14)] # the cols you want
>>final_dat <- final_dat[complete.cases(final_dat),] # take care of NAs
>>
>>remDr$quit()
>>
>>
>> Prbly good ref code to have around, but you can grab the data & code
>> here: https://gist.github.com/hrbrmstr/ec35ebb32c3cf0aba95f7bad28df1e98
>>
>> (anything to help a fellow parent out :-)
>>
>> -Bob
>>
>> On Tue, May 10, 2016 at 2:45 PM, Michael Friendly <frien...@yorku.ca> wrote:
>>> This is my first attempt to try R web scraping tools, for a project my
>>> daughter is working on.  It concerns a data base of projects in Sao
>>> Paulo, Brazil, listed at
>>> http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx,
>>> but spread out over 69 pages accessed through a javascript menu at the
>>> bottom of the page.
>>>
>>> Each web page contains 3 HTML tables, of which only the last contains
>>> the relevant data.  In this, only a subset of columns are of interest.
>>> I tried using the XML package as illustrated on several tutorial pages,
>>> as shown below.  I have no idea how to automate this to extract these
>>> tables from multiple web pages.  Is there some other package better
>>> suited to this task?  Can someone help me solve this and other issues?
>>>
>>> # Goal: read the data tables contained on 69 pages generated by the link
>>> below, where
>>> # each page is generated by a javascript link in the menu of the bottom
>>> of the page.
>>> #
>>> # Each "page" contains 3 html tables, with names "Table 1", "Table 2",
>>> and the only one
>>> # of interest with the data, "grdRelSitGeralProcessos"
>>> #
>>> # From each such table, extract the following columns:
>>> #- Processo
>>> #- Endereço
>>> #- Distrito
>>> #- Area terreno (m2)
>>> #- Valor contrapartida ($)
>>> #- Area exceden

Re: [R] web scraping tables generated in multiple server pages

2016-05-11 Thread David Winsemius

> On May 10, 2016, at 1:11 PM, boB Rudis <b...@rudis.net> wrote:
> 
> Unfortunately, it's a wretched, vile, SharePoint-based site. That
> means it doesn't use traditional encoding methods to do the pagination
> and one of the only ways to do this effectively is going to be to use
> RSelenium:
> 
>library(RSelenium)
>library(rvest)
>library(dplyr)
>library(pbapply)
> 
>URL <- 
> "http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx;
> 
>checkForServer()
>startServer()
>remDr <- remoteDriver$new()
>remDr$open()

Thanks Bob/hrbrmstr;

At this point I got an error:

>startServer()
>remDr <- remoteDriver$new()
>remDr$open()
[1] "Connecting to remote server"
Undefined error in RCurl call.Error in queryRD(paste0(serverURL, "/session"), 
"POST", qdata = toJSON(serverOpts)) : 

Running R 3.0.0 on a Mac (El Cap) in the R.app GUI. 
$ java -version
java version "1.8.0_65"
Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)

I asked myself: What additional information is needed to debug this? But then I 
thought I had a responsibility to search for earlier reports of this error on a 
Mac, and there were many. After reading this thread: 
https://github.com/ropensci/RSelenium/issues/54  I decided to try creating an 
"alias", mac-speak for a symlink, and put that symlink in my working directory 
(with no further chmod security efforts). I restarted R and re-ran the code 
which opened a Firefox browser window and then proceeded to page through many 
pages. Eventually, however it errors out with this message:

>pblapply(1:69, function(i) {
+ 
+  if (i %in% seq(1, 69, 10)) {
+pg <- read_html(remDr$getPageSource()[[1]])
+ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)
+ 
+  } else {
+ref <- remDr$findElements("xpath",
+ sprintf(".//a[contains(@href, 'javascript:__doPostBack') and .='%s']",
+ i))
+ref[[1]]$clickElement()
+pg <- read_html(remDr$getPageSource()[[1]])
+ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)
+ 
+  }
+  if ((i %% 10) == 0) {
+ref <- remDr$findElements("xpath", ".//a[.='...']")
+ref[[length(ref)]]$clickElement()
+  }
+ 
+  ret
+ 
+}) -> tabs
   |+++   | 22% ~54s  Error 
in html_nodes(pg, "table")[[3]] : subscript out of bounds
> 
>final_dat <- bind_rows(tabs)
Error in bind_rows(tabs) : object 'tabs' not found


There doesn't seem to be any trace of objects from all the downloading efforts 
that I could find. When I changed both instances of '69' to '30' it no longer 
errors out. Is there supposed to be an initial step of finding out how many 
pages are actually there befor setting the two iteration limits? I'm wondering 
if that code could be modified to return some intermediate values that would be 
amenable to further assembly efforts in the event of errors?
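
(For what it's worth, a rough sketch of reading the pager first to get a
page count, reusing the already-open remDr session -- untested, and the
xpath is an assumption about the pager markup:)

pg <- read_html(remDr$getPageSource()[[1]])
page_links <- html_text(html_nodes(pg, xpath = ".//a[contains(@href, 'javascript:__doPostBack')]"))
n_pages <- max(suppressWarnings(as.integer(page_links)), na.rm = TRUE)  # highest numbered link visible

# n_pages could then replace the hard-coded 69 in the pblapply() call, though
# with this pager only a window of page numbers is visible at any one time.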

Sincerely;
David.


>remDr$navigate(URL)
> 
>pblapply(1:69, function(i) {
> 
>  if (i %in% seq(1, 69, 10)) {
> 
># the first item on the page is not a link but we can just grab the 
> page
> 
>pg <- read_html(remDr$getPageSource()[[1]])
>ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)
> 
>  } else {
> 
># we can get the rest of them by the link text directly
> 
>ref <- remDr$findElements("xpath",
> sprintf(".//a[contains(@href, 'javascript:__doPostBack') and .='%s']",
> i))
>ref[[1]]$clickElement()
>pg <- read_html(remDr$getPageSource()[[1]])
>ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)
> 
>  }
> 
>  # we have to move to the next actual page of data after every 10 links
> 
>  if ((i %% 10) == 0) {
>ref <- remDr$findElements("xpath", ".//a[.='...']")
>ref[[length(ref)]]$clickElement()
>  }
> 
>  ret
> 
>}) -> tabs
> 
>final_dat <- bind_rows(tabs)
>final_dat <- final_dat[, c(1, 2, 5, 7, 8, 13, 14)] # the cols you want
>final_dat <- final_dat[complete.cases(final_dat),] # take care of NAs
> 
>remDr$quit()
> 
> 
> Prbly good ref code to have around, but you can grab the data & code
> here: https://gist.github.com/hrbrmstr/ec35ebb32c3cf0aba95f7bad28df1e98
> 
> (anything to help a fellow parent out :-)
> 
> -Bob
> 
> On Tue, May 10, 2016 at 2:45 PM, Michael Friendly <frien...@yorku.ca&

Re: [R] web scraping tables generated in multiple server pages / Best of R-help

2016-05-11 Thread Michael Friendly
On 5/10/2016 4:11 PM, boB Rudis wrote:
> Unfortunately, it's a wretched, vile, SharePoint-based site. That
> means it doesn't use traditional encoding methods to do the pagination
> and one of the only ways to do this effectively is going to be to use
> RSelenium:
>
R-help is not stack exchange, where people get "reputation" points for 
good answers,
and R-help often sees a lot of unhelpful and sometimes unkind answers.
So, when someone is exceptionally helpful, it is worthwhile 
acknowledging it
in public, as I do now, with my "Best of R-help" award to Bob Rudis.

Not only did he point me to RSelenium, but he wrote a complete solution
to the problem, and gave me the generated data on a github link.
It was slick, and I learned a lot from it.

best,
-Michael

-- 
Michael Friendly Email: friendly AT yorku DOT ca
Professor, Psychology Dept. & Chair, Quantitative Methods
York University  Voice: 416 736-2100 x66249 Fax: 416 736-5814
4700 Keele StreetWeb:http://www.datavis.ca
Toronto, ONT  M3J 1P3 CANADA


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] web scraping tables generated in multiple server pages

2016-05-10 Thread boB Rudis
Unfortunately, it's a wretched, vile, SharePoint-based site. That
means it doesn't use traditional encoding methods to do the pagination
and one of the only ways to do this effectively is going to be to use
RSelenium:

library(RSelenium)
library(rvest)
library(dplyr)
library(pbapply)

URL <- 
"http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx;

checkForServer()
startServer()
remDr <- remoteDriver$new()
remDr$open()

remDr$navigate(URL)

pblapply(1:69, function(i) {

  if (i %in% seq(1, 69, 10)) {

# the first item on the page is not a link but we can just grab the page

pg <- read_html(remDr$getPageSource()[[1]])
ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)

  } else {

# we can get the rest of them by the link text directly

ref <- remDr$findElements("xpath",
sprintf(".//a[contains(@href, 'javascript:__doPostBack') and .='%s']",
i))
ref[[1]]$clickElement()
pg <- read_html(remDr$getPageSource()[[1]])
ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)

  }

  # we have to move to the next actual page of data after every 10 links

  if ((i %% 10) == 0) {
ref <- remDr$findElements("xpath", ".//a[.='...']")
ref[[length(ref)]]$clickElement()
  }

  ret

}) -> tabs

final_dat <- bind_rows(tabs)
final_dat <- final_dat[, c(1, 2, 5, 7, 8, 13, 14)] # the cols you want
final_dat <- final_dat[complete.cases(final_dat),] # take care of NAs

remDr$quit()


Prbly good ref code to have around, but you can grab the data & code
here: https://gist.github.com/hrbrmstr/ec35ebb32c3cf0aba95f7bad28df1e98

(anything to help a fellow parent out :-)

-Bob

On Tue, May 10, 2016 at 2:45 PM, Michael Friendly <frien...@yorku.ca> wrote:
> This is my first attempt to try R web scraping tools, for a project my
> daughter is working on.  It concerns a data base of projects in Sao
> Paulo, Brazil, listed at
> http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx,
> but spread out over 69 pages accessed through a javascript menu at the
> bottom of the page.
>
> Each web page contains 3 HTML tables, of which only the last contains
> the relevant data.  In this, only a subset of columns are of interest.
> I tried using the XML package as illustrated on several tutorial pages,
> as shown below.  I have no idea how to automate this to extract these
> tables from multiple web pages.  Is there some other package better
> suited to this task?  Can someone help me solve this and other issues?
>
> # Goal: read the data tables contained on 69 pages generated by the link
> below, where
> # each page is generated by a javascript link in the menu of the bottom
> of the page.
> #
> # Each "page" contains 3 html tables, with names "Table 1", "Table 2",
> and the only one
> # of interest with the data, "grdRelSitGeralProcessos"
> #
> # From each such table, extract the following columns:
> #- Processo
> #- Endereço
> #- Distrito
> #- Area terreno (m2)
> #- Valor contrapartida ($)
> #- Area excedente (m2)
>
> # NB: All of the numeric fields use "." as the thousands separator and "," as
> #   the decimal separator,
> #   but because of this are read in as character
>
>
> library(XML)
> link <-
> "http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx;
>
> saopaulo <- htmlParse(link)
> saopaulo.tables <- readHTMLTable(saopaulo, stringsAsFactors = FALSE)
> length(saopaulo.tables)
>
> # its the third table on this page we want
> sp.tab <- saopaulo.tables[[3]]
>
> # columns wanted
> wanted <- c(1, 2, 5, 7, 8, 13, 14)
> head(sp.tab[, wanted])
>
>  > head(sp.tab[, wanted])
>Proposta Processo EndereçoDistrito
> 11 2002-0.148.242-4 R. DOMINGOS LOPES DA SILVA X R. CORNÉLIO
> VAN CLEVEVILA ANDRADE
> 22 2003-0.129.667-3  AV. DR. JOSÉ HIGINO,
> 200 E 216   AGUA RASA
> 33 2003-0.065.011-2   R. ALIANÇA LIBERAL,
> 980 E 990 VILA LEOPOLDINA
> 44 2003-0.165.806-0   R. ALIANÇA LIBERAL,
> 880 E 886 VILA LEOPOLDINA
> 55 2003-0.139.053-0R. DR. JOSÉ DE ANDRADE
> FIGUEIRA, 111VILA ANDRADE
> 66 2003-0.200.692-0R. JOSÉ DE
> JESUS, 66  VILA SONIA
>Área Terreno (m2) Área Excedente (m2) Valor Contrapartida (R$)
> 1   0,00 1.551,14 127.875,98
> 2   0,00 3.552,13 267.075,77
> 3 

Re: [R] web scraping tables generated in multiple server pages

2016-05-10 Thread Marco Silva
Excerpts from Michael Friendly's message of 2016-05-10 14:45:28 -0400:
> This is my first attempt to try R web scraping tools, for a project my 
> daughter is working on.  It concerns a data base of projects in Sao 
> Paulo, Brazil, listed at 
> http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx,
>  
> but spread out over 69 pages accessed through a javascript menu at the 
> bottom of the page.
> 
> Each web page contains 3 HTML tables, of which only the last contains 
> the relevant data.  In this, only a subset of columns are of interest.  
> I tried using the XML package as illustrated on several tutorial pages, 
> as shown below.  I have no idea how to automate this to extract these 
> tables from multiple web pages.  Is there some other package better 
> suited to this task?  Can someone help me solve this and other issues?
> 
> # Goal: read the data tables contained on 69 pages generated by the link 
> below, where
> # each page is generated by a javascript link in the menu of the bottom 
> of the page.
> #
> # Each "page" contains 3 html tables, with names "Table 1", "Table 2", 
> and the only one
> # of interest with the data, "grdRelSitGeralProcessos"
> #
> # From each such table, extract the following columns:
> #- Processo
> #- Endereço
> #- Distrito
> #- Area terreno (m2)
> #- Valor contrapartida ($)
> #- Area excedente (m2)
> 
> # NB: All of the numeric fields use "." as the thousands separator and "," as
> #   the decimal separator,
> #   but because of this are read in as character
> 
> 
> library(XML)
> link <- 
> "http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx;
> 
> saopaulo <- htmlParse(link)
> saopaulo.tables <- readHTMLTable(saopaulo, stringsAsFactors = FALSE)
> length(saopaulo.tables)
> 
> # its the third table on this page we want
> sp.tab <- saopaulo.tables[[3]]
> 
> # columns wanted
> wanted <- c(1, 2, 5, 7, 8, 13, 14)
> head(sp.tab[, wanted])
> 
>  > head(sp.tab[, wanted])
>Proposta Processo EndereçoDistrito
> 11 2002-0.148.242-4 R. DOMINGOS LOPES DA SILVA X R. CORNÉLIO 
> VAN CLEVEVILA ANDRADE
> 22 2003-0.129.667-3  AV. DR. JOSÉ HIGINO, 
> 200 E 216   AGUA RASA
> 33 2003-0.065.011-2   R. ALIANÇA LIBERAL, 
> 980 E 990 VILA LEOPOLDINA
> 44 2003-0.165.806-0   R. ALIANÇA LIBERAL, 
> 880 E 886 VILA LEOPOLDINA
> 55 2003-0.139.053-0R. DR. JOSÉ DE ANDRADE 
> FIGUEIRA, 111VILA ANDRADE
> 66 2003-0.200.692-0R. JOSÉ DE 
> JESUS, 66  VILA SONIA
>Área Terreno (m2) Área Excedente (m2) Valor Contrapartida (R$)
> 1   0,00 1.551,14 127.875,98
> 2   0,00 3.552,13 267.075,77
> 3   0,00   624,99 70.212,93
> 4   0,00   395,64 44.447,18
> 5   0,00   719,68 41.764,46
> 6   0,00   446,52 85.152,92
> 
> thanks,
> 
> 
> -- 
> Michael Friendly Email: friendly AT yorku DOT ca
> Professor, Psychology Dept. & Chair, Quantitative Methods
> York University  Voice: 416 736-2100 x66249 Fax: 416 736-5814
> 4700 Keele StreetWeb:http://www.datavis.ca
> Toronto, ONT  M3J 1P3 CANADA
> 
> 
# what is missing to you
?gsub
# aliasing
df <- sp.tab[, wanted]

# convert to double
as.double(                # convert to double
  gsub(',', '.',          # make the ',' become '.'
    gsub('\\.', '', df$"Área Excedente (m2)")))  # get rid of the thousands dot first

You can easily put the names of the columns in a vector and use lapply on them
to convert all of them in the same manner; that is left as an exercise.
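
A sketch of that exercise (untested; column names as printed above):

num_cols <- c("Área Terreno (m2)", "Área Excedente (m2)", "Valor Contrapartida (R$)")
df[num_cols] <- lapply(df[num_cols], function(x)
  as.double(gsub(',', '.', gsub('\\.', '', x))))  # drop thousands dots, then ',' -> '.'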


-- 
Marco Arthur @ (M)arco Creatives

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[R] web scraping tables generated in multiple server pages

2016-05-10 Thread Michael Friendly
This is my first attempt to try R web scraping tools, for a project my 
daughter is working on.  It concerns a data base of projects in Sao 
Paulo, Brazil, listed at 
http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx,
 
but spread out over 69 pages accessed through a javascript menu at the 
bottom of the page.

Each web page contains 3 HTML tables, of which only the last contains 
the relevant data.  In this, only a subset of columns are of interest.  
I tried using the XML package as illustrated on several tutorial pages, 
as shown below.  I have no idea how to automate this to extract these 
tables from multiple web pages.  Is there some other package better 
suited to this task?  Can someone help me solve this and other issues?

# Goal: read the data tables contained on 69 pages generated by the link 
below, where
# each page is generated by a javascript link in the menu of the bottom 
of the page.
#
# Each "page" contains 3 html tables, with names "Table 1", "Table 2", 
and the only one
# of interest with the data, "grdRelSitGeralProcessos"
#
# From each such table, extract the following columns:
#- Processo
#- Endereço
#- Distrito
#- Area terreno (m2)
#- Valor contrapartida ($)
#- Area excedente (m2)

# NB: All of the numeric fields use "." as the thousands separator and "," as
#   the decimal separator,
#   but because of this are read in as character


library(XML)
link <- 
"http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx;

saopaulo <- htmlParse(link)
saopaulo.tables <- readHTMLTable(saopaulo, stringsAsFactors = FALSE)
length(saopaulo.tables)

# its the third table on this page we want
sp.tab <- saopaulo.tables[[3]]

# columns wanted
wanted <- c(1, 2, 5, 7, 8, 13, 14)
head(sp.tab[, wanted])

 > head(sp.tab[, wanted])
   Proposta Processo EndereçoDistrito
11 2002-0.148.242-4 R. DOMINGOS LOPES DA SILVA X R. CORNÉLIO 
VAN CLEVEVILA ANDRADE
22 2003-0.129.667-3  AV. DR. JOSÉ HIGINO, 
200 E 216   AGUA RASA
33 2003-0.065.011-2   R. ALIANÇA LIBERAL, 
980 E 990 VILA LEOPOLDINA
44 2003-0.165.806-0   R. ALIANÇA LIBERAL, 
880 E 886 VILA LEOPOLDINA
55 2003-0.139.053-0R. DR. JOSÉ DE ANDRADE 
FIGUEIRA, 111VILA ANDRADE
66 2003-0.200.692-0R. JOSÉ DE 
JESUS, 66  VILA SONIA
   Área Terreno (m2) Área Excedente (m2) Valor Contrapartida (R$)
1   0,00 1.551,14 127.875,98
2   0,00 3.552,13 267.075,77
3   0,00   624,99 70.212,93
4   0,00   395,64 44.447,18
5   0,00   719,68 41.764,46
6   0,00   446,52 85.152,92

thanks,


-- 
Michael Friendly Email: friendly AT yorku DOT ca
Professor, Psychology Dept. & Chair, Quantitative Methods
York University  Voice: 416 736-2100 x66249 Fax: 416 736-5814
4700 Keele StreetWeb:http://www.datavis.ca
Toronto, ONT  M3J 1P3 CANADA


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] web scraping image

2015-06-09 Thread boB Rudis
You can also do it with rvest + httr (but that does involve some parsing):

library(httr)
library(rvest)

url <- "http://nwis.waterdata.usgs.gov/nwis/peak?site_no=12144500&agency_cd=USGS&format=img"
html(url) %>%
  html_nodes("img") %>%
  html_attr("src") %>%
  paste0("http://nwis.waterdata.usgs.gov", .) %>%
  GET(write_disk("12144500.gif")) -> status

Very readable and can be made programmatic pretty easily, too. Plus:
avoids direct use of the XML library. Future versions will no doubt
swap xml2 for XML as well.
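
e.g., wrapped up for reuse over several gauges (same idea, just
parameterized; a sketch, untested):

get_peak_img <- function(site_no, dir = ".") {
  url <- sprintf(
    "http://nwis.waterdata.usgs.gov/nwis/peak?site_no=%s&agency_cd=USGS&format=img",
    site_no)
  html(url) %>%
    html_nodes("img") %>%
    html_attr("src") %>%
    paste0("http://nwis.waterdata.usgs.gov", .) %>%
    GET(write_disk(file.path(dir, paste0(site_no, ".gif")), overwrite = TRUE))
}

# lapply(c("12142000", "12134500", "12149000"), get_peak_img)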

-Bob


On Mon, Jun 8, 2015 at 2:09 PM, Curtis DeGasperi
curtis.degasp...@gmail.com wrote:
 Thanks to Jim's prompting, I think I came up with a fairly painless way to
 parse the HTML without having to write any parsing code myself using the
 function getHTMLExternalFiles in the XML package. A working version of the
 code follows:

 ## Code to process USGS peak flow data

 require(dataRetrieval)
 require(XML)

 ## Need to start with list of gauge ids to process

 siteno <- c('12142000','12134500','12149000')

 lstas <- length(siteno) # length of locator list

 print(paste('Processsing...',siteno[1],' ',siteno[1], sep = ''))

 datall <- readNWISpeak(siteno[1])

 for (a in 2:lstas) {
   # Print station being processed
   print(paste('Processsing...',siteno[a], sep = ''))

   dat <- readNWISpeak(siteno[a])

   datall <- rbind(datall,dat)

 }

 write.csv(datall, file = "usgs_peaks.csv")

 # Retrieve ascii text files and graphics
 for (a in 1:lstas) {

   print(paste('Processsing...',siteno[a], sep = ''))

   graphic.url <-
 paste('http://nwis.waterdata.usgs.gov/nwis/peak?site_no=',siteno[a],'&agency_cd=USGS&format=img',
 sep = '')
   usgs.img <- getHTMLExternalFiles(graphic.url)
   graphic.img <- paste('http://nwis.waterdata.usgs.gov',usgs.img, sep = '')

   peakfq.url <-
 paste('http://nwis.waterdata.usgs.gov/nwis/peak?site_no=',siteno[a],'&agency_cd=USGS&format=hn2',
 sep = '')
   tab.url  <-
 paste('http://nwis.waterdata.usgs.gov/nwis/peak?site_no=',siteno[a],'&agency_cd=USGS&format=rdb',
 sep = '')

   graphic.fn <- paste('graphic_',siteno[a],'.gif', sep = '')
   peakfq.fn <- paste('peakfq_',siteno[a],'.txt', sep = '')
   tab.fn  <- paste('tab_',siteno[a],'.txt', sep = '')
   download.file(graphic.img,graphic.fn,mode='wb')
   download.file(peakfq.url,peakfq.fn)
   download.file(tab.url,tab.fn)
 }

 --

 Message: 34
 Date: Fri, 5 Jun 2015 08:59:04 +1000
 From: Jim Lemon drjimle...@gmail.com
 To: Curtis DeGasperi curtis.degasp...@gmail.com
 Cc: r-help mailing list r-help@r-project.org
 Subject: Re: [R] web scraping image
 Message-ID:
 
 ca+8x3fv0ajw+e22jayv1gfm6jr_tazua5fwgd3t_mfgfqy2...@mail.gmail.com
 Content-Type: text/plain; charset=UTF-8

 Hi Chris,
 I don't have the packages you are using, but tracing this indicates
 that the page source contains the relative path of the graphic, in
 this case:

 /nwisweb/data/img/USGS.12144500.19581112.20140309..0.peak.pres.gif

 and you already have the server URL:

 nwis.waterdata.usgs.gov

 getting the path out of the page source isn't difficult, just split
 the text at double quotes and get the token following img src=. If I
 understand the arguments of download.file correctly, the path is the
 graphic.fn argument and the server URL is the graphic.url argument. I
 would paste them together and display the result to make sure that it
 matches the image you want. When I did this, the correct image
 appeared in my browser. I'm using Google Chrome, so I don't have to
 prepend the http://

 Jim

 On Fri, Jun 5, 2015 at 2:31 AM, Curtis DeGasperi
 curtis.degasp...@gmail.com wrote:
 I'm working on a script that downloads data from the USGS NWIS server.
 dataRetrieval makes it easy to quickly get the data in a neat tabular
 format, but I was also interested in getting the tabular text files -
 also fairly easy for me using download.file.

 However, I'm not skilled enough to work out how to download the nice
 graphic files that can be produced dynamically from the USGS NWIS
 server (for example:

 http://nwis.waterdata.usgs.gov/nwis/peak?site_no=12144500agency_cd=USGSformat=img
 )

 My question is how do I get the image from this web page and save it
 to a local directory? scrapeR returns the information from the page
 and I suspect this is a possible solution path, but I don't know what
 the next step is.

 My code provided below works from a list I've created of USGS flow
 gauging stations.

 Curtis

 ## Code to process USGS daily flow data for high and low flow analysis
 ## Need to start with list of gauge ids to process
 ## Can't figure out how to automate download of images

 require(dataRetrieval)
 require(data.table)
 require(scrapeR)

 df - read.csv(usgs_stations.csv, header=TRUE)

 lstas -length(df$siteno) #length of locator list

 print(paste('Processsing...',df$name[1],' ',df$siteno[1], sep = ))

 datall -  readNWISpeak(df$siteno[1])

 for (a in 2:lstas) {
   # Print station being processed
   print(paste('Processsing...',df$name[a],' ',df$siteno[a], sep = ))

   dat

Re: [R] web scraping image

2015-06-08 Thread Curtis DeGasperi
Thanks to Jim's prompting, I think I came up with a fairly painless way to
parse the HTML without having to write any parsing code myself using the
function getHTMLExternalFiles in the XML package. A working version of the
code follows:

## Code to process USGS peak flow data

require(dataRetrieval)
require(XML)

## Need to start with list of gauge ids to process

siteno <- c('12142000','12134500','12149000')

lstas <- length(siteno) #length of locator list

print(paste('Processsing...',siteno[1],' ',siteno[1], sep = ""))

datall <-  readNWISpeak(siteno[1])

for (a in 2:lstas) {
  # Print station being processed
  print(paste('Processsing...',siteno[a], sep = ""))

  dat <-  readNWISpeak(siteno[a])

  datall <- rbind(datall,dat)

}

write.csv(datall, file = "usgs_peaks.csv")

# Retrieve ascii text files and graphics
for (a in 1:lstas) {

  print(paste('Processsing...',siteno[a], sep = ""))

  graphic.url <-
    paste('http://nwis.waterdata.usgs.gov/nwis/peak?site_no=',siteno[a],'&agency_cd=USGS&format=img',
          sep = "")
  usgs.img <- getHTMLExternalFiles(graphic.url)
  graphic.img <- paste('http://nwis.waterdata.usgs.gov',usgs.img, sep = "")

  peakfq.url <-
    paste('http://nwis.waterdata.usgs.gov/nwis/peak?site_no=',siteno[a],'&agency_cd=USGS&format=hn2',
          sep = "")
  tab.url <-
    paste('http://nwis.waterdata.usgs.gov/nwis/peak?site_no=',siteno[a],'&agency_cd=USGS&format=rdb',
          sep = "")

  graphic.fn <- paste('graphic_',siteno[a],'.gif', sep = "")
  peakfq.fn <- paste('peakfq_',siteno[a],'.txt', sep = "")
  tab.fn <- paste('tab_',siteno[a],'.txt', sep = "")
  download.file(graphic.img,graphic.fn,mode='wb')
  download.file(peakfq.url,peakfq.fn)
  download.file(tab.url,tab.fn)
}

 --

 Message: 34
 Date: Fri, 5 Jun 2015 08:59:04 +1000
 From: Jim Lemon drjimle...@gmail.com
 To: Curtis DeGasperi curtis.degasp...@gmail.com
 Cc: r-help mailing list r-help@r-project.org
 Subject: Re: [R] web scraping image
 Message-ID:
 
ca+8x3fv0ajw+e22jayv1gfm6jr_tazua5fwgd3t_mfgfqy2...@mail.gmail.com
 Content-Type: text/plain; charset=UTF-8

 Hi Chris,
 I don't have the packages you are using, but tracing this indicates
 that the page source contains the relative path of the graphic, in
 this case:

 /nwisweb/data/img/USGS.12144500.19581112.20140309..0.peak.pres.gif

 and you already have the server URL:

 nwis.waterdata.usgs.gov

 getting the path out of the page source isn't difficult, just split
 the text at double quotes and get the token following img src=. If I
 understand the arguments of download.file correctly, the path is the
 graphic.fn argument and the server URL is the graphic.url argument. I
 would paste them together and display the result to make sure that it
 matches the image you want. When I did this, the correct image
 appeared in my browser. I'm using Google Chrome, so I don't have to
 prepend the http://

 Jim

 On Fri, Jun 5, 2015 at 2:31 AM, Curtis DeGasperi
 curtis.degasp...@gmail.com wrote:
 I'm working on a script that downloads data from the USGS NWIS server.
 dataRetrieval makes it easy to quickly get the data in a neat tabular
 format, but I was also interested in getting the tabular text files -
 also fairly easy for me using download.file.

 However, I'm not skilled enough to work out how to download the nice
 graphic files that can be produced dynamically from the USGS NWIS
 server (for example:

http://nwis.waterdata.usgs.gov/nwis/peak?site_no=12144500agency_cd=USGSformat=img
)

 My question is how do I get the image from this web page and save it
 to a local directory? scrapeR returns the information from the page
 and I suspect this is a possible solution path, but I don't know what
 the next step is.

 My code provided below works from a list I've created of USGS flow
 gauging stations.

 Curtis

 ## Code to process USGS daily flow data for high and low flow analysis
 ## Need to start with list of gauge ids to process
 ## Can't figure out how to automate download of images

 require(dataRetrieval)
 require(data.table)
 require(scrapeR)

 df - read.csv(usgs_stations.csv, header=TRUE)

 lstas -length(df$siteno) #length of locator list

 print(paste('Processsing...',df$name[1],' ',df$siteno[1], sep = ))

 datall -  readNWISpeak(df$siteno[1])

 for (a in 2:lstas) {
   # Print station being processed
   print(paste('Processsing...',df$name[a],' ',df$siteno[a], sep = ))

   dat-  readNWISpeak(df$siteno[a])

   datall - rbind(datall,dat)

 }

 write.csv(datall, file = usgs_peaks.csv)

 # Retrieve ascii text files and graphics

 for (a in 1:lstas) {

   print(paste('Processsing...',df$name[1],' ',df$siteno[1], sep = ))

   graphic.url -
 paste('http://nwis.waterdata.usgs.gov/nwis/peak?site_no=
',df$siteno[a],'agency_cd=USGSformat=img',
 sep = )
   peakfq.url -
 paste('http://nwis.waterdata.usgs.gov/nwis/peak?site_no=
',df$siteno[a],'agency_cd=USGSformat=hn2',
 sep = )
   tab.url  - paste('http://nwis.waterdata.usgs.gov/nwis/peak?site_no=
',df$siteno[a],'agency_cd=USGSformat=rdb',
 sep = )

   graphic.fn - paste

Re: [R] web scraping image

2015-06-04 Thread Jim Lemon
Hi Chris,
I don't have the packages you are using, but tracing this indicates
that the page source contains the relative path of the graphic, in
this case:

/nwisweb/data/img/USGS.12144500.19581112.20140309..0.peak.pres.gif

and you already have the server URL:

nwis.waterdata.usgs.gov

Getting the path out of the page source isn't difficult: just split
the text at double quotes and take the token immediately following img src=. If I
understand the arguments of download.file correctly, the path is the
graphic.fn argument and the server URL is the graphic.url argument. I
would paste them together and display the result to make sure that it
matches the image you want. When I did this, the correct image
appeared in my browser. I'm using Google Chrome, so I don't have to
prepend the http://.
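
In base R that suggestion might look something like the following rough,
untested sketch. It assumes the markup really does put the opening quote
directly after img src=, and that the first such match is the peak-flow plot.

url      <- "http://nwis.waterdata.usgs.gov/nwis/peak?site_no=12144500&agency_cd=USGS&format=img"
page_txt <- paste(readLines(url, warn = FALSE), collapse = " ")  # page source as one string
tokens   <- strsplit(page_txt, '"', fixed = TRUE)[[1]]           # split at double quotes
img_path <- tokens[grep("img src=$", tokens) + 1]                # token right after img src=
download.file(paste0("http://nwis.waterdata.usgs.gov", img_path[1]),
              "12144500_peak.gif", mode = "wb")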

Jim

On Fri, Jun 5, 2015 at 2:31 AM, Curtis DeGasperi
curtis.degasp...@gmail.com wrote:
 I'm working on a script that downloads data from the USGS NWIS server.
 dataRetrieval makes it easy to quickly get the data in a neat tabular
 format, but I was also interested in getting the tabular text files -
 also fairly easy for me using download.file.

 However, I'm not skilled enough to work out how to download the nice
 graphic files that can be produced dynamically from the USGS NWIS
 server (for example:
 http://nwis.waterdata.usgs.gov/nwis/peak?site_no=12144500agency_cd=USGSformat=img)

 My question is how do I get the image from this web page and save it
 to a local directory? scrapeR returns the information from the page
 and I suspect this is a possible solution path, but I don't know what
 the next step is.

 My code provided below works from a list I've created of USGS flow
 gauging stations.

 Curtis

 ## Code to process USGS daily flow data for high and low flow analysis
 ## Need to start with list of gauge ids to process
 ## Can't figure out how to automate download of images

 require(dataRetrieval)
 require(data.table)
 require(scrapeR)

 df - read.csv(usgs_stations.csv, header=TRUE)

 lstas -length(df$siteno) #length of locator list

 print(paste('Processsing...',df$name[1],' ',df$siteno[1], sep = ))

 datall -  readNWISpeak(df$siteno[1])

 for (a in 2:lstas) {
   # Print station being processed
   print(paste('Processsing...',df$name[a],' ',df$siteno[a], sep = ))

   dat-  readNWISpeak(df$siteno[a])

   datall - rbind(datall,dat)

 }

 write.csv(datall, file = usgs_peaks.csv)

 # Retrieve ascii text files and graphics

 for (a in 1:lstas) {

   print(paste('Processsing...',df$name[1],' ',df$siteno[1], sep = ))

   graphic.url -
 paste('http://nwis.waterdata.usgs.gov/nwis/peak?site_no=',df$siteno[a],'agency_cd=USGSformat=img',
 sep = )
   peakfq.url -
 paste('http://nwis.waterdata.usgs.gov/nwis/peak?site_no=',df$siteno[a],'agency_cd=USGSformat=hn2',
 sep = )
   tab.url  - 
 paste('http://nwis.waterdata.usgs.gov/nwis/peak?site_no=',df$siteno[a],'agency_cd=USGSformat=rdb',
 sep = )

   graphic.fn - paste('graphic_',df$siteno[a],'.gif', sep = )
   peakfq.fn - paste('peakfq_',df$siteno[a],'.txt', sep = )
   tab.fn  - paste('tab_',df$siteno[a],'.txt', sep = )

   download.file(graphic.url,graphic.fn,mode='wb') # This apparently
 doesn't work - file is empty
   download.file(peakfq.url,peakfq.fn)
   download.file(tab.url,tab.fn)
 }

 # scrapeR
 pageSource-scrape(url=http://nwis.waterdata.usgs.gov/nwis/peak?site_no=12144500agency_cd=USGSformat=img,headers=TRUE,
 parse=FALSE)
 page-scrape(object=pageSource)

 __
 R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] web scraping image

2015-06-04 Thread Curtis DeGasperi
I'm working on a script that downloads data from the USGS NWIS server.
dataRetrieval makes it easy to quickly get the data in a neat tabular
format, but I was also interested in getting the tabular text files -
also fairly easy for me using download.file.

However, I'm not skilled enough to work out how to download the nice
graphic files that can be produced dynamically from the USGS NWIS
server (for example:
http://nwis.waterdata.usgs.gov/nwis/peak?site_no=12144500&agency_cd=USGS&format=img)

My question is how do I get the image from this web page and save it
to a local directory? scrapeR returns the information from the page
and I suspect this is a possible solution path, but I don't know what
the next step is.

My code provided below works from a list I've created of USGS flow
gauging stations.

Curtis

## Code to process USGS daily flow data for high and low flow analysis
## Need to start with list of gauge ids to process
## Can't figure out how to automate download of images

require(dataRetrieval)
require(data.table)
require(scrapeR)

df <- read.csv("usgs_stations.csv", header=TRUE)

lstas <- length(df$siteno) #length of locator list

print(paste('Processsing...',df$name[1],' ',df$siteno[1], sep = ""))

datall <-  readNWISpeak(df$siteno[1])

for (a in 2:lstas) {
  # Print station being processed
  print(paste('Processsing...',df$name[a],' ',df$siteno[a], sep = ""))

  dat <-  readNWISpeak(df$siteno[a])

  datall <- rbind(datall,dat)

}

write.csv(datall, file = "usgs_peaks.csv")

# Retrieve ascii text files and graphics

for (a in 1:lstas) {

  print(paste('Processsing...',df$name[1],' ',df$siteno[1], sep = ""))

  graphic.url <-
    paste('http://nwis.waterdata.usgs.gov/nwis/peak?site_no=',df$siteno[a],'&agency_cd=USGS&format=img',
          sep = "")
  peakfq.url <-
    paste('http://nwis.waterdata.usgs.gov/nwis/peak?site_no=',df$siteno[a],'&agency_cd=USGS&format=hn2',
          sep = "")
  tab.url <-
    paste('http://nwis.waterdata.usgs.gov/nwis/peak?site_no=',df$siteno[a],'&agency_cd=USGS&format=rdb',
          sep = "")

  graphic.fn <- paste('graphic_',df$siteno[a],'.gif', sep = "")
  peakfq.fn <- paste('peakfq_',df$siteno[a],'.txt', sep = "")
  tab.fn <- paste('tab_',df$siteno[a],'.txt', sep = "")

  download.file(graphic.url,graphic.fn,mode='wb') # This apparently
doesn't work - file is empty
  download.file(peakfq.url,peakfq.fn)
  download.file(tab.url,tab.fn)
}

# scrapeR
pageSource <- scrape(url="http://nwis.waterdata.usgs.gov/nwis/peak?site_no=12144500&agency_cd=USGS&format=img",
                     headers=TRUE, parse=FALSE)
page <- scrape(object=pageSource)
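
(A possible next step, sketched here with the XML package rather than
scrapeR -- untested, and the assumption that the first img node on the
page is the peak-flow plot is just that, an assumption:)

library(XML)

doc <- htmlParse("http://nwis.waterdata.usgs.gov/nwis/peak?site_no=12144500&agency_cd=USGS&format=img")
src <- xpathSApply(doc, "//img", xmlGetAttr, "src")   # relative paths of all img nodes
download.file(paste0("http://nwis.waterdata.usgs.gov", src[1]),
              "12144500.gif", mode = "wb")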

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Web Scraping

2013-10-04 Thread Mohamed Anany
Hello everybody,
I just started using R and I'm presenting a poster for R day at Kennesaw
State University and I really need some help in terms of web scraping.
I'm trying to extract used cars data from www.cars.com to include the
mileage, year, model, make, price, CARFAX availability and Technology
package availability. I've done some research, and everything points to the
XML package and RCurl package. I also got my hands on a function that would
capture all the text in the web page and store it as a huge character vector.
I've never done data mining before, so when I read the help documents on the
packages I mentioned earlier it's like reading Chinese. I would appreciate it
if you could guide me through this process of data extraction.
Here's an example of what the data would look like:

Cost     Year   Mileage   Tech   CARFAX   Make   Model
$32000   1999   57,987    1      FREE     Audi   A4

Here's the link to the search:-
http://www.cars.com/for-sale/searchresults.action?stkTyp=U&tracktype=used&ccmkId=20049&AmbMkId=20049&AmbMkNm=Audi&make=Audi&AmbMdNm=A4&model=A4&mdId=20596&AmbMdId=20596&rd=100&zc=30062&searchSource=QUICK_FORM&enableSeo=1

I'm not expecting you to write the whole code for me, but just some
guidance and where to start and what functions would be useful in my
situation.
Thanks a lot anyway.

Regards,
M. Samir Anany

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Web Scraping

2013-10-04 Thread Ista Zahn
Hi,

I have a short demo at https://gist.github.com/izahn/5785265 that
might get you started.
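
The general shape of that kind of scrape with the XML package looks
something like the sketch below. It is illustrative only: the XPath
expressions are placeholders, the real class names have to be read off
the cars.com page source, and the site's terms of use should be checked
before scraping it.

library(XML)

url <- "http://www.cars.com/for-sale/searchresults.action?stkTyp=U&tracktype=used&ccmkId=20049&AmbMkId=20049&AmbMkNm=Audi&make=Audi&AmbMdNm=A4&model=A4&mdId=20596&AmbMdId=20596&rd=100&zc=30062&searchSource=QUICK_FORM&enableSeo=1"
doc <- htmlParse(url)   # download and parse the results page

## placeholder XPath -- substitute the real class names from the page source
prices <- xpathSApply(doc, "//span[contains(@class, 'price')]", xmlValue)
years  <- xpathSApply(doc, "//span[contains(@class, 'year')]",  xmlValue)
head(prices); head(years)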

Best,
Ista

On Fri, Oct 4, 2013 at 12:51 PM, Mohamed Anany
melsa...@students.kennesaw.edu wrote:
 Hello everybody,
 I just started using R and I'm presenting a poster for R day at Kennesaw
 State University and I really need some help in terms of web scraping.
 I'm trying to extract used cars data from www.cars.com to include the
 mileage, year, model, make, price, CARFAX availability and Technology
 package availability. I've done some research, and everything points to the
 XML package and RCurl package. I also got my hands on a function that would
 capture all the text in the web page and store as a huge character vector.
 I've never done data mining before so when i read the help documents on the
 packages i mentioned earlier is like reading Chinese. I would appreciate it
 if you guide me through this process of data extraction.
 Here's an example of what the data would look like:

 CostYearMileageTechCARFAXMake  Model
 $32000 1999   57,987  1 FREEAudi   A4

 Here's the link to the search:-
 http://www.cars.com/for-sale/searchresults.action?stkTyp=Utracktype=usedccmkId=20049AmbMkId=20049AmbMkNm=Audimake=AudiAmbMdNm=A4model=A4mdId=20596AmbMdId=20596rd=100zc=30062searchSource=QUICK_FORMenableSeo=1

 I'm not expecting you to write the whole code for me, but just some
 guidance and where to start and what functions would be useful in my
 situation.
 Thanks a lot anyway.

 Regards,
 M. Samir Anany

 [[alternative HTML version deleted]]

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.