Re: [R-sig-eco] Help with function to webscrap

Kay Cichini Thu, 28 Jun 2012 03:15:33 -0700

Hi,

no need for RCurl - this should suffice:


require(XML)

input = "panthera-uncia"
h <- htmlParse(paste("http://api.iucnredlist.org/go/",
               input, sep = ""))

(status <- xpathSApply(h, '//div[@id="red_list_category_code"]', xmlValue))
[1] "EN"
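
If you have many species, the same two calls could be wrapped roughly like this. It is only a sketch: iucn_url() and get_iucn_status() are names I made up here, the NA fallback for pages without the red_list_category_code element is my assumption, and it relies on the API URL above staying as it is.

```r
# Hypothetical helper: turn "Panthera uncia" into the API URL used above.
iucn_url <- function(species) {
  slug <- gsub(" +", "-", tolower(trimws(species)))
  paste("http://api.iucnredlist.org/go/", slug, sep = "")
}

# Hypothetical wrapper around the two calls above. Returns NA when the
# page has no red_list_category_code element (assumed fallback).
get_iucn_status <- function(species) {
  h <- XML::htmlParse(iucn_url(species))
  status <- XML::xpathSApply(h, '//div[@id="red_list_category_code"]',
                             XML::xmlValue)
  if (length(status) == 0) NA_character_ else status[1]
}

# Needs network access and the XML package, so not run here:
# sapply(c("Panthera uncia", "Bos taurus"), get_iucn_status)
```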

Many thanks for pointing out the IUCN-API, Eduard - it is awesome!
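
On questions 01 and 02 from the original post, here is a minimal base-R sketch of one way to go about it. with_retry() and read_page() are hypothetical names of mine, and the defaults (3 tries, 1 second wait, NA as fallback) are arbitrary assumptions. The idea is to catch the error with tryCatch(), wait, and try again, so that one failed request does not abort a long loop.

```r
# Hypothetical helper: call f(); on error wait and retry, and return a
# fallback value after `tries` failed attempts instead of raising the error.
with_retry <- function(f, tries = 3, wait = 1, fallback = NA) {
  for (attempt in seq_len(tries)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    if (attempt < tries) Sys.sleep(wait)
  }
  fallback
}

# Hypothetical helper: warn = FALSE silences the "incomplete final line"
# warning of readLines(), and on.exit() closes the connection afterwards.
read_page <- function(link) {
  con <- url(link)
  on.exit(close(con))
  readLines(con, warn = FALSE)
}

# Possible use with sp.check() from the original post (needs network):
# data$check <- NA
# for (i in seq_len(nrow(data))) {
#   data$check[i] <- with_retry(function() sp.check(data$especie[i])$sp.aceito)
# }
```

Note that sp.check() returns a list with two elements, so the loop above stores only sp.aceito. Other warnings could be silenced with suppressWarnings(), but that hides all of them, so I would use it sparingly.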

Best,
Kay

2012/6/27 Eduard Szöcs <szoe8...@uni-landau.de>

> Hai Augusto,
>
> regarding question #3:
> You could use the red list API with RCurl and XML packages.
> Here is an example:
>
> > require(RCurl)
> > require(XML)
> > get_IUCN_status <- function(x) {
> +   spec <- tolower(x)
> +   spec <- gsub(" ", "-", spec)
> +   url <- paste("http://api.iucnredlist.org/go/", spec, sep="")
> +   get <- getURL(url, followlocation = TRUE)
> +   h <- htmlParse(get)
> +   status <- xpathSApply(h, '//div[@id ="red_list_category_code"]', xmlValue)
> +   return(status)
> + }
> >
> > get_IUCN_status("Panthera uncia")
> [1] "EN"
>
> For more resources just type 'webscraping R' in your favourite search
> engine.
>
> HTH,
>
> Eduard
>
>
> On 26/06/12 20:57, Augusto Ribas wrote:
>
>> Hello.
>> I'm having problems with a function for web scraping.
>> I have a long list like this example:
>>
>> data<-data.frame(especie=c("Rana pipiens","Rana vaillanti","Ctenosaura similis","Bos taurus"),
>>                  group=c("sapo","sapo","reptil","mamifero"))
>>
>> And, as some species names are out of date, I am trying to make a
>> function to check Catalogue of Life (http://www.catalogueoflife.org/)
>> and get the current names.
>> This has some problems, such as species names that were split, but it
>> helps as a first check.
>>
>> So I made this function to scrape the data.
>> It's simple: it searches the site by building a link from the keywords,
>> then follows the first link in the list of results and gets the
>> accepted name and author, returning the results as a list.
>> For example:
>>
>> > sp.check("Rana pipiens")
>> $sp.aceito
>> [1] "Lithobates pipiens"
>>
>> $autor
>> [1] "Schreber, 1782"
>>
>> But sometimes the function cannot access the internet and gives an error.
>>
>> I wrote this function by copying some examples from forums, but I
>> have some questions:
>>
>> 01) How do I suppress the readLines() warnings?
>>
>> 02) How can I make the function try again when it cannot access the
>> internet, or just print something like "Can't access internet", so that
>> when I try something like:
>>
>> data$check<-NA
>> for(i in 1:nrow(data)) {
>>  data$check[i]<-sp.check(data$especie[i])
>>  }
>>
>> the loop doesn't stop?
>> I made a short list here, but with 500 or more rows it usually stops
>> in the middle.
>>
>> 03) Does anyone have an example of how to scrape the status of species
>> from http://www.iucnredlist.org/, as it does not use the keyword in the
>> link? Is there a simple tutorial for someone without any background in
>> programming or computer science?
>>
>>
>> Well, thanks for your attention.
>>
>> #function sp.check
>>
>> sp.check<-function(especie) {
>> #split species name
>> especie<-as.character(especie)
>>
>> gen<-strsplit(especie,"\\ ")[[1]][1]
>> esp<-strsplit(especie,"\\ ")[[1]][2]
>>
>> #making first link
>> link<-paste("http://www.catalogueoflife.org/col/search/all/key/",gen,"+",esp,"/match/1",sep="")
>> link <- iconv(link, 'latin1', 'UTF-8')
>> Encoding(link) <- 'bytes'
>>
>> #reading table of results
>> pagina <- readLines(url(link))
>>
>> n.linhas<-which(pagina%in%"        <td class=\"field_header_black\">")
>>
>> #are there any results?
>> if(length(n.linhas)>0) {
>>
>> pag.sp<-strsplit(pagina[n.linhas[1]+1],'\\"')[[1]][2]
>>
>> #second link
>> link2 <- paste("http://www.catalogueoflife.org",pag.sp,sep="")
>> link2 <- iconv(link2, 'latin1', 'UTF-8')
>> Encoding(link2) <- 'bytes'
>> link2
>>
>> #read
>> pagina2 <- readLines(url(link2))
>>
>> #get line of interest
>> linha2<-grep('(accepted name)',pagina2)
>> sp.final<-pagina2[linha2]
>>
>> #get species name
>> corte1<-strsplit(sp.final,'<i>')[[1]][2]
>> sp.aceito<-strsplit(corte1,'</i>')[[1]][1]
>>
>> #get author
>> corte2<-strsplit(sp.final,'\\(')[[1]][2]
>> autor<-strsplit(corte2,')')[[1]][1]
>> }else {
>> sp.aceito<-c("Não encontrado")
>> autor<-c("Não encontrado")
>> }
>> return(list(sp.aceito=sp.aceito,autor=autor))
>> }
>>
>> --
>> Thanks,
>> Augusto C. A. Ribas
>>
>> Personal site: http://augustoribas.heliohost.org
>> Lattes: http://lattes.cnpq.br/7355685961127056
>>
>> _______________________________________________
>> R-sig-ecology mailing list
>> R-sig-ecology@r-project.org
>> https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
>>
>>
