On 03/03/2011 08:07 AM, Mike Marchywka wrote:

Date: Thu, 3 Mar 2011 01:22:44 -0800
From: antuj...@gmail.com
To: r-help@r-project.org
Subject: [R] Developing a web crawler

Hi,

I wish to develop a web crawler in R. I have been using the functionalities
available under the RCurl package.
I am able to extract the html content of the site but I don't know how to go

In general this can be a big effort but there may be things in
text processing packages you could adapt to execute html and javascript.
However, I guess what I'd be looking for is something like a "webkit"
package or other open source browser with or without an "R" interface.
This actually may be an ideal solution for a lot of things as you get
all the content handlers of at least some browser.


Now that you mention it, I wonder if there are browser plugins to handle
"R" content (I'd have to give this some thought: put a script up as
a web page with MIME type "text/R" and have the browser execute it in R).

There are server-side solutions for this sort of thing; see
http://rapache.net/ . There was also a string of messages on R-devel some
years ago addressing the MIME type issue, beginning here:
http://tolstoy.newcastle.edu.au/R/devel/05/11/3054.html , though I don't
know whether there was a resolution. Some of the suggestions were text/x-R,
text/x-Rd, and application/x-RData.

-Matt




about analyzing the html formatted document.
I wish to know the frequency of a word in the document. I am only acquainted
with analyzing data sets.
So how should I go about analyzing data that is not available in table
format?
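
One way to get word frequencies from an HTML string can be sketched in base R. This is only a sketch under stated assumptions: the `html` string is a made-up stand-in for what getURL() returns, and stripping tags with a regular expression is a crude shortcut that will also keep script and style text on a real page. A proper HTML parser (for example from the XML package) would be more robust.

```r
# Hypothetical stand-in for the string returned by getURL()
html <- "<html><body><h1>Kindle review</h1><p>The Kindle is light. The screen is sharp.</p></body></html>"

# Crude tag removal: replace anything between < and > with a space
text <- gsub("<[^>]+>", " ", html)

# Lower-case the text, then split on runs of non-letter characters
words <- strsplit(tolower(text), "[^a-z]+")[[1]]
words <- words[words != ""]   # drop empty strings left by the split

# Frequency table, most common words first
freq <- sort(table(words), decreasing = TRUE)

freq[["kindle"]]   # count for a single word
head(freq, 5)      # the five most frequent words
```

The result of table() is a named vector of counts, so it can be indexed by word or converted with as.data.frame() if a data-set-like layout is more familiar.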

A few chunks of code that I wrote:
w <- getURL("http://www.amazon.com/Kindle-Wireless-Reader-Wifi-Graphite/dp/B003DZ1Y8Q/ref=dp_reviewsanchor#FullQuotes")
write.table(w,"test.txt")
t<- readLines(w)

readLines also didn't prove to be of any help.
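
A likely reason readLines() did not help: its first argument is a connection or a file name, whereas `w` holds the page content itself. Below is a minimal sketch of two alternatives; the string `w` here is a made-up stand-in for the real getURL() result.

```r
# Hypothetical stand-in for the character string returned by getURL()
w <- "<html>\n<body>\n<p>Hello world</p>\n</body>\n</html>"

# Option 1: wrap the string in a text connection so readLines() can read it
con <- textConnection(w)
lines1 <- readLines(con)
close(con)

# Option 2: split the string on newlines directly
lines2 <- strsplit(w, "\n", fixed = TRUE)[[1]]

identical(lines1, lines2)   # both approaches give the same lines
length(lines1)              # number of lines in the page source
```

If a copy on disk is wanted, writeLines(w, "test.txt") followed by readLines("test.txt") avoids the quoting and header that write.table() adds.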

Any help would be highly appreciated. Thanks in advance.


--
View this message in context: 
http://r.789695.n4.nabble.com/Developing-a-web-crawler-tp3332993p3332993.html
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
                                        



--
Matthew S Shotwell   Assistant Professor           School of Medicine
                     Department of Biostatistics   Vanderbilt University
