Re: [R] Scraping a web page.

2012-05-16 Thread Keith Weintraub
On Tue, May 15, 2012 at 7:06 AM, Keith Weintraub kw1...@gmail.com wrote: Thanks, That was very helpful. I am using readLines

[R] Scraping a web page.

2012-05-16 Thread Keith Weintraub
for your detailed reply, KW -- On Tue, 15 May 2012 21:02:05 -0700, Duncan Temple Lang dun...@wald.ucdavis.edu wrote: Hi Keith

Re: [R] Scraping a web page.

2012-05-15 Thread Keith Weintraub
Thanks, That was very helpful. I am using readLines and grep. If grep isn't powerful enough I might end up using the XML package but I hope that won't be necessary. Thanks again, KW -- On May 14, 2012, at 7:18 PM, J Toll wrote: On Mon, May 14, 2012 at 4:17 PM, Keith Weintraub

Re: [R] Scraping a web page.

2012-05-15 Thread Gabor Grothendieck
On Tue, May 15, 2012 at 7:06 AM, Keith Weintraub kw1...@gmail.com wrote: Thanks,  That was very helpful. I am using readLines and grep. If grep isn't powerful enough I might end up using the XML package but I hope that won't be necessary. This only uses readLines and strapplyc (from

Re: [R] Scraping a web page.

2012-05-15 Thread Duncan Temple Lang
Hi Keith Of course, it doesn't necessarily matter how you get the job done if it actually works correctly. But for a general approach, it is useful to use general tools and can lead to more correct, more robust, and more maintainable code. Since htmlParse() in the XML package can both

[R] Scraping a web page.

2012-05-14 Thread Keith Weintraub
Folks, I want to scrape a series of web-page sources for strings like the following: /en/Ships/A-8605507.html /en/Ships/Aalborg-8122830.html which appear in an href inside an a tag inside a div tag inside a table. In fact all I want is the (exactly) 7-digit number before .html. The good news
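A minimal sketch of the regex route for the question above, using base R only. The sample lines stand in for what readLines() would return from a real page; their markup is an assumption made for illustration, not taken from the actual site. The pattern relies on the number sitting directly between a hyphen and ".html", as in the two example paths.

```r
# Hypothetical page source; in practice: page <- readLines(url)
page <- c(
  '<table><tr><td><div><a href="/en/Ships/A-8605507.html">A</a></div></td></tr>',
  '<tr><td><div><a href="/en/Ships/Aalborg-8122830.html">Aalborg</a></div></td></tr></table>'
)

# Exactly seven digits, preceded by a hyphen and followed by ".html".
# The lookbehind and lookahead keep the hyphen and extension out of the match.
pattern <- "(?<=-)[0-9]{7}(?=\\.html)"

# gregexpr() finds every match per line; regmatches() extracts them.
ids <- unlist(regmatches(page, gregexpr(pattern, page, perl = TRUE)))
print(ids)
```

Because the hyphen must come immediately before the digits, a longer run of digits after the hyphen would not match, which keeps the "exactly 7-digit" requirement.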

Re: [R] Scraping a web page.

2012-05-14 Thread J Toll
On Mon, May 14, 2012 at 4:17 PM, Keith Weintraub kw1...@gmail.com wrote: Folks,  I want to scrape a series of web-page sources for strings like the following: /en/Ships/A-8605507.html /en/Ships/Aalborg-8122830.html which appear in an href inside an a tag inside a div tag inside a table.

[R] Scraping a web page

2009-12-03 Thread Michael Conklin
I would like to be able to submit a list of URLs of various webpages and extract the content i.e. not the mark-up of those pages. I can find plenty of examples in the XML library of extracting links from pages but I cannot seem to find a way to extract the text. Any help would be greatly

Re: [R] Scraping a web page

2009-12-03 Thread Gabor Grothendieck
If you only need to grab text it can be conveniently done with lynx. This example is for Windows but it's nearly the same on other platforms: out <- shell("lynx.bat --dump --nolist http://www.google.com", intern = TRUE) head(out) [1] [2]Web Images Videos Maps News Books Gmail more » [3]

Re: [R] Scraping a web page

2009-12-03 Thread Sharpie
Michael Conklin wrote: I would like to be able to submit a list of URLs of various webpages and extract the content i.e. not the mark-up of those pages. I can find plenty of examples in the XML library of extracting links from pages but I cannot seem to find a way to extract the text. Any

Re: [R] Scraping a web page

2009-12-03 Thread hadley wickham
If you're after text, then it's probably a matter of locating the element that encloses the data you want, perhaps by using getNodeSet along with an XPath[1] that specifies the element you are interested in. The text can then be recovered using the xmlValue() function. And rather than
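A small sketch of the getNodeSet() plus xmlValue() approach described above, assuming the XML package is installed. The HTML string and the "main" id are made up for illustration; a real page would need its own XPath.

```r
library(XML)

# Hypothetical document; a real one would be parsed from a URL or file.
html <- paste0(
  '<html><body>',
  '<div id="main"><p>First paragraph.</p><p>Second paragraph.</p></div>',
  '<div id="nav">Menu</div>',
  '</body></html>'
)

# asText = TRUE tells htmlParse() that the argument is markup, not a file name.
doc <- htmlParse(html, asText = TRUE)

# Select only the paragraphs inside the enclosing element of interest.
nodes <- getNodeSet(doc, "//div[@id='main']/p")

# xmlValue() returns the text content of each selected node.
txt <- sapply(nodes, xmlValue)
print(txt)
```

Narrowing the XPath to the enclosing element is what keeps navigation text such as "Menu" out of the result.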

Re: [R] Scraping a web page

2009-12-03 Thread Duncan Temple Lang
Hi Michael If you just want all of the text that is displayed in the HTML document, then you might use an XPath expression to get all the text() nodes and get their value. An example is doc = htmlParse("http://www.omegahat.org/") txt = xpathSApply(doc, "//body//text()", xmlValue) The result