: Re: [R] Scraping a web page.
Message-ID:
CAP01uR=zdxHocxpsZdpT+4Kx2=L2vr9jnr=i=_Qhs39O=qo...@mail.gmail.com
Content-Type: text/plain; charset=ISO-8859-1
On Tue, May 15, 2012 at 7:06 AM, Keith Weintraub kw1...@gmail.com wrote:
Thanks,
That was very helpful. I am using readLines and grep.
Thanks again for your detailed reply,
KW
Message: 139
Date: Tue, 15 May 2012 21:02:05 -0700
From: Duncan Temple Lang dun...@wald.ucdavis.edu
To: r-help@r-project.org
Subject: Re: [R] Scraping a web page.
Message-ID: 4fb326bd.9080...@wald.ucdavis.edu
Content-Type: text/plain; charset=ISO-8859-1
Hi Keith
Thanks,
That was very helpful.
I am using readLines and grep. If grep isn't powerful enough I might end up
using the XML package but I hope that won't be necessary.
Thanks again,
KW
--
On May 14, 2012, at 7:18 PM, J Toll wrote:
On Mon, May 14, 2012 at 4:17 PM, Keith Weintraub kw1...@gmail.com wrote:
This only uses readLines and strapplyc (from the gsubfn package):
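A minimal sketch of that strapplyc route, using the two sample hrefs from the original question (gsubfn is the package that provides strapplyc; the variable names here are invented):

```r
library(gsubfn)  # provides strapplyc

# Sample hrefs as given in the original question
x <- c("/en/Ships/A-8605507.html", "/en/Ships/Aalborg-8122830.html")

# Capture the 7-digit number immediately before ".html";
# strapplyc returns the parenthesized back-reference for each string
ids <- unlist(strapplyc(x, "(\\d{7})\\.html"))
ids
# "8605507" "8122830"
```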
Hi Keith
Of course, it doesn't necessarily matter how you get the job done
if it actually works correctly. But as a general approach,
using general tools can lead to more correct,
more robust, and more maintainable code.
Since htmlParse() in the XML package can both
Folks,
I want to scrape a series of web-page sources for strings like the following:
/en/Ships/A-8605507.html
/en/Ships/Aalborg-8122830.html
which appear in an href inside an a tag inside a div tag inside a table.
In fact all I want is the (exactly) 7-digit number before .html.
The good news
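For exactly that task, pulling the 7-digit number out of hrefs like the two above, a base-R sketch needs no packages at all (the sample strings are the hrefs quoted in the question):

```r
# The two sample hrefs from the question
hrefs <- c("/en/Ships/A-8605507.html", "/en/Ships/Aalborg-8122830.html")

# Match 7 digits immediately followed by ".html";
# the lookahead (?=...) requires perl = TRUE
ids <- regmatches(hrefs, regexpr("[0-9]{7}(?=\\.html)", hrefs, perl = TRUE))
ids
# "8605507" "8122830"
```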
If you only need to grab text it can be conveniently done with lynx. This
example is for Windows but it's nearly the same on other platforms:
out <- shell("lynx.bat --dump --nolist http://www.google.com", intern = TRUE)
head(out)
[1]
[2]Web Images Videos Maps News Books Gmail more »
[3]
Michael Conklin wrote:
I would like to be able to submit a list of URLs of various webpages and
extract the content, i.e. not the mark-up, of those pages. I can find
plenty of examples in the XML library of extracting links from pages but I
cannot seem to find a way to extract the text. Any
If you're after text, then it's probably a matter of locating the element
that encloses the data you want -- perhaps by using getNodeSet along with an
XPath[1] expression that specifies the element you are interested in. The text
can then be recovered using the xmlValue() function.
And rather than
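A sketch of that getNodeSet/xmlValue route, assuming the XML package; the HTML fragment below is invented here purely to mirror the table/div/a structure described earlier in the thread:

```r
library(XML)

# Invented fragment mirroring the structure from the original question:
# an <a> inside a <div> inside a <table>
page <- '<table><tr><td><div>
  <a href="/en/Ships/A-8605507.html">A</a>
  <a href="/en/Ships/Aalborg-8122830.html">Aalborg</a>
</div></td></tr></table>'

doc   <- htmlParse(page, asText = TRUE)
nodes <- getNodeSet(doc, "//table//div//a")

sapply(nodes, xmlValue)            # the displayed link text
sapply(nodes, xmlGetAttr, "href")  # the href attributes
```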
Hi Michael
If you just want all of the text that is displayed in the
HTML document, then you might use an XPath expression to get
all the text() nodes and get their values.
An example is
doc = htmlParse("http://www.omegahat.org/")
txt = xpathSApply(doc, "//body//text()", xmlValue)
The result