In the past I've used regular expressions, but after hearing Alvaro
mention tidy+xpath at a UPHPU meeting, I started using that. I've
loved it. SimpleXML is easy to use. I haven't ventured into XSLT, like
Ray suggested, but tidy+xpath has been great.
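
For anyone who hasn't tried it, here is a minimal sketch of the SimpleXML half of that approach. The markup, the "prices" id and the variable names are made up for illustration, and the string stands in for a page that tidy has already cleaned into well-formed XHTML. One caveat: if the cleaned markup declares the XHTML namespace, the xpath needs a prefix registered with registerXPathNamespace() first.

```php
<?php
// Hypothetical XHTML standing in for a page that tidy has already cleaned up.
$xhtml = '<html><body><table id="prices">'
       . '<tr><td>Widget</td><td>9.99</td></tr>'
       . '<tr><td>Gadget</td><td>19.50</td></tr>'
       . '</table></body></html>';

// SimpleXML only accepts well-formed XML, which is why tidy comes first.
$page = simplexml_load_string($xhtml);

// Collect name => price pairs from each row of the (made-up) prices table.
$prices = array();
foreach ($page->xpath('//table[@id="prices"]/tr') as $row) {
    $name  = (string) $row->td[0];
    $price = (float) (string) $row->td[1];
    $prices[$name] = $price;
}

print_r($prices);
```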
On a similar note, I've been looking at SimpleTest's Web Testing
module and it seems pretty powerful. You can use it for far more than
unit testing. It's like a scriptable browser, in which you can "click"
links, fill out forms, work with cookies, etc. The example on the
website shows how to perform an automated Google search:
http://www.simpletest.org/en/start-testing.html#web
Richard
On Sep 25, 2008, at 9:44 AM, Alvaro Carrasco wrote:
I forgot one thing: Scriptable Browser.
http://www.lastcraft.com/browser_documentation.php
This makes it really easy to deal with forms, authentication, clicking
on links, etc.
Seriously, the combination of scriptable browser, tidy, and xpath
makes scraping a piece of cake.
Alvaro
Alvaro Carrasco wrote:
In my experience, the easiest way is: run the website through tidy,
load it into a DOMDocument, and use xpath.
The xpath patterns are SO much easier to read and write than regex,
and they are more resistant to changes to the website (if you write
them correctly).
You can also use regex within xpath if you ever need it.
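
A rough sketch of that pipeline, under a few assumptions: the HTML string, the "nav" id and the URL pattern are invented for the example, tidy is used only if the extension is loaded (libxml's HTML parser is fairly forgiving on its own), and the regex step goes through DOMXPath::registerPhpFunctions(), which is one way to call preg_match from xpath and needs PHP 5.3 or later.

```php
<?php
// Hypothetical page fragment; note the unclosed <li> tags.
$html = '<html><body><ul id="nav">'
      . '<li><a href="/a-1">First</a>'
      . '<li><a href="/b-2">Second</a>'
      . '</ul></body></html>';

// Step 1: clean the markup with tidy when the extension is available.
if (function_exists('tidy_repair_string')) {
    $html = tidy_repair_string($html, array('output-xhtml' => true), 'utf8');
}

// Step 2: load it into a DOMDocument, keeping parser warnings quiet.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);

// Step 3: query with xpath instead of pattern-matching the raw HTML.
$xpath = new DOMXPath($doc);
$links = array();
foreach ($xpath->query('//ul[@id="nav"]//a') as $a) {
    $links[$a->getAttribute('href')] = trim($a->textContent);
}

// Optional: regex inside xpath by exposing preg_match to the query.
$xpath->registerNamespace('php', 'http://php.net/xpath');
$xpath->registerPhpFunctions('preg_match');
$matches = $xpath->query(
    '//a[php:functionString("preg_match", "#^/b-\d+$#", @href) > 0]'
);

print_r($links);
echo $matches->length, "\n";
```

The xpath keys on the list's id rather than its position in the page, which is the sort of thing that keeps a query working when the surrounding layout changes.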
_______________________________________________
UPHPU mailing list
[email protected]
http://uphpu.org/mailman/listinfo/uphpu
IRC: #uphpu on irc.freenode.net