Oops, I just noticed that the link Alvaro sent refers to the same SimpleTest (not SimpleUnit) framework that I mentioned. Well, not exactly the same page, but it uses the same code base. The owner of lastcraft.com is the creator of SimpleTest. My bad.

Richard





On Sep 25, 2008, at 2:40 PM, Richard K Miller wrote:

In the past I've used regular expressions, but after hearing Alvaro mention tidy+xpath at a UPHPU meeting, I started using that. I've loved it. SimpleXML is easy to use. I haven't ventured into XSLT, like Ray suggested, but tidy+xpath has been great.

On a similar note, I've been looking at SimpleUnit's Web Testing module and it seems pretty powerful. You can use it for far more than unit testing. It's like a scriptable browser, in which you can "click" links, fill out forms, work with cookies, etc. The example on the website shows how to perform an automated Google search:

http://www.simpletest.org/en/start-testing.html#web

Richard



On Sep 25, 2008, at 9:44 AM, Alvaro Carrasco wrote:

I forgot one thing: Scriptable Browser.
http://www.lastcraft.com/browser_documentation.php

This makes it really easy to deal with forms, authentication, clicking
on links, etc.

Seriously, the combination of scriptable browser, tidy, and xpath makes
scraping a piece of cake.

Alvaro

Alvaro Carrasco wrote:
In my experience, the easiest way is: run website through tidy, load it
into a DOMDocument, and use xpath.

The xpath patterns are SO much easier to read and write than regex, and they are more resistant to changes to the website (if you write them correctly).
You can also use regex within xpath if you ever need it.
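
To illustrate the point about xpath being more resistant to markup changes than regex, here is a minimal sketch. It is written in Python with only the standard library, not the PHP tidy + DOMDocument stack discussed in this thread, and it assumes the page has already been cleaned into well-formed XHTML (the job tidy does). The sample page, the "result" class name, and both function names are hypothetical, made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page, assumed to be already cleaned into well-formed XHTML
# (the step that tidy performs in the workflow described above).
TIDIED_PAGE = """<html>
  <body>
    <div id="nav"><a href="/home">Home</a></div>
    <ul class="results">
      <li><a class="result" href="/item/1">First item</a></li>
      <li><a href="/item/2" class="result">Second item</a></li>
    </ul>
  </body>
</html>"""


def links_with_xpath(page):
    """Return (href, text) for every result link, using an XPath expression."""
    root = ET.fromstring(page)
    # ElementTree supports only a subset of XPath; this expression stays within it.
    anchors = root.findall(".//a[@class='result']")
    return [(a.get("href"), a.text) for a in anchors]


def links_with_regex(page):
    """The same extraction with a regex that assumes a fixed attribute order."""
    pattern = r'<a class="result" href="([^"]+)">([^<]+)</a>'
    return re.findall(pattern, page)


xpath_links = links_with_xpath(TIDIED_PAGE)
regex_links = links_with_regex(TIDIED_PAGE)
print(xpath_links)
print(regex_links)
```

The second link lists its attributes in a different order. The xpath query still finds both links, while this particular regex only finds the first one. A more careful regex could handle that case too, but the xpath expression did not need to anticipate it.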




_______________________________________________

UPHPU mailing list
[email protected]
http://uphpu.org/mailman/listinfo/uphpu
IRC: #uphpu on irc.freenode.net
