Re: [UPHPU] Web site scraping

Alvaro Carrasco Thu, 25 Sep 2008 08:46:16 -0700

I forgot one thing: the Scriptable Browser that ships with SimpleTest.
http://www.lastcraft.com/browser_documentation.php


This makes it really easy to deal with forms, authentication, clicking
on links, etc.
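
For illustration, here is a rough sketch of what that can look like. It
assumes SimpleTest is unpacked in a simpletest/ directory next to the
script. The function name, the field names ("username", "password"), and
the "Log in" button label are hypothetical and would have to match the
real form.

```php
<?php
// Sketch only: log in through a form, follow a link, return the page HTML.
// fetchAfterLogin() and its parameter names are made up for this example.
function fetchAfterLogin($loginUrl, $user, $pass, $linkLabel)
{
    // Assumption: SimpleTest lives in ./simpletest
    require_once 'simpletest/browser.php';

    $browser = new SimpleBrowser();
    $browser->get($loginUrl);

    // Fill in the form fields (hypothetical field names).
    $browser->setField('username', $user);
    $browser->setField('password', $pass);

    // Submit by button label (hypothetical label).
    $browser->clickSubmit('Log in');

    // Follow a link by its visible text, then hand back the raw HTML.
    $browser->clickLink($linkLabel);
    return $browser->getContent();
}
```

The returned HTML can then go straight into tidy and DOMDocument.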

Seriously, the combination of scriptable browser, tidy, and xpath makes
scraping a piece of cake.
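
A minimal sketch of the tidy / DOMDocument / xpath part, assuming the
page content is already in a string. The sample markup and the "item"
class are invented for the example; tidy is used only if the extension is
installed, otherwise DOMDocument's own lenient HTML parser does the
cleanup.

```php
<?php
// Stand-in for a fetched page (hypothetical markup, deliberately sloppy).
$html = '<html><body>'
      . '<div class="item"><a href="/a">First</a></div>'
      . '<div class="item"><a href="/b">Second</a><p>unclosed</div>'
      . '</body></html>';

// Step 1: repair the markup with tidy when the extension is available.
if (function_exists('tidy_repair_string')) {
    $html = tidy_repair_string($html, array(), 'utf8');
}

// Step 2: load it into a DOMDocument, silencing warnings about bad HTML.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Step 3: pull out what you want with xpath.
$xpath = new DOMXPath($doc);
$links = array();
foreach ($xpath->query('//div[@class="item"]/a') as $a) {
    $links[$a->getAttribute('href')] = trim($a->textContent);
}

print_r($links);
```

The xpath expression is the only part tied to the page layout, so when
the site changes that is usually the only line to touch.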

Alvaro

Alvaro Carrasco wrote:
> In my experience, the easiest way is: run the page through tidy, load it
> into a DOMDocument, and use xpath.
> 
> The xpath patterns are SO much easier to read and write than regex, and
> are more resistant to changes to the website (if you write them correctly).
> You can also use regex within xpath if you ever need it.
> 
> Alvaro
> 
> Nathan Lane wrote:
>> I want to make what in effect is a website scraper using PHP, but it isn't
>> obvious how this would best be done. I've tried using DOMDocument and I'm
>> not sure if that's the best option or not. I'd really like to use something
>> where I could use XPath to get the elements out that I want. Recently I
>> wrote a similar program in C# that I call HttpAnalyzer. Could I just use
>> that with PHP (i.e. call it from PHP) to get what I'm looking for? Any
>> suggestions?


_______________________________________________

UPHPU mailing list
[email protected]
http://uphpu.org/mailman/listinfo/uphpu
IRC: #uphpu on irc.freenode.net
