Re: [UPHPU] Web site scraping

thebigdog Thu, 25 Sep 2008 09:23:49 -0700

Nathan Lane wrote:
> I want to make what in effect is a website scraper using PHP, but it isn't
> obvious how this would best be done. I've tried using DOMDocument and I'm
> not sure if that's the best option or not. I'd really like to use something
> where I could use XPath to get the elements out that I want. Recently I
> wrote a similar program in C# that I call HttpAnalyzer. Could I just use
> that with PHP (i.e. call it from PHP) to get what I'm looking for? Any
> suggestions?


I would agree with Alvaro and Walt. You could actually combine the two
suggestions. I have done the following:

1. Download the page
2. Run the page through tidy (to clean up the tags)
3. Apply an XSLT transform with DOM
4. Retrieve the results

This has worked really well, both in speed and for the amount of data I have
processed. XSLT can contain logic, which is really nice. By using XSLT I can
create various transformations, which gives greater flexibility and
customization, and I can still use all the XML technologies like XPath.
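
The four steps above can be sketched roughly as follows. This is only a
minimal illustration, not the exact code I ran: the function names
(fetch_page, tidy_html, scrape), the sample HTML and the stylesheet are all
made up for the example. It assumes the dom and xsl extensions are loaded, and
it uses tidy only when that extension is installed.

```php
<?php
// Sketch only: fetch_page(), tidy_html() and scrape() are hypothetical names.

// Step 1: download the page.
function fetch_page(string $url): string
{
    $html = file_get_contents($url);
    if ($html === false) {
        throw new RuntimeException("Could not download $url");
    }
    return $html;
}

// Step 2: let tidy close and re-nest sloppy tags.
function tidy_html(string $html): string
{
    if (!function_exists('tidy_repair_string')) {
        // Assumption: without tidy, rely on libxml's lenient HTML parser.
        return $html;
    }
    $clean = tidy_repair_string($html, ['wrap' => 0], 'utf8');
    return $clean === false ? $html : $clean;
}

// Steps 3 and 4: apply the stylesheet with DOM and return the result.
function scrape(string $html, string $xslt): string
{
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $page = new DOMDocument();
    $page->loadHTML(tidy_html($html));
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sheet = new DOMDocument();
    $sheet->loadXML($xslt);

    $processor = new XSLTProcessor();
    $processor->importStylesheet($sheet);
    $result = $processor->transformToXml($page);
    if (!is_string($result)) {
        throw new RuntimeException('XSLT transformation failed');
    }
    return $result;
}

// Example stylesheet: print the href of every link, one per line.
$xslt = <<<'XSL'
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="//a[@href]">
      <xsl:value-of select="@href"/><xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
XSL;

// Inline sample with unclosed <p> tags, so the example needs no network.
$html = '<html><body><p><a href="/one">One</a><p><a href="/two">Two</a></body>';
echo scrape($html, $xslt);
```

For a live page you would pass fetch_page($url) to scrape() instead of the
inline sample. Since the selection logic lives in the stylesheet, a different
extraction only needs a different XSLT file, not different PHP.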


-- 
thebigdog

