If you go the Java HttpClient route, you will probably find TagSoup
(http://home.ccil.org/~cowan/XML/tagsoup/) helpful for parsing the
messy HTML that real sites tend to serve.
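
For what it's worth, here is a rough sketch of the fetch step: a form
POST with cookies kept between requests. To keep it runnable without
extra jars it uses the JDK's built-in java.net.http.HttpClient (Java
11+), not Jakarta HttpClient, so treat it as an illustration of the
idea rather than that library's API. The class name, helper names, and
form field names are all made up for the example.

```java
import java.net.CookieManager;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical example class; nothing here comes from the original thread.
class ScrapeSketch {

    // Encode form fields as application/x-www-form-urlencoded text.
    static String encodeForm(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            body.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    // Build a POST request that submits the given form fields to url.
    static HttpRequest formPost(String url, Map<String, String> fields) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(encodeForm(fields)))
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java ScrapeSketch <form-url>");
            return;
        }
        // The CookieManager stores session cookies and resends them
        // on later requests made through the same client.
        HttpClient client = HttpClient.newBuilder()
                .cookieHandler(new CookieManager())
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();

        // Assumed field names; a real form will have its own.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("q", "search terms");
        fields.put("page", "1");

        HttpResponse<String> response = client.send(
                formPost(args[0], fields),
                HttpResponse.BodyHandlers.ofString());
        System.out.println("status: " + response.statusCode());
        System.out.println("body length: " + response.body().length());
    }
}
```

The response body is where TagSoup would come in: as far as I know its
org.ccil.cowan.tagsoup.Parser is a SAX XMLReader, so you would hand it
the HTML and your own ContentHandler instead of a regular XML parser.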

Also, you might want to check out Open QA's Selenium
(http://www.openqa.org/selenium/).  It's intended to be used as a test
tool, but you might find it useful if faced with particularly nasty
JavaScript in the webpages you are intending to scrape.

Josh

On 10/16/06, Phillip Rhodes <[EMAIL PROTECTED]> wrote:
Owen Berry wrote:
> If you can write Perl code, take a look at LWP::UserAgent,
> HTML::TreeBuilder and HTTP::Cookies (if you need cookies) for it to
> work.  I've used this to bulk retrieve information off a website (with
> permission) using forms, cookies etc.
>
Or if Java appeals to you, take a look at Jakarta HTTPClient:

 <http://jakarta.apache.org/commons/httpclient/>


TTYL,

Phil
--
TriLUG mailing list        : http://www.trilug.org/mailman/listinfo/trilug
TriLUG Organizational FAQ  : http://trilug.org/faq/
TriLUG Member Services FAQ : http://members.trilug.org/services_faq/
