Re: [SLUG] Invision phpBB Site Content ripping

Daniel Pittman Tue, 06 Oct 2009 00:06:52 -0700

Amos Shapira <[email protected]> writes:
> 2009/10/6 Kyle <[email protected]>:
>> Hi Folks,
>>
>> how hard/easy would it be to get something written which could log onto one
>> IP.Board forum, crawl that site and download the content only, to import
>> into another IP.board db?
>>
>> So users, forums, threads, PM's, user galleries, etc.
>>
>> Assuming one doesn't have access to the DB from the original site.
>
> We used Perl WWW::Mechanize (http://search.cpan.org/dist/WWW-Mechanize/) to
> write up something similar to Forum Proxy Leacher. I'll try to get
> permission to release it.


If you do go down this path, I ♥ the HTML::TreeBuilder::XPath module[1], which
parses the HTML into a DOM-like structure and then lets you get at the content
with XPath expressions.

I find that extremely powerful for getting at the content in a meaningful
fashion, either through XPath queries or through the TreeBuilder
per-instance objects.
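
As a rough illustration of the idea, here is a minimal sketch. The HTML
snippet, the class names ("post", "author", "body") and the @posts structure
are all made up for the example; a real IP.Board page will use different
markup, so the XPath expressions would need adapting.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical forum markup, standing in for a page fetched by the crawler.
my $html = <<'HTML';
<html><body>
  <div class="post">
    <span class="author">alice</span>
    <div class="body">First post</div>
  </div>
  <div class="post">
    <span class="author">bob</span>
    <div class="body">A reply</div>
  </div>
</body></html>
HTML

# Parse the HTML into a tree of element objects.
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);

# Select every post, then query each matched node again with a
# relative XPath expression to pull out the individual fields.
my @posts;
for my $post ($tree->findnodes('//div[@class="post"]')) {
    push @posts, {
        author => $post->findvalue('./span[@class="author"]'),
        body   => $post->findvalue('./div[@class="body"]'),
    };
}

printf "%s: %s\n", $_->{author}, $_->{body} for @posts;

# Free the tree explicitly once the data has been extracted.
$tree->delete;
```

The same two-step pattern (one query for the repeating container, relative
queries for the fields inside it) should carry over to threads, user lists
and so on, assuming the pages are structured consistently.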

        Daniel

Footnotes: 
[1]  http://search.cpan.org/~mirod/HTML-TreeBuilder-XPath-0.11/

-- 
✣ Daniel Pittman            ✉ [email protected]            ☎ +61 401 155 707
               ♽ made with 100 percent post-consumer electrons
   Looking for work?  Love Perl?  In Melbourne, Australia?  We are hiring.
--
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html
