Re: [backstage] Screen Scraping Advice ...

2006-07-26 Thread Scot McSweeney-Roberts




Murray, Simon (IED) wrote:

  
  
  
  
  
  Some time ago I wrote a simple screen scrape script in classic ASP using the
Internet Transfer Protocol (InetCtls.Inet) which had its limitations. I'm interested
in using .Net and the HttpWebRequest class, but would welcome any
guidance on the subject, particularly when accessing data spanning
across multiple pages.
  
  


Again, not a .Net answer: I've successfully used Python with
Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/).

That said, there is a version of Python for .Net called IronPython,
so maybe you can get Beautiful Soup to work with that:

http://www.codeplex.com/Wiki/View.aspx?ProjectName=IronPython

cheers

Scot






Re: [backstage] Screen Scraping Advice ...

2006-07-26 Thread Matthew Somerville

Murray, Simon (IED) wrote:
I'm interested in using .Net and the HttpWebRequest class, 
but would welcome any guidance on the subject particularly when 
accessing data spanning across multiple pages.


http://www.crummy.com/software/BeautifulSoup/ might be useful? I've heard 
good things about it.


Our parser of Hansard (for which we have a licence, I should point out) has 
to cope with things spanning pages. It used to just look for the Next 
Section link and follow that until they stopped, but these are occasionally 
missing, so it now stores all the links from an index page, starts following 
Next Section links and hopefully works out what to do if one is missing.
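
The approach described above could be sketched roughly as follows. This is only an illustration in Python using the standard library, not the actual Hansard parser: the names LinkCollector, links_in and crawl_sections are made up here, the fetch argument stands in for whatever downloads a page, and the fallback rule (jump to the first index entry not yet visited) is an assumption about what "works out what to do" might mean.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def links_in(html):
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def crawl_sections(fetch, index_url, next_label="Next Section"):
    """Return section URLs in visiting order.

    fetch is any callable that maps a URL to its HTML text (hypothetical).
    """
    # Remember every section the index page lists, in order.
    expected = [href for href, _ in links_in(fetch(index_url))]
    visited = []
    current = expected[0] if expected else None
    while current is not None and current not in visited:
        visited.append(current)
        page_links = links_in(fetch(current))
        next_url = next((h for h, t in page_links if t == next_label), None)
        if next_url is None:
            # Next link missing: fall back to the first index entry not yet seen.
            next_url = next((h for h in expected if h not in visited), None)
        current = next_url
    return visited
```

Passing fetch in as a parameter means the same loop can be tried against a dictionary of saved pages before pointing it at a live site.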

--
ATB,
Matthew

-
Sent via the backstage.bbc.co.uk discussion group.  To unsubscribe, please 
visit http://backstage.bbc.co.uk/archives/2005/01/mailing_list.html.  
Unofficial list archive: http://www.mail-archive.com/backstage@lists.bbc.co.uk/


[backstage] Screen Scraping Advice ...

2006-07-24 Thread Murray, Simon (IED)




Hi All

Please forgive this off-topic post, but I am working on a project which
requires screen-scraping a variety of data from several third party
websites and, after countless hours on google, am looking to be pointed
in the right direction ... and I can't think of a more informed group of
individuals to ask for assistance (creep, creep).

Some time ago I wrote a simple screen scrape script in classic ASP using
the Internet Transfer Protocol (InetCtls.Inet) which had its
limitations. I'm interested in using .Net and the HttpWebRequest class,
but would welcome any guidance on the subject, particularly when
accessing data spanning across multiple pages.

Thanks in advance
Simon







This is not an offer (or solicitation of an offer) to buy/sell the securities/instruments mentioned or an official confirmation. Morgan Stanley may deal as principal in or own or act as market maker for securities/instruments mentioned or may advise the issuers. This is not research and is not from MS Research but it may refer to a research analyst/research report. Unless indicated, these views are the authors and may differ from those of Morgan Stanley research or others in the Firm. We do not represent this is accurate or complete and we may not update this. Past performance is not indicative of future returns. For additional information, research reports and important disclosures, contact me or see https://secure.ms.com/servlet/cls. You should not use e-mail to request, authorize or effect the purchase or sale of any security or instrument, to send transfer instructions, or to effect any other transactions. We cannot guarantee that any such requests received via e-mail will be processed in a timely manner. This communication is solely for the addressee(s) and may contain confidential information. We do not waive confidentiality by mistransmission. Contact me if you do not wish to receive these communications. In the UK, this communication is directed in the UK to those persons who are market counterparties or intermediate customers (as defined in the UK Financial Services Authoritys rules).


Re: [backstage] Screen Scraping Advice ...

2006-07-24 Thread Mario Menti
Not .Net, and I haven't used it in a while, but I used Perl's WWW-Mechanize (http://search.cpan.org/dist/WWW-Mechanize/) successfully in the past, going across multiple pages, form submits etc.

Just in case it helps.

Mario.

On 7/24/06, Murray, Simon (IED) [EMAIL PROTECTED] wrote:








Hi All

Please forgive this off-topic post, but I am working on a project which
requires screen-scraping a variety of data from several third party
websites and, after countless hours on google, am looking to be pointed
in the right direction ... and I can't think of a more informed group of
individuals to ask for assistance (creep, creep).

Some time ago I wrote a simple screen scrape script in classic ASP using
the Internet Transfer Protocol (InetCtls.Inet) which had its
limitations. I'm interested in using .Net and the HttpWebRequest class,
but would welcome any guidance on the subject, particularly when
accessing data spanning across multiple pages.

Thanks in advance
Simon












Re: [backstage] Screen Scraping Advice ...

2006-07-24 Thread Nic James Ferrier
Murray, Simon (IED) [EMAIL PROTECTED] writes:

 Hi All
  
 Please forgive this off-topic post, but I am working on a project which
 requires screen-scraping a variety of data from several third party
 websites and, after countless hours on google, am looking to be pointed
 in the right direction ... and I can't think of a more informed group of
 individuals to ask for assistance (creep, creep).
  
 Some time ago I wrote a simple screen scrape script in classic ASP using
 the Internet Transfer Protocol (InetCtls.Inet) which had its
 limitations. I'm interested in using .Net and the HttpWebRequest class,
 but would welcome any guidance on the subject particularly when
 accessing data spanning across multiple pages.

I've done quite a bit of screen scraping using just XSLT.

My favoured XSLT is libxslt which sits on top of libxml2 (mainly for
linux but you can get it for the other operating systems).

libxslt will massage HTML into well-formed XML. All XSLT engines
will allow you to directly read HTML over HTTP if you want to, and do
so inside the XSLT process, e.g. like this:

  <xsl:variable name="aunty" select="document('http://www.bbc.co.uk')"/>


One technique I use a lot is to XSLT a file into a CSV format and then
process it further with simple shell tools. This means identifying the
various fields that you want to form a record. Don't worry about
normalization - it doesn't matter for tasks like this.
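
To make the "fields that form a record" idea concrete, here is a rough sketch of the flattening step. Note that it swaps XSLT for Python's xml.etree.ElementTree purely for illustration, and that the sample markup, the class names and the rows_to_csv helper are all invented; it assumes the page has already been tidied into well-formed XML.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, already well-formed markup; element and class names are invented.
SAMPLE = """
<table>
  <tr class="programme"><td class="title">News</td><td class="channel">BBC One</td><td class="time">18:00</td></tr>
  <tr class="programme"><td class="title">Newsnight</td><td class="channel">BBC Two</td><td class="time">22:30</td></tr>
</table>
"""

FIELDS = ["title", "channel", "time"]


def rows_to_csv(xml_text, fields=FIELDS):
    """Flatten each programme row into one CSV record, one column per field."""
    root = ET.fromstring(xml_text.strip())
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(fields)
    for row in root.iter("tr"):
        if row.get("class") != "programme":
            continue
        record = []
        for name in fields:
            cell = row.find("td[@class='%s']" % name)
            # A missing field becomes an empty column rather than an error.
            record.append("" if cell is None or cell.text is None else cell.text.strip())
        writer.writerow(record)
    return out.getvalue()
```

Each row repeats whatever it needs (channel names and so on), which is the "don't worry about normalization" point: the output only has to be easy to feed to sort, cut, grep and friends.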


When I finally have a tool that I like I tend to then parcel it up
replacing the shell bits with a real programming language but
keeping the XSLT.


The most frustrating part of any screen scrape with XSLT is actually
finding the xpath of the point in the document where you're gonna
extract from.

This is hard.

It's best when the page provider has tagged the data you want with a
class or something. That actually happens quite a lot now because
people are using them for CSS. The designer's CSS gain can also be the
screen scraper's gain.

Indeed, as soon as things are classed semantically, I would argue that
it's not screen scraping anymore. It's just data processing.
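
One way to take some of the pain out of finding the path is to search a saved copy of the page for a value you can see on screen and print where it lives. The sketch below is an assumption-laden illustration, not part of any XSLT toolchain: paths_containing is a made-up helper, it uses Python's xml.etree.ElementTree, it expects well-formed input, and the paths it prints are only a starting point for writing a real xpath.

```python
import xml.etree.ElementTree as ET


def paths_containing(xml_text, needle):
    """Return a simple path for every element whose own text contains needle."""
    root = ET.fromstring(xml_text)
    found = []

    def step(elem):
        # Show the class attribute when there is one, since that is the handy hook.
        cls = elem.get("class")
        return "%s[@class='%s']" % (elem.tag, cls) if cls else elem.tag

    def walk(elem, prefix):
        path = prefix + "/" + step(elem)
        if elem.text and needle in elem.text:
            found.append(path)
        for child in elem:
            walk(child, path)

    walk(root, "")
    return found
```

When the provider has added classes, the printed path usually shortens to something like //span[@class='temp'], which tends to survive layout changes better than a long positional path.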



One other thing: I would definitely try and separate the processing
from the presentation. The processing can fail for all sorts of
reasons... you don't want requests for the presentation to be causing
failures all the time... you want to be in control of the scrape.
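
A minimal sketch of that separation, assuming a scheduled job does the scraping and the page-serving code only ever reads a cache file. The function names, the JSON file layout and the fetched_at field are all hypothetical choices made for this example.

```python
import json
import os
import tempfile
import time


def refresh_cache(scrape, cache_path):
    """Run scrape() and store its result; keep the old cache if it fails.

    Returns True when the cache was updated.
    """
    try:
        data = scrape()
    except Exception:
        # A failed scrape must not disturb what visitors are served.
        return False
    payload = {"fetched_at": time.time(), "data": data}
    directory = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(payload, handle)
    os.replace(tmp_path, cache_path)  # swap the new file in as a single step
    return True


def read_cache(cache_path):
    """Presentation side: only ever reads the last good result."""
    try:
        with open(cache_path) as handle:
            return json.load(handle)["data"]
    except FileNotFoundError:
        return None
```

refresh_cache would be called from cron or similar, and read_cache from the request handler, so a broken third-party page costs you freshness rather than uptime.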

Of course, that isn't always possible. I wrote a screen scrape login
authenticator for Natwest's online banking app. Clearly one wouldn't
want to cache anything with that  /8-


-- 
Nic Ferrier
http://www.tapsellferrier.co.uk   for all your tapsell ferrier needs


Re: [backstage] Screen Scraping Advice ...

2006-07-24 Thread dave miller

Hi Simon,

I know a talented designer/programmer who's really great at this:

Kalle Kormann [EMAIL PROTECTED]

He was a fellow student on the 'networked media' MA we both did last
year, and I'm sure he'd be happy to help you.

cheers, dave


-- Forwarded message --
From: Murray, Simon (IED) [EMAIL PROTECTED]
Date: 24-Jul-2006 16:48
Subject: [backstage] Screen Scraping Advice ...
To: backstage@lists.bbc.co.uk




Hi All

Please forgive this off-topic post, but I am working on a project which
requires screen-scraping a variety of data from several third party
websites and, after countless hours on google, am looking to be
pointed in the right direction ... and I can't think of a more
informed group of individuals to ask for assistance (creep, creep).

Some time ago I wrote a simple screen scrape script in classic ASP
using the Internet Transfer Protocol (InetCtls.Inet) which had its
limitations. I'm interested in using .Net and the HttpWebRequest
class, but would welcome any guidance on the subject, particularly when
accessing data spanning across multiple pages.

Thanks in advance
Simon







Re: [backstage] Screen Scraping Advice ...

2006-07-24 Thread Oliver Jackson

Mario Menti wrote:
Not .Net, and haven't used it in a while, but I used Perl's 
WWW-Mechanize (http://search.cpan.org/dist/WWW-Mechanize/) successfully 
in the past, going across multiple pages, form submits etc.


Just in case it helps..
Mario.

I concur with Mario on this one. I've used WWW-Mechanize to do some 
kerazy stuff involving submits, cookies, referrer spoofing etc. It's 
pretty good.
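
For anyone not using Perl, the same ingredients (form submits, cookies carried between requests, a chosen Referer header) can be sketched with Python's standard library. This is not the WWW-Mechanize API, just a rough equivalent; make_session and form_post are made-up helper names and the example.com URLs in the usage note are placeholders.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return an opener that remembers cookies between requests, plus its jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def form_post(url, fields, referer=None):
    """Build (but do not send) a form submission request."""
    body = urllib.parse.urlencode(fields).encode("ascii")
    request = urllib.request.Request(url, data=body)  # supplying data makes it a POST
    if referer:
        request.add_header("Referer", referer)
    return request
```

Sending it would then be a matter of opener.open(form_post("http://example.com/search", {"q": "hansard"}, referer="http://example.com/")), with any cookies the site sets being replayed on later requests through the same opener.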


Olly

--
http://ollyjackson.co.uk
