Re: [backstage] Screen Scraping Advice ...
Murray, Simon (IED) wrote:

Some time ago I wrote a simple screen scrape script in classic ASP using the Internet Transfer Control (InetCtls.Inet), which had its limitations. I'm interested in using .Net and the HttpWebRequest class, but would welcome any guidance on the subject, particularly when accessing data spanning across multiple pages.

Again, not a .Net answer: I've successfully used Python with Beautiful Soup - http://www.crummy.com/software/BeautifulSoup/

That said, there is a version of Python for .Net called IronPython, so maybe you can get Beautiful Soup to work with that: http://www.codeplex.com/Wiki/View.aspx?ProjectName=IronPython

cheers
Scot
Re: [backstage] Screen Scraping Advice ...
Murray, Simon (IED) wrote: I'm interested in using .Net and the HttpWebRequest class, but would welcome any guidance on the subject particularly when accessing data spanning across multiple pages. http://www.crummy.com/software/BeautifulSoup/ might be useful? I've heard good things about it. Our parser of Hansard (for which we have a licence, I should point out) has to cope with things spanning pages. It used to just look for the Next Section link and follow that until they stopped, but these are occasionally missing, so it now stores all the links from an index page, starts following Next Section links and hopefully works out what to do if one is missing. -- ATB, Matthew - Sent via the backstage.bbc.co.uk discussion group. To unsubscribe, please visit http://backstage.bbc.co.uk/archives/2005/01/mailing_list.html. Unofficial list archive: http://www.mail-archive.com/backstage@lists.bbc.co.uk/
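For anyone wanting to try the index-plus-Next-Section approach described above, here is a minimal sketch in Python using only the standard library. It is not the actual Hansard parser: the names crawl_sections, page_links and fetch, the "Next Section" label match, and the example.org pages are all assumptions made up for illustration. The fetch argument is any function that returns the HTML for a URL, so the example runs against an in-memory dictionary rather than a live site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from every <a href=...> on a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def page_links(html, base_url):
    """Return absolute (url, text) pairs for the links in one page."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return [(urljoin(base_url, href), text) for href, text in parser.links]


def crawl_sections(index_url, fetch, next_label="Next Section"):
    """Visit the sections in order and return the URLs visited.

    Follows "Next Section" links while they exist. When one is missing
    (or points back at a page already seen), resumes from the first
    index entry that has not been visited yet.
    """
    pending = [url for url, _ in page_links(fetch(index_url), index_url)]
    visited = []
    current = pending[0] if pending else None
    while current is not None:
        visited.append(current)
        links = page_links(fetch(current), current)
        nxt = next((url for url, text in links if text == next_label), None)
        if nxt is None or nxt in visited:
            # Fall back to the link list stored from the index page.
            nxt = next((url for url in pending if url not in visited), None)
        current = nxt
    return visited


# Hypothetical site: the second section has lost its "Next Section" link.
PAGES = {
    "http://example.org/index":
        '<a href="s1">One</a><a href="s2">Two</a><a href="s3">Three</a>',
    "http://example.org/s1": '<p>first</p><a href="s2">Next Section</a>',
    "http://example.org/s2": "<p>second, with the link missing</p>",
    "http://example.org/s3": "<p>third</p>",
}

order = crawl_sections("http://example.org/index", PAGES.__getitem__)
```

Against a real site, fetch would be replaced by something that performs the HTTP request; keeping it as a parameter also makes the fallback logic easy to test offline.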
[backstage] Screen Scraping Advice ...
Hi All

Please forgive this off-topic post, but I am working on a project which requires screen-scraping a variety of data from several third party websites and, after countless hours on Google, am looking to be pointed in the right direction ... and I can't think of a more informed group of individuals to ask for assistance (creep, creep).

Some time ago I wrote a simple screen scrape script in classic ASP using the Internet Transfer Control (InetCtls.Inet), which had its limitations. I'm interested in using .Net and the HttpWebRequest class, but would welcome any guidance on the subject, particularly when accessing data spanning across multiple pages.

Thanks in advance
Simon

This is not an offer (or solicitation of an offer) to buy/sell the securities/instruments mentioned or an official confirmation. Morgan Stanley may deal as principal in or own or act as market maker for securities/instruments mentioned or may advise the issuers. This is not research and is not from MS Research but it may refer to a research analyst/research report. Unless indicated, these views are the author's and may differ from those of Morgan Stanley research or others in the Firm. We do not represent this is accurate or complete and we may not update this. Past performance is not indicative of future returns. For additional information, research reports and important disclosures, contact me or see https://secure.ms.com/servlet/cls. You should not use e-mail to request, authorize or effect the purchase or sale of any security or instrument, to send transfer instructions, or to effect any other transactions. We cannot guarantee that any such requests received via e-mail will be processed in a timely manner. This communication is solely for the addressee(s) and may contain confidential information. We do not waive confidentiality by mistransmission. Contact me if you do not wish to receive these communications.
In the UK, this communication is directed in the UK to those persons who are market counterparties or intermediate customers (as defined in the UK Financial Services Authority's rules).
Re: [backstage] Screen Scraping Advice ...
Not .Net, and I haven't used it in a while, but I used Perl's WWW-Mechanize (http://search.cpan.org/dist/WWW-Mechanize/) successfully in the past, going across multiple pages, form submits etc.

Just in case it helps.

Mario.

On 7/24/06, Murray, Simon (IED) [EMAIL PROTECTED] wrote:

Hi All

Please forgive this off-topic post, but I am working on a project which requires screen-scraping a variety of data from several third party websites and, after countless hours on Google, am looking to be pointed in the right direction ... and I can't think of a more informed group of individuals to ask for assistance (creep, creep).

Some time ago I wrote a simple screen scrape script in classic ASP using the Internet Transfer Control (InetCtls.Inet), which had its limitations. I'm interested in using .Net and the HttpWebRequest class, but would welcome any guidance on the subject, particularly when accessing data spanning across multiple pages.

Thanks in advance
Simon
Re: [backstage] Screen Scraping Advice ...
Murray, Simon (IED) [EMAIL PROTECTED] writes:

Hi All

Please forgive this off-topic post, but I am working on a project which requires screen-scraping a variety of data from several third party websites and, after countless hours on Google, am looking to be pointed in the right direction ... and I can't think of a more informed group of individuals to ask for assistance (creep, creep).

Some time ago I wrote a simple screen scrape script in classic ASP using the Internet Transfer Control (InetCtls.Inet), which had its limitations. I'm interested in using .Net and the HttpWebRequest class, but would welcome any guidance on the subject, particularly when accessing data spanning across multiple pages.

I've done quite a bit of screen scraping using just XSLT. My favoured XSLT engine is libxslt, which sits on top of libxml2 (mainly for Linux, but you can get it for the other operating systems). libxslt can massage HTML into well-formed XML.

XSLT engines will let you read HTML directly over HTTP, inside the XSLT process, if you want to, like this:

    <xsl:variable name="aunty" select="document('http://www.bbc.co.uk')"/>

One technique I use a lot is to XSLT a file into a CSV format and then process it further with simple shell tools. This means identifying the various fields that you want to form a record. Don't worry about normalization - it doesn't matter for tasks like this.

When I finally have a tool that I like, I tend to parcel it up, replacing the shell bits with a real programming language but keeping the XSLT.

The most frustrating part of any screen scrape with XSLT is actually finding the XPath of the point in the document where you're going to extract from. This is hard. It's best when the page provider has tagged the data you want with a class or something. That actually happens quite a lot now because people are using classes for CSS. The designer's CSS gain can also be the screen scraper's gain.
Indeed, as soon as things are classed semantically, I would argue that it's not screen scraping anymore. It's just data processing.

One other thing: I would definitely try to separate the processing from the presentation. The processing can fail for all sorts of reasons... you don't want requests for the presentation to be causing failures all the time... you want to be in control of the scrape.

Of course, that isn't always possible. I wrote a screen scrape login authenticator for Natwest's online banking app. Clearly one wouldn't want to cache anything with that.

-- Nic Ferrier http://www.tapsellferrier.co.uk for all your tapsell ferrier needs
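To make the "class-tagged fields into CSV records" idea above concrete, here is a rough sketch. Note that it swaps Python's standard library in for XSLT, so it shows the shape of the technique rather than the libxslt workflow described in the message. The names ClassFieldExtractor and to_csv, the class names programme, title and channel, and the sample markup are all invented for illustration. It also assumes each field element contains plain text with no nested tags.

```python
import csv
import io
from html.parser import HTMLParser


class ClassFieldExtractor(HTMLParser):
    """Build one record per element carrying record_class.

    Inside a record, the text of any element whose class is one of
    field_classes becomes that field's value.
    """

    def __init__(self, record_class, field_classes):
        super().__init__()
        self.record_class = record_class
        self.field_classes = set(field_classes)
        self.records = []
        self._field = None  # name of the field currently being read

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.record_class in classes:
            self.records.append({})  # a new record starts here
        elif self.records and self._field is None:
            for name in classes:
                if name in self.field_classes:
                    self._field = name
                    self.records[-1][name] = ""
                    break

    def handle_data(self, data):
        if self._field is not None:
            self.records[-1][self._field] += data

    def handle_endtag(self, tag):
        # Simplification: a field ends at the first closing tag after it
        # opens, which holds only when fields contain no nested elements.
        self._field = None


def to_csv(html, record_class, field_classes):
    """Return CSV text with one row per record and one column per field."""
    parser = ClassFieldExtractor(record_class, field_classes)
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(field_classes),
                            lineterminator="\n")
    writer.writeheader()
    for record in parser.records:
        writer.writerow({key: value.strip() for key, value in record.items()})
    return out.getvalue()


# Made-up listing markup, tagged with classes the way a CSS author might.
SAMPLE = """
<ul>
  <li class="programme"><span class="title">News at Ten</span>
      <span class="channel">BBC One</span></li>
  <li class="programme"><span class="title">Newsnight</span>
      <span class="channel">BBC Two</span></li>
</ul>
"""

csv_text = to_csv(SAMPLE, "programme", ["title", "channel"])
```

Writing the CSV to a file on a schedule, and having the presentation side read only that file, is one simple way to keep the scrape separate from the pages that display its results.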
Re: [backstage] Screen Scraping Advice ...
Hi Simon,

I know a talented designer/programmer who's really great at this: Kalle Kormann [EMAIL PROTECTED]

He was a fellow student on the 'networked media' MA we both did last year, and I'm sure he'd be happy to help you.

cheers,
dave

-- Forwarded message --
From: Murray, Simon (IED) [EMAIL PROTECTED]
Date: 24-Jul-2006 16:48
Subject: [backstage] Screen Scraping Advice ...
To: backstage@lists.bbc.co.uk

Hi All

Please forgive this off-topic post, but I am working on a project which requires screen-scraping a variety of data from several third party websites and, after countless hours on Google, am looking to be pointed in the right direction ... and I can't think of a more informed group of individuals to ask for assistance (creep, creep).

Some time ago I wrote a simple screen scrape script in classic ASP using the Internet Transfer Control (InetCtls.Inet), which had its limitations. I'm interested in using .Net and the HttpWebRequest class, but would welcome any guidance on the subject, particularly when accessing data spanning across multiple pages.

Thanks in advance
Simon
Re: [backstage] Screen Scraping Advice ...
Mario Menti wrote:

Not .Net, and haven't used it in a while, but I used Perl's WWW-Mechanize (http://search.cpan.org/dist/WWW-Mechanize/) successfully in the past, going across multiple pages, form submits etc. Just in case it helps. Mario.

I concur with Mario on this one. I've used WWW-Mechanize to do some kerazy stuff involving submits, cookies, referrer spoofing etc. It's pretty good.

Olly
-- http://ollyjackson.co.uk
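For readers without Perl to hand, the three ingredients mentioned above (form submits, cookies and a chosen Referer header) can be sketched with Python's standard library. This is not WWW-Mechanize and does not reproduce its API; make_session, form_post, the example.org URLs and the form field names are assumptions made up for illustration. The request is only built here, not sent, so the sketch runs without network access.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def make_session():
    """Return (opener, jar): an opener that stores cookies it receives
    in jar and sends them back on later requests made through it."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def form_post(url, fields, referer):
    """Build, but do not send, a form submission with a Referer header.

    Supplying data makes urllib issue a POST instead of a GET.
    """
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Referer": referer,
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


opener, jar = make_session()
request = form_post(
    "http://example.org/search",        # hypothetical form target
    {"q": "hansard", "page": "2"},      # hypothetical form fields
    referer="http://example.org/",
)
```

Calling opener.open(request) would send it; because every request goes through the same opener, a login cookie set by one response is carried along to the pages that follow.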