Re: Website scraping - How can I load a 'partial' page?
Hmm, or use range as mentioned in my other mail. If the server supports
range requests, you can set your headers to include

    Range: bytes=0-2000

to get roughly the first 2000 bytes (the range is inclusive, so 0-2000 is
actually 2001 bytes), or use curl with -r 0-2000. But I have yet to find a
page that will return only a range.

Apparently you can find out if a page will accept ranges using curl with
something like this:

    curl -I http://i.imgur.com/z4d4kWk.jpg
    HTTP/1.1 200 OK
    ...
    Accept-Ranges: bytes
    Content-Length: 146515

If it has "Accept-Ranges: bytes" as part of the response, it should work.

I'm still thinking the intermediary method is best.

On Wed, Dec 13, 2017 at 8:39 AM, Mike Bonner wrote:

> I suppose one could use sockets and partial GET requests (using a Range:
> header), but I suspect it would be easier to just use an intermediary
> server to handle things. To test, I set up an extremely simple page with
> the following:
>
> <?lc
> put $_GET["page"] into tPage -- a GET request to my page, of the form ?page=http://url.goes.here
> put char 1 to 6000 of url tPage -- request the page to be scraped and return the first 6000 chars
> ?>
>
> To use this is a simple -- get URL "http://path.to.my.page.com/scrape.lc?page=http://server.to.scrape.com/pagetoscrape.html"
>
> If the page to be scraped uses a GET-style request, it might be better
> to use POST instead.
>
> In this way you can use a server on a hot connection to do the heavy
> lifting and then just send the results back down. In fact, you could
> probably have the server itself do the scraping and just return any final
> results (or pop the results into a database or whatever). Also, if you
> have enough control of the server and need to scrape the same page over
> and over for changes, you could most likely set up a cron job to do the
> work and a front end to pull the results. (Don't know what your final
> objective is, so hard to say what's best.)
>
> On Wed, Dec 13, 2017 at 6:39 AM, Roger Eller via use-livecode
> <use-livecode@lists.runrev.com> wrote:
>
>> I have a webpage that I grab with LiveCode, then parse out what I need.
>> The data I keep is within the first 1/4th of the page.
>>
>> Rather than loading the entire page into a variable or a browser object,
>> how can I load just the portion that I need and then stop the transmission
>> instead of wasting the time and bandwidth to load the entire page?
>>
>> ~Roger

___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode
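
A rough sketch of the range check described in the message above, wrapping the two curl calls in shell helpers. The function names accepts_byte_ranges and fetch_byte_range are made up for illustration; only curl -I and curl -r come from the thread. Treat the check as a hint: a server can honour ranges without advertising them, and a server that ignores the Range header will normally answer 200 with the whole page rather than 206 with the slice.

```shell
# Succeeds (exit 0) when the HTTP response headers on stdin contain
# "Accept-Ranges: bytes". Header names are case-insensitive and curl
# prints CRLF line endings, so the carriage returns are stripped first.
accepts_byte_ranges() {
    tr -d '\r' | grep -qiE '^accept-ranges:[[:space:]]*bytes[[:space:]]*$'
}

# Print the inclusive byte range FIRST-LAST of URL on stdout.
# For HTTP URLs, curl's -r option sends "Range: bytes=FIRST-LAST".
fetch_byte_range() {
    url=$1
    first=$2
    last=$3
    curl -s -r "${first}-${last}" "$url"
}

# Example (needs network access): first 2000 bytes, i.e. bytes 0..1999.
#   curl -sI http://i.imgur.com/z4d4kWk.jpg | accepts_byte_ranges \
#       && fetch_byte_range http://i.imgur.com/z4d4kWk.jpg 0 1999
```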
Re: Website scraping - How can I load a 'partial' page?
Hi Roger,

I don’t know whose webpage it is that you are scraping, but if it is a third
party’s webpage, make sure that you are not violating their terms of
agreement or infringing on their copyright. You might want to ask for their
permission to do so, to make sure you are safe and legal.

If it is your own webpage, then feel perfectly at ease scraping away.

Cheers,

Rick
Re: Website scraping - How can I load a 'partial' page?
I suppose one could use sockets and partial GET requests (using a Range:
header), but I suspect it would be easier to just use an intermediary
server to handle things. To test, I set up an extremely simple page with
the following:

    <?lc
    put $_GET["page"] into tPage -- a GET request to my page, of the form ?page=http://url.goes.here
    put char 1 to 6000 of url tPage -- request the page to be scraped and return the first 6000 chars
    ?>

To use this is a simple --

    get URL "http://path.to.my.page.com/scrape.lc?page=http://server.to.scrape.com/pagetoscrape.html"

If the page to be scraped uses a GET-style request, it might be better to
use POST instead.

In this way you can use a server on a hot connection to do the heavy
lifting and then just send the results back down. In fact, you could
probably have the server itself do the scraping and just return any final
results (or pop the results into a database or whatever). Also, if you have
enough control of the server and need to scrape the same page over and over
for changes, you could most likely set up a cron job to do the work and a
front end to pull the results. (Don't know what your final objective is, so
hard to say what's best.)
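
The cron-job variant mentioned above could look roughly like the sketch below, written in shell with curl rather than LiveCode. It is not the scrape.lc script from the message: store_if_changed and scrape_to_cache are hypothetical names, the 6000 default only mirrors the "6000 chars" in the message, and head -c counts bytes, not characters.

```shell
# Read new content from stdin and replace CACHE_FILE only when the content
# differs. Prints "changed", "unchanged", or "empty" (nothing was received,
# e.g. the download failed, so the old cache is left alone).
store_if_changed() {
    cache_file=$1
    tmp_file=$(mktemp) || return 1
    cat > "$tmp_file"
    if [ ! -s "$tmp_file" ]; then
        rm -f "$tmp_file"
        echo "empty"
        return 1
    fi
    if [ -f "$cache_file" ] && cmp -s "$tmp_file" "$cache_file"; then
        rm -f "$tmp_file"
        echo "unchanged"
    else
        # mktemp files are private to the owner; adjust permissions if a
        # front end running as another user has to read the cache.
        mv "$tmp_file" "$cache_file"
        echo "changed"
    fi
}

# Meant to be called from cron: keep the first LIMIT bytes of URL in CACHE_FILE.
scrape_to_cache() {
    url=$1
    cache_file=$2
    limit=${3:-6000}
    curl -s "$url" | head -c "$limit" | store_if_changed "$cache_file"
}
```

A front end then only has to serve the cache file, so the client never touches the scraped site directly.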
Website scraping - How can I load a 'partial' page?
I have a webpage that I grab with LiveCode, then parse out what I need.
The data I keep is within the first 1/4th of the page.

Rather than loading the entire page into a variable or a browser object,
how can I load just the portion that I need and then stop the transmission
instead of wasting the time and bandwidth to load the entire page?

~Roger
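
One way to approximate "load just the portion that I need and then stop", sketched in shell with curl rather than LiveCode: pipe the download through a filter that exits once it has seen enough. When the filter exits, curl's next write to the pipe fails and it aborts the transfer, although data already in flight or buffered may still be downloaded. The names read_until_marker and fetch_until_marker, and the idea of a marker string that ends the interesting part of the page, are assumptions for illustration.

```shell
# Copy stdin to stdout line by line and stop right after the first line
# that contains MARKER (a plain substring, not a regular expression).
read_until_marker() {
    MARKER=$1 awk '{ print } index($0, ENVIRON["MARKER"]) { exit }'
}

# Download URL, but stop once MARKER has been seen.
# -N turns off curl's output buffering so lines reach awk promptly.
fetch_until_marker() {
    curl -s -N "$1" | read_until_marker "$2"
}
```

This only helps when the page contains line breaks; for a page delivered as one long line, cutting by size with `curl -s URL | head -c 6000` is the simpler option.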