Good info, Alex. My only comment in response is that if I had 10 million lines of data to process, I wouldn't be doing it in memory; I'd probably be using an SQL database with an appropriate SELECT statement to get only the lines I wanted.
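
To make the SQL suggestion concrete, here is a minimal LiveCode sketch, assuming the data has already been loaded into an SQLite file. The table name "records", the column "status", the value 'keep' and the handler name wantedLines are all hypothetical, made up for illustration; the database calls are the standard revDB ones (revOpenDatabase, revDataFromQuery, revCloseDatabase).

```livecode
-- Sketch only: table "records" and column "status" are assumed names,
-- not anything from the original thread.
function wantedLines pDBPath
   local tConnID, tSQL, tResult

   -- revOpenDatabase returns an integer connection ID on success,
   -- or an error string on failure
   put revOpenDatabase("sqlite", pDBPath) into tConnID
   if tConnID is not an integer then return empty

   -- let the database do the filtering, so only the wanted rows
   -- are ever brought into a LiveCode variable
   put "SELECT * FROM records WHERE status = 'keep'" into tSQL
   put revDataFromQuery(tab, return, tConnID, tSQL) into tResult
   revCloseDatabase tConnID

   -- revDB reports query errors as a string starting with "revdberr"
   if tResult begins with "revdberr" then return empty

   -- tab-delimited columns, one row per line
   return tResult
end wantedLines
```

Called as something like "put wantedLines(tPath) into tVar", this leaves tVar holding only the selected rows, so neither the delete-in-place copying nor a second full-size variable comes into it.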
Horses for Courses as they say :-)

Pete
lcSQL Software <http://www.lcsql.com>
Home of lcStackBrowser <http://www.lcsql.com/lcstackbrowser.html> and
SQLiteAdmin <http://www.lcsql.com/sqliteadmin.html>

On Sat, Aug 30, 2014 at 3:07 PM, Alex Tweedly <a...@tweedly.net> wrote:

> Two comments ....
>
> 1. The two methods below produce different results - the latter one
> removes lines that were empty in the original input data, the other doesn't
> (unless that happens to match "data condition on line x").
>
> 2. In the case of large (or huge) datasets, Hugh's concern about avoiding
> memory overhead may be valid - but those cases are exactly the ones where
> you most need to be concerned about efficiency.
>
> If, for example, the input data consists of 10,000,000 lines and perhaps
> each line was 100 chars, then we have 1 GB of data. Let's say we retain 50%
> of the lines (say every alternate one) - then the "repeat for each" method
> adds 1/2 GB of virtual memory, and requires copying only 1/2 GB of data
> during this operation.
>
> However, although both the methods below don't add the virtual memory
> requirement, they copy a LOT of data - each time we delete (or empty) a
> line, that causes all the subsequent data to be copied 'in place', so it
> will require (approx) 2,500,000 GB of data copying (5M deletions * 1/2 GB
> average remaining data size). So we copy 2.5 PB of data - that's going to
> cost us a whole lot more time than any paging needed for 1/2 GB of extra
> virtual memory.
>
> -- Alex.
>
> On 30/08/2014 08:45, FlexibleLearning.com wrote:
>
>> Peter Haworth <p...@lcsql.com> wrote
>>
>>> There's another situation where I use repeat with even though it's a
>>> little slower than repeat for and I also alter the contents of the data
>>> I'm repeating through without any problems.
>>>
>>> repeat with x=the number of lines in tVar down to 1
>>>    if <data condition on line x of tVar> then
>>>       delete line x of tVar
>>>    end if
>>> end repeat
>>
>> This is an insightful observation. Nice one, Pete!
>>
>> My stock method (and presumably the method you allude to above) is...
>>
>> repeat for each line L in tVar
>>    add 1 to x
>>    if <data condition on L> then put "" into line x of tVar
>> end repeat
>> filter tVar without empty
>>
>> Both methods operate on a single data set and avoid putting the output
>> into a second variable which, for large datasets, involves an unnecessary
>> memory overhead.
>>
>> Hugh Senior
>> FLCo

_______________________________________________
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode