Re: which is faster for searching?

Peter Haworth Sat, 30 Aug 2014 15:18:46 -0700

Good info Alex.

My only comment in response is that if I had 10 million lines of data to
process, I wouldn't be doing it in memory, I'd probably be using an SQL
database with an appropriate SELECT statement to get only the lines I
wanted.


Horses for Courses as they say :-)

Pete
lcSQL Software <http://www.lcsql.com>
Home of lcStackBrowser <http://www.lcsql.com/lcstackbrowser.html> and
SQLiteAdmin <http://www.lcsql.com/sqliteadmin.html>


On Sat, Aug 30, 2014 at 3:07 PM, Alex Tweedly <a...@tweedly.net> wrote:

> Two comments ....
>
> 1. The two methods below produce different results - the latter one
> removes lines that were empty in the original input data, the other doesn't
> (unless that happens to match "data condition on line x").
>
> 2. In the case of large (or huge) datasets, Hugh's concern about avoiding
> memory overhead may be valid - but those cases are exactly the ones where
> you most need to be concerned about efficiency.
>
> If, for example, the input data consists of 10,000,000 lines and perhaps
> each line was 100 chars, then we have 1Gb of data. Let's say we retain 50%
> of the lines (say every alternate one) - then the "repeat for each" method
> adds 1/2 Gb of virtual memory, and requires copying only 1/2Gb of data
> during this operation.
>
> However, although both the methods below don't add the virtual memory
> requirement, they copy a LOT of data - each time we delete (or empty) a
> line, that causes all the subsequent data to be copied 'in place', so it
> will require (approx) 2,500,000 Gb of data copying (5M * 1/2Gb average
> remaining data size). So we copy 2.5Pb of data - that's going to cost us a
> whole lot more time than any paging needed for 1/2Gb of extra virtual
> memory.
>
> -- Alex.
>
>
>
> On 30/08/2014 08:45, FlexibleLearning.com wrote:
>
>> Peter Haworth <p...@lcsql.com> wrote
>>
>>  There's another situation where I use repeat with even though it's a
>>>
>> little
>>
>>> slower than repeat for and I also alter the contents of the data I'm
>>> repeating through without any problems.
>>>
>>> repeat with x=the number of lines in tVar down to to 1
>>>     if <data condition on  line x of tVar> then
>>>        delete line x of tVar
>>>     end if
>>> end repeat
>>>
>> This is an insightful observation. Nice one, Pete!
>>
>> My stock method (and presumably the method you allude to above) is...
>>
>> repeat for each line L in tVar
>>      add 1 to x
>>      if <data condition on  L> then put "" into line x of tVar
>> end repeat
>> filter tVar without empty
>>
>> Both methods operate on a single data set and avoid putting the output
>> into
>> a second variable which, for large datasets, involve an unnecessary memory
>> overhead..
>>
>> Hugh Senior
>> FLCo
>>
>>
>>
>> _______________________________________________
>> use-livecode mailing list
>> use-livecode@lists.runrev.com
>> Please visit this url to subscribe, unsubscribe and manage your
>> subscription preferences:
>> http://lists.runrev.com/mailman/listinfo/use-livecode
>>
>
>
> _______________________________________________
> use-livecode mailing list
> use-livecode@lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your
> subscription preferences:
> http://lists.runrev.com/mailman/listinfo/use-livecode
>
_______________________________________________
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode

Re: which is faster for searching?

Reply via email to