Re: save tuple of simple data types to disk (low memory foot print)

2011-10-29 Thread Gelonida N
On 10/29/2011 03:00 AM, Steven D'Aprano wrote:
> On Fri, 28 Oct 2011 22:47:42 +0200, Gelonida N wrote:
>
>> Hi,
>>
>> I would like to save many dicts with a fixed amount of keys tuples to a
>> file in a memory efficient manner (no random, but only sequential
>> access is required)
>
> What do you mean "keys tuples"?
Corrected phrase:
I would like to save many dicts with a fixed (and known) set of keys
in a memory-efficient manner (no random, but only sequential access is
required) to a file (which can later be sent over a slow, expensive
network to other machines)

Example:
Every dict will have the keys 'timestamp', 'floatvalue', 'intvalue',
'message1', 'message2'
'timestamp' is an integer
'floatvalue' is a float
'intvalue' is an int
'message1' is a string with a length of max 2000 characters, but can
often be very short
'message2' is the same as 'message1'

so a typical dict will look like
{ 'timestamp' : 12, 'floatvalue': 3.14159, 'intvalue': 42,
  'message1' : '', 'message2' : '=' * 1999 }



> What do you call many? Fifty? A thousand? A thousand million? How many
> items in each dict? Ten? A million?

File size can be between 100 KB and over 100 MB per file. Files will be
accumulated over months.

I just want to use the smallest possible space, as the data is collected
over a certain time (days / months) and will be transferred via UMTS /
EDGE / GSM networks, where transferring even quite small data sets
already takes several minutes.

I want to reduce the transfer time when requesting files on demand (and
the amount of data, in order not to exceed the monthly quota).



>> As the keys are the same for each entry I considered converting them to
>> tuples.
>
> I don't even understand what that means. You're going to convert the keys
> to tuples? What will that accomplish?

As the keys are the same for each entry I considered converting them
(the aforementioned dicts) to tuples.

so the dict { 'timestamp' : 12, 'floatvalue': 3.14159, 'intvalue': 42,
 'message1' : '', 'message2' : '=' * 1999 }

would become

( 12, 3.14159, 42, '', '=' * 1999 )
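
A minimal sketch of that conversion; the KEYS ordering here is my own
assumption, the only requirement being that writer and reader agree on it:

# fixed key order shared by writer and reader (assumed, not mandated)
KEYS = ('timestamp', 'floatvalue', 'intvalue', 'message1', 'message2')

def dict_to_tuple(d):
    # every dict is expected to have exactly these keys
    return tuple(d[k] for k in KEYS)

def tuple_to_dict(t):
    # inverse operation on the receiving side
    return dict(zip(KEYS, t))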
 
 
>> The tuples contain only strings, ints (long ints) and floats (double)
>> and the data types for each position within the tuple are fixed.
>>
>> The fastest and simplest way is to pickle the data or to use json. Both
>> formats however are not that optimal.
>
> How big are your JSON files? 10KB? 10MB? 10GB?
>
> Have you tried using pickle's space-efficient binary format instead of
> text format? Try using protocol=2 when you call pickle.Pickler.

No. This is probably already a big step forward.

As I know the data types of each element in the tuple I would however
prefer a representation which does not store the data types for each
tuple over and over again (as they are the same for each dict / tuple).

 
> Or have you considered simply compressing the files?

Compression makes sense, but the initial file format should already be
rather 'compact'.

 
>> I could store ints and floats with pack. As strings have variable length
>> I'm not sure how to save them efficiently (except adding a length first
>> and then the string).
>
> This isn't 1980 and you're very unlikely to be using 720KB floppies.
> Premature optimization is the root of all evil. Keep in mind that when
> you save a file to disk, even if it contains only a single bit of data,
> the actual space used will be an entire block, which on modern hard
> drives is very likely to be 4KB. Trying to compress files smaller than a
> single block doesn't actually save you any space.
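
For reference, a minimal sketch of that pack-based idea with
length-prefixed strings (Python 2, like the rest of this thread; the
'<qdq' layout and the 2-byte length prefix are my own assumptions, not a
fixed format):

import struct

# int64 timestamp, double floatvalue, int64 intvalue
FIXED = struct.Struct('<qdq')

def pack_record(timestamp, floatvalue, intvalue, message1, message2):
    parts = [FIXED.pack(timestamp, floatvalue, intvalue)]
    for s in (message1, message2):
        data = s.encode('utf-8')
        # 2-byte length prefix: enough for strings up to 65535 bytes
        parts.append(struct.pack('<H', len(data)))
        parts.append(data)
    return b''.join(parts)

def unpack_record(buf, offset=0):
    timestamp, floatvalue, intvalue = FIXED.unpack_from(buf, offset)
    offset += FIXED.size
    strings = []
    for _ in range(2):
        (n,) = struct.unpack_from('<H', buf, offset)
        offset += 2
        strings.append(buf[offset:offset + n].decode('utf-8'))
        offset += n
    # return the record plus the new offset, for sequential reading
    return (timestamp, floatvalue, intvalue) + tuple(strings), offset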

 
 
>> Is there already some 'standard' way or standard library to store such
>> data efficiently?
>
> Yes. Pickle and JSON plus zip or gzip.
 

pickle protocol-2 + gzip of the tuples derived from the dicts might be
good enough for a start.

I have to create a little more typical data in order to see what
percentage of my payload would consist of repeated type information for
each tuple.
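
A rough sketch of that combination, assuming Python 2 (one pickle per
record so the file can be read back sequentially; the file layout is my
own choice, not a given):

import gzip
import cPickle as pickle  # plain `pickle` on Python 3

def write_records(path, records):
    f = gzip.open(path, 'wb')
    try:
        for rec in records:
            pickle.dump(rec, f, 2)  # protocol 2: compact binary format
    finally:
        f.close()

def read_records(path):
    f = gzip.open(path, 'rb')
    try:
        while True:
            try:
                yield pickle.load(f)  # one record at a time
            except EOFError:
                break  # end of the pickle stream
    finally:
        f.close()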




-- 
http://mail.python.org/mailman/listinfo/python-list


Re: save tuple of simple data types to disk (low memory foot print)

2011-10-29 Thread Gelonida N
On 10/29/2011 01:08 AM, Roy Smith wrote:
> In article mailman.2293.1319834877.27778.python-l...@python.org,
>  Gelonida N gelon...@gmail.com wrote:
>
>> I would like to save many dicts with a fixed amount of keys
>> tuples to a file in a memory efficient manner (no random, but only
>> sequential access is required)
>
> There's two possible scenarios here. One, which you seem to be
> exploring, is to carefully study your data and figure out the best way
> to externalize it which reduces volume.
>
> The other is to just write it out in whatever form is most convenient
> (JSON is a reasonable thing to try first), and compress the output. Let
> the compression algorithms worry about extracting the entropy. You may
> be surprised at how well it works. It's also an easy experiment to try,
> so if it doesn't work well, at least it didn't cost you much to find out.


Yes, I have to make some more tests to see the difference between
just compressing a plain format (JSON / pickle) and compressing the
'optimized' representation.
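
A quick Python 2 harness along those lines; the sample data and the
compression level are arbitrary assumptions:

import json
import zlib
import cPickle as pickle  # plain `pickle` on Python 3

KEYS = ('timestamp', 'floatvalue', 'intvalue', 'message1', 'message2')
sample = [{'timestamp': 12, 'floatvalue': 3.14159, 'intvalue': 42,
           'message1': '', 'message2': '=' * 1999}] * 1000

as_json = json.dumps(sample)
as_tuples = pickle.dumps([tuple(d[k] for k in KEYS) for d in sample], 2)

print 'gzipped JSON:   %8d bytes' % len(zlib.compress(as_json, 9))
print 'gzipped pickle: %8d bytes' % len(zlib.compress(as_tuples, 9))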




-- 
http://mail.python.org/mailman/listinfo/python-list


Re: save tuple of simple data types to disk (low memory foot print)

2011-10-29 Thread Tim Chase

On 10/29/11 11:44, Gelonida N wrote:
> I would like to save many dicts with a fixed (and known) set of keys
> in a memory-efficient manner (no random, but only sequential access is
> required) to a file (which can later be sent over a slow, expensive
> network to other machines)
>
> Example:
> Every dict will have the keys 'timestamp', 'floatvalue', 'intvalue',
> 'message1', 'message2'
> 'timestamp' is an integer
> 'floatvalue' is a float
> 'intvalue' is an int
> 'message1' is a string with a length of max 2000 characters, but can
> often be very short
> 'message2' is the same as 'message1'
>
> so a typical dict will look like
> { 'timestamp' : 12, 'floatvalue': 3.14159, 'intvalue': 42,
>   'message1' : '', 'message2' : '=' * 1999 }
>
>> What do you call many? Fifty? A thousand? A thousand million? How many
>> items in each dict? Ten? A million?
>
> File size can be between 100 KB and over 100 MB per file. Files will be
> accumulated over months.


If Steven's pickle-protocol2 solution doesn't quite do what you 
need, you can do something like the code below.  Gzip is pretty 
good at addressing...



>> Or have you considered simply compressing the files?
>
> Compression makes sense, but the initial file format should already be
> rather 'compact'.


...by compressing out a lot of the duplicate aspects.  Which also 
mitigates some of the verbosity of CSV.


It serializes the data to a gzipped CSV file, then unserializes
it. Just point it at the appropriate data-source, and adjust the
column-names and data-types.


-tkc

from gzip import GzipFile
from csv import writer, reader

data = [ # use your real data here
    {
        'timestamp': 12,
        'floatvalue': 3.14159,
        'intvalue': 42,
        'message1': 'hello world',
        'message2': '=' * 1999,
        },
    ] * 1  # bump the multiplier to generate more sample rows

# write the rows out as gzipped CSV
f = GzipFile('data.gz', 'wb')
try:
    w = writer(f)
    for row in data:
        w.writerow([
            row[name] for name in (
                # use your real col-names here
                'timestamp',
                'floatvalue',
                'intvalue',
                'message1',
                'message2',
                )])
finally:
    f.close()

# read the rows back, converting each column to its data-type
output = []
for row in reader(GzipFile('data.gz')):
    d = dict(
        (name, conv(row[i]))
        for i, (conv, name) in enumerate((
            # adjust for your column-names/data-types
            (int, 'timestamp'),
            (float, 'floatvalue'),
            (int, 'intvalue'),
            (str, 'message1'),
            (str, 'message2'),
            )))
    output.append(d)

# or, as a list comprehension

output = [
    dict(
        (name, conv(row[i]))
        for i, (conv, name) in enumerate((
            # adjust for your column-names/data-types
            (int, 'timestamp'),
            (float, 'floatvalue'),
            (int, 'intvalue'),
            (str, 'message1'),
            (str, 'message2'),
            )))
    for row in reader(GzipFile('data.gz'))
    ]
--
http://mail.python.org/mailman/listinfo/python-list


Re: save tuple of simple data types to disk (low memory foot print)

2011-10-28 Thread Roy Smith
In article mailman.2293.1319834877.27778.python-l...@python.org,
 Gelonida N gelon...@gmail.com wrote:

> I would like to save many dicts with a fixed amount of keys
> tuples to a file in a memory efficient manner (no random, but only
> sequential access is required)

There's two possible scenarios here.  One, which you seem to be 
exploring, is to carefully study your data and figure out the best way 
to externalize it which reduces volume.

The other is to just write it out in whatever form is most convenient 
(JSON is a reasonable thing to try first), and compress the output.  Let 
the compression algorithms worry about extracting the entropy.  You may 
be surprised at how well it works.  It's also an easy experiment to try, 
so if it doesn't work well, at least it didn't cost you much to find out.
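
A minimal version of that experiment (Python 2; one JSON document per
line inside a gzip stream is my own framing choice, not a requirement):

import gzip
import json

def dump_json_gz(path, records):
    f = gzip.open(path, 'wb')
    try:
        for rec in records:
            f.write(json.dumps(rec) + '\n')  # one record per line
    finally:
        f.close()

def load_json_gz(path):
    f = gzip.open(path, 'rb')
    try:
        return [json.loads(line) for line in f]
    finally:
        f.close()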
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: save tuple of simple data types to disk (low memory foot print)

2011-10-28 Thread Steven D'Aprano
On Fri, 28 Oct 2011 22:47:42 +0200, Gelonida N wrote:

> Hi,
>
> I would like to save many dicts with a fixed amount of keys tuples to a
> file in a memory efficient manner (no random, but only sequential
> access is required)

What do you call many? Fifty? A thousand? A thousand million? How many 
items in each dict? Ten? A million?

What do you mean "keys tuples"?


> As the keys are the same for each entry I considered converting them to
> tuples.

I don't even understand what that means. You're going to convert the keys 
to tuples? What will that accomplish?


> The tuples contain only strings, ints (long ints) and floats (double)
> and the data types for each position within the tuple are fixed.
>
> The fastest and simplest way is to pickle the data or to use json. Both
> formats however are not that optimal.

How big are your JSON files? 10KB? 10MB? 10GB?

Have you tried using pickle's space-efficient binary format instead of 
text format? Try using protocol=2 when you call pickle.Pickler.

Or have you considered simply compressing the files?


> I could store ints and floats with pack. As strings have variable length
> I'm not sure how to save them efficiently (except adding a length first
> and then the string).

This isn't 1980 and you're very unlikely to be using 720KB floppies. 
Premature optimization is the root of all evil. Keep in mind that when 
you save a file to disk, even if it contains only a single bit of data, 
the actual space used will be an entire block, which on modern hard 
drives is very likely to be 4KB. Trying to compress files smaller than a 
single block doesn't actually save you any space.


> Is there already some 'standard' way or standard library to store such
> data efficiently?

Yes. Pickle and JSON plus zip or gzip.


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list