Re: save tuple of simple data types to disk (low memory foot print)
On 10/29/2011 03:00 AM, Steven D'Aprano wrote:
> On Fri, 28 Oct 2011 22:47:42 +0200, Gelonida N wrote:
>> Hi,
>> I would like to save many dicts with a fixed amount of keys tuples to a
>> file in a memory efficient manner (no random, but only sequential
>> access is required)
> What do you mean "keys tuples"?

Corrected phrase: I would like to save many dicts with a fixed (and known) set of keys in a memory-efficient manner (no random, only sequential access is required) to a file (which can later be sent over a slow, expensive network to other machines).

Example: Every dict will have the keys 'timestamp', 'floatvalue', 'intvalue', 'message1', 'message2'.
'timestamp' is an integer
'floatvalue' is a float
'intvalue' is an int
'message1' is a string with a length of at most 2000 characters, but can often be very short
'message2' is the same as message1

so a typical dict will look like

{ 'timestamp': 12, 'floatvalue': 3.14159, 'intvalue': 42, 'message1': '', 'message2': '=' * 1999 }

> What do you call many? Fifty? A thousand? A thousand million? How many
> items in each dict? Ten? A million?

File size can be between 100 KB and over 100 MB per file. Files will be accumulated over months.

I just want to use the smallest possible space, as the data is collected over a certain time (days / months) and will be transferred via UMTS / EDGE / GSM networks, where the transfer of even quite small data sets already takes several minutes. I want to reduce the transfer time when requesting files on demand (and the amount of data, in order not to exceed the monthly quota).

>> As the keys are the same for each entry I considered converting them to
>> tuples.
> I don't even understand what that means. You're going to convert the
> keys to tuples? What will that accomplish?

As the keys are the same for each entry I considered converting them (the aforementioned dicts) to tuples.
so the dict

{ 'timestamp': 12, 'floatvalue': 3.14159, 'intvalue': 42, 'message1': '', 'message2': '=' * 1999 }

would become

(12, 3.14159, 42, '', '=' * 1999)

>> The tuples contain only strings, ints (long ints) and floats (double)
>> and the data types for each position within the tuple are fixed.
>> The fastest and simplest way is to pickle the data or to use json. Both
>> formats however are not that optimal.
> How big are your JSON files? 10KB? 10MB? 10GB?
> Have you tried using pickle's space-efficient binary format instead of
> text format? Try using protocol=2 when you call pickle.Pickler.

No. This is probably already a big step forward. As I know the data type of each element in the tuple, I would however prefer a representation which does not store the data types for each tuple over and over again (as they are the same for every dict / tuple).

> Or have you considered simply compressing the files?

Compression makes sense, but the initial file format should already be rather 'compact'.

>> I could store ints and floats with pack. As strings have variable
>> length I'm not sure how to save them efficiently (except adding a
>> length first and then the string).
> This isn't 1980 and you're very unlikely to be using 720KB floppies.
> Premature optimization is the root of all evil. Keep in mind that when
> you save a file to disk, even if it contains only a single bit of data,
> the actual space used will be an entire block, which on modern hard
> drives is very likely to be 4KB. Trying to compress files smaller than
> a single block doesn't actually save you any space.
>> Is there already some 'standard' way or standard library to store such
>> data efficiently?
> Yes. Pickle and JSON plus zip or gzip.

pickle protocol 2 + gzip of the tuples derived from the dicts might be good enough for a start. I have to create a little more typical data in order to see what percentage of my payload would consist of repeated data-type information for each tuple.
--
http://mail.python.org/mailman/listinfo/python-list
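The pickle-protocol-2-plus-gzip combination discussed above could be sketched roughly as follows. This is a minimal sketch, not code from the thread; the file name, field order, and sample record are illustrative, taken from the example dict earlier in the discussion:

```python
import gzip
import pickle

# Fixed, known field order shared by every record.
FIELDS = ('timestamp', 'floatvalue', 'intvalue', 'message1', 'message2')

records = [
    {'timestamp': 12, 'floatvalue': 3.14159, 'intvalue': 42,
     'message1': '', 'message2': '=' * 1999},
]

# Write: one pickled tuple per record (keys dropped, order fixed),
# binary protocol 2, the whole stream gzip-compressed.
with gzip.open('data.pkl.gz', 'wb') as f:
    for d in records:
        pickle.dump(tuple(d[k] for k in FIELDS), f, protocol=2)

# Read back strictly sequentially until the stream is exhausted.
result = []
with gzip.open('data.pkl.gz', 'rb') as f:
    while True:
        try:
            result.append(dict(zip(FIELDS, pickle.load(f))))
        except EOFError:
            break
```

Because the keys are dropped before pickling and restored from the fixed field order on load, the key strings are not repeated per record; the remaining per-record pickle framing compresses well under gzip.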
Re: save tuple of simple data types to disk (low memory foot print)
On 10/29/2011 01:08 AM, Roy Smith wrote:
> In article <mailman.2293.1319834877.27778.python-l...@python.org>,
> Gelonida N <gelon...@gmail.com> wrote:
>> I would like to save many dicts with a fixed amount of keys tuples to a
>> file in a memory efficient manner (no random, but only sequential
>> access is required)
> There are two possible scenarios here. One, which you seem to be
> exploring, is to carefully study your data and figure out the best way
> to externalize it in a way that reduces volume.
> The other is to just write it out in whatever form is most convenient
> (JSON is a reasonable thing to try first), and compress the output. Let
> the compression algorithms worry about extracting the entropy. You may
> be surprised at how well it works. It's also an easy experiment to try,
> so if it doesn't work well, at least it didn't cost you much to find out.

Yes, I have to make some more tests to see the difference between compressing a plain format (JSON / pickle) and compressing the 'optimized' representation.
Re: save tuple of simple data types to disk (low memory foot print)
On 10/29/11 11:44, Gelonida N wrote:
> I would like to save many dicts with a fixed (and known) amount of keys
> in a memory efficient manner (no random, but only sequential access is
> required) to a file (which can later be sent over a slow expensive
> network to other machines)
> Example: Every dict will have the keys 'timestamp', 'floatvalue',
> 'intvalue', 'message1', 'message2'
> 'timestamp' is an integer
> 'floatvalue' is a float
> 'intvalue' an int
> 'message1' is a string with a length of max 2000 characters, but can
> often be very short
> 'message2' the same as message1
> so a typical dict will look like
> { 'timestamp': 12, 'floatvalue': 3.14159, 'intvalue': 42,
>   'message1': '', 'message2': '=' * 1999 }
>> What do you call many? Fifty? A thousand? A thousand million? How many
>> items in each dict? Ten? A million?
> File size can be between 100kb and over 100Mb per file. Files will be
> accumulated over months.

If Steven's pickle-protocol-2 solution doesn't quite do what you need, you can do something like the code below. Gzip is pretty good at addressing...

>> Or have you considered simply compressing the files?
> Compression makes sense but the initial file format should already be
> rather 'compact'

...by compressing out a lot of the duplicate aspects, which also mitigates some of the verbosity of CSV. The code serializes the data to a gzipped CSV file, then unserializes it.
Just point it at the appropriate data-source, adjust the column-names and data-types. -tkc

from gzip import GzipFile
from csv import writer, reader

data = [  # use your real data here
    {
        'timestamp': 12,
        'floatvalue': 3.14159,
        'intvalue': 42,
        'message1': 'hello world',
        'message2': '=' * 1999,
    },
] * 1

f = GzipFile('data.gz', 'wb')
try:
    w = writer(f)
    for row in data:
        w.writerow([row[name] for name in (
            # use your real col-names here
            'timestamp',
            'floatvalue',
            'intvalue',
            'message1',
            'message2',
        )])
finally:
    f.close()

output = []
for row in reader(GzipFile('data.gz')):
    d = dict(
        (name, f(row[i]))
        for i, (f, name) in enumerate((
            # adjust for your column-names/data-types
            (int, 'timestamp'),
            (float, 'floatvalue'),
            (int, 'intvalue'),
            (str, 'message1'),
            (str, 'message2'),
        ))
    )
    output.append(d)

# or

output = [
    dict(
        (name, f(row[i]))
        for i, (f, name) in enumerate((
            # adjust for your column-names/data-types
            (int, 'timestamp'),
            (float, 'floatvalue'),
            (int, 'intvalue'),
            (str, 'message1'),
            (str, 'message2'),
        ))
    )
    for row in reader(GzipFile('data.gz'))
]
Re: save tuple of simple data types to disk (low memory foot print)
In article <mailman.2293.1319834877.27778.python-l...@python.org>,
Gelonida N <gelon...@gmail.com> wrote:
> I would like to save many dicts with a fixed amount of keys tuples to a
> file in a memory efficient manner (no random, but only sequential
> access is required)

There are two possible scenarios here. One, which you seem to be exploring, is to carefully study your data and figure out the best way to externalize it in a way that reduces volume.

The other is to just write it out in whatever form is most convenient (JSON is a reasonable thing to try first), and compress the output. Let the compression algorithms worry about extracting the entropy. You may be surprised at how well it works. It's also an easy experiment to try, so if it doesn't work well, at least it didn't cost you much to find out.
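The "just write JSON and compress the output" experiment suggested above can be sketched in a few lines. This is a minimal sketch, not code from the thread; the file name and the one-object-per-line layout are assumptions chosen to keep access strictly sequential:

```python
import gzip
import json

records = [
    {'timestamp': 12, 'floatvalue': 3.14159, 'intvalue': 42,
     'message1': 'hello', 'message2': '=' * 1999},
]

# Write one JSON object per line, gzip-compressed. The repeated key
# strings in every record are exactly the kind of redundancy gzip
# removes almost for free.
with gzip.open('data.jsonl.gz', 'wt', encoding='utf-8') as f:
    for d in records:
        f.write(json.dumps(d) + '\n')

# Read back sequentially, one record per line.
with gzip.open('data.jsonl.gz', 'rt', encoding='utf-8') as f:
    loaded = [json.loads(line) for line in f]
```

Comparing the size of 'data.jsonl.gz' against a gzipped hand-optimized binary format on a sample of real data is the cheap experiment being recommended here.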
Re: save tuple of simple data types to disk (low memory foot print)
On Fri, 28 Oct 2011 22:47:42 +0200, Gelonida N wrote:
> Hi,
> I would like to save many dicts with a fixed amount of keys tuples to a
> file in a memory efficient manner (no random, but only sequential
> access is required)

What do you call many? Fifty? A thousand? A thousand million? How many items in each dict? Ten? A million?

What do you mean "keys tuples"?

> As the keys are the same for each entry I considered converting them to
> tuples.

I don't even understand what that means. You're going to convert the keys to tuples? What will that accomplish?

> The tuples contain only strings, ints (long ints) and floats (double)
> and the data types for each position within the tuple are fixed. The
> fastest and simplest way is to pickle the data or to use json. Both
> formats however are not that optimal.

How big are your JSON files? 10KB? 10MB? 10GB?

Have you tried using pickle's space-efficient binary format instead of text format? Try using protocol=2 when you call pickle.Pickler.

Or have you considered simply compressing the files?

> I could store ints and floats with pack. As strings have variable
> length I'm not sure how to save them efficiently (except adding a
> length first and then the string).

This isn't 1980 and you're very unlikely to be using 720KB floppies. Premature optimization is the root of all evil. Keep in mind that when you save a file to disk, even if it contains only a single bit of data, the actual space used will be an entire block, which on modern hard drives is very likely to be 4KB. Trying to compress files smaller than a single block doesn't actually save you any space.

> Is there already some 'standard' way or standard library to store such
> data efficiently?

Yes. Pickle and JSON plus zip or gzip.

--
Steven
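For reference, the hand-rolled format the original poster describes (fixed-size numbers via pack, variable-length strings stored as a length prefix followed by the bytes) could be sketched as below. The format codes are an assumption, not from the thread: little-endian 64-bit timestamp, double, 32-bit int, and a 16-bit length prefix per string (enough for 2000 characters even as multi-byte UTF-8):

```python
import struct

_HEAD = '<qdi'  # timestamp (int64), floatvalue (double), intvalue (int32)

def pack_record(timestamp, floatvalue, intvalue, message1, message2):
    """Pack one record: fixed-size numeric fields, then two
    length-prefixed UTF-8 strings."""
    b1 = message1.encode('utf-8')
    b2 = message2.encode('utf-8')
    return (struct.pack(_HEAD, timestamp, floatvalue, intvalue)
            + struct.pack('<H', len(b1)) + b1
            + struct.pack('<H', len(b2)) + b2)

def unpack_records(buf):
    """Sequentially decode packed records from a bytes buffer."""
    pos, head_size = 0, struct.calcsize(_HEAD)
    while pos < len(buf):
        timestamp, floatvalue, intvalue = struct.unpack_from(_HEAD, buf, pos)
        pos += head_size
        strings = []
        for _ in range(2):
            (n,) = struct.unpack_from('<H', buf, pos)
            pos += 2
            strings.append(buf[pos:pos + n].decode('utf-8'))
            pos += n
        yield (timestamp, floatvalue, intvalue, strings[0], strings[1])
```

This stores no type information per record at all, which is the property the original poster was after; whether it beats pickle protocol 2 plus gzip in practice is exactly what the measurement on real data would settle.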