Re: [Pytables-users] Some experiences with PyTables
On Dec 6, 2011, at 11:06 PM, Anthony Scopatz wrote:

> ...snip...
>
> 5. The reference manual for numpy contains _many_ small examples. They
> partially compensate for any lack of precision or excessive precision
> in the documents. Also, many people learn best from examples.
>
> If you would like to write up some additional examples or contribute to
> the docs in any way, *please* let me know. We would be ecstatic for
> your help!

Anthony, do we have a place in the sphinx docs where cookbook-like examples could just be thrown in? If not, could you set one up? That way someone could push us a PR with just the modified file.

~J.
Re: [Pytables-users] Some experiences with PyTables
2011/12/6 Anthony Scopatz <scop...@gmail.com>:

> This is a function of the underlying HDF5 storage mechanism and not
> explicitly PyTables. When storing fixed-length strings, the array of
> characters it is converted to *must* be exactly length N. When
> serializing a string of length M, HDF5 does the following:
>
> 1. M > N: truncate the string at N bytes (chop off the end).
> 2. M == N: do nothing.
> 3. M < N: pad the character array with N - M null characters to
>    achieve length N.
>
> Because of this technique, all trailing null characters are dropped
> when deserializing. This supports the much more common use case of
> storing shorter strings in a longer buffer but wanting to recover only
> the shorter version.

What you are saying is correct, except that the 'guilty party' for dropping the trailing null characters is NumPy, not HDF5. Look at this:

    In [27]: import numpy as np

    In [28]: np.array(['aaa'])
    Out[28]: array(['aaa'], dtype='|S3')

    In [29]: np.array(['aaa\x00\x00'])
    Out[29]: array(['aaa'], dtype='|S5')

Of course, this behaviour of NumPy was discussed a long time ago during its introduction (around NumPy 1.0 or so, back in 2006), and people (especially Travis) found it to be the most convenient for the majority of usages. If you are interested in getting the trailing bytes, you can always do:

    In [53]: a = np.array(['aaa\x00\x00'])

    In [54]: a[0]
    Out[54]: 'aaa'

    In [55]: ''.join([chr(i) for i in a.view('b')])
    Out[55]: 'aaa\x00\x00'

> If you wanted to append null bytes to the end of the string, you could
> always store the python length (M) in another column.

Sure, this is another option.

> 6. Suppose that the records (key, data1) and (key, data2) are two rows
> in a table, with (key, data1) being an earlier row than (key, data2).
> Both records have the same value in the first column. If a CSIndex is
> created using the first column, will (key, data1) still be before
> (key, data2) in the index? This property is called stability. Some
> sorting algorithms guarantee this and others don't. Are the sorts in
> PyTables stable?
> I am unsure about the stability of the sorts. I defer to Francesc here.

Nope, the internal sorts in PyTables use the 'quicksort' algorithm found in NumPy (the sort() method), and that algorithm is not stable. To achieve a stable sort one should use 'mergesort' instead, but 1) it is slower, and 2) it uses considerably more memory.

> 7. The table.append in PyTables behaves like extend in Python. Why?
>
> I am assuming you mean Python's list.extend() method. This is likely
> true because the use case of appending a single row is uncommon, so
> having another method for it is unnecessary. Also, I think you can get
> list.append() behavior out of the row interface.

Yep. I considered having two different 'append' and 'extend' methods to be a bit overlapping, so I decided to stick with just one.

> 8. I get a mysterious PerformanceWarning from the PyTables file
> table.py, line 2742. This message needs to be split into two messages.
> In my case, after I appended to a table, 'row' in self.__dict__ was
> True and self.row._getUnsavedNrows() was 1. To resolve the problem, I
> added a line that flushes the table after every append. Does
> h5file.mytable.flush() do something that h5file.flush() doesn't? Do I
> need to flush every table after every append, or are there only
> certain situations when this is needed? What does "preempted from
> alive nodes" mean?
>
> Flushing the file will flush the whole file and all nodes. Flushing a
> node will simply flush that node. The rules for how often you need to
> flush depend on how much memory you have, how expensive communication
> to the processor is, etc.

Exactly. What this warning is saying is "Hey, you are trying to write too much data without flushing; please use flush() from time to time." You can, however, enlarge your cache limits (see http://pytables.github.com/usersguide/parameter_files.html#cache-limits) if you don't like these messages (but do that at your own risk!).

> 9. Does the following code contain a bug in PyTables?
>
> Yes, there is a blosc error we are aware of.
Fortunately, it is not an error in Blosc itself, but in the Blosc HDF5 driver, which does not interact well with the 'fletcher32' filter. See:

    https://github.com/PyTables/PyTables/issues/21

Until this is fixed, the solution is to not use fletcher32 (do you need it for something special?).

Thanks a lot for your constructive feedback. Seriously, if there are things you would like to see changed in PyTables and you want to help out, we try to be very receptive. Pull requests on GitHub are a great way to get our attention! Also, please take some time to subscribe to the list if you want to avoid Josh resending your posts every time you write. Thanks.

-- Francesc Alted
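The stability question (point 6) can be sketched with NumPy alone, since PyTables delegates its sorting to NumPy. This is a minimal sketch, not PyTables code; the (key, data) rows are hypothetical:

```python
import numpy as np

# Hypothetical (key, data) rows; the keys contain duplicates, as in point 6.
keys = np.array([1, 0, 1, 0])
data = np.array([10, 20, 30, 40])

# 'mergesort' is guaranteed stable: rows with equal keys keep their
# original relative order after sorting.
idx = np.argsort(keys, kind='mergesort')
print(data[idx])   # [20 40 10 30] -- (0, 20) stays before (0, 40)

# 'quicksort' (the PyTables default per this thread) makes no such
# guarantee: rows with equal keys may come out in either order.
```

So with a CSIndex built on quicksort, equal-keyed rows may or may not keep their on-disk order; only a mergesort-based sort would guarantee it.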
Re: [Pytables-users] Some experiences with PyTables
2011/12/7 Francesc Alted <fal...@pytables.org>:

> What you are saying is correct, except that the 'guilty party' for
> dropping the trailing null characters is NumPy, not HDF5. Look at this:
>
> ...snip...
>
>     In [55]: ''.join([chr(i) for i in a.view('b')])
>     Out[55]: 'aaa\x00\x00'

Hmm, in this case the element to convert is a np.string_ and not an ndarray, but the solution for getting the trailing nulls is even easier:

    In [70]: a = np.string_('aaa\x00\x00')

    In [71]: a
    Out[71]: 'aaa'

    In [72]: a.data[:]
    Out[72]: 'aaa\x00\x00'

-- Francesc Alted
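Note that the `a.data[:]` trick above relies on Python 2's buffer protocol. As a minimal sketch in Python 3 / modern NumPy (not PyTables itself), the pad/truncate/strip rules and a portable way to recover the raw bytes look like this:

```python
import numpy as np

# The fixed-length (N = 5) string rules from this thread, using NumPy's
# 'S5' dtype as a stand-in for an HDF5 fixed-length string column.
buf = np.zeros(3, dtype='S5')
buf[0] = b'abcdefgh'   # M > N: truncated to N bytes on the way in
buf[1] = b'abcde'      # M == N: stored as-is
buf[2] = b'ab'         # M < N: padded with N - M null bytes

print(buf[0])          # b'abcde' -- the truncated string
print(buf[2])          # b'ab'    -- trailing nulls dropped on item access
print(buf.tobytes())   # the raw buffer still contains the padding nulls

# Portable replacement for the Python-2-only a.data[:] trick:
a = np.array([b'aaa\x00\x00'], dtype='S5')
print(a[0])            # b'aaa'
print(a.tobytes())     # b'aaa\x00\x00' -- raw bytes, nulls included
```

In other words, the nulls never leave the underlying buffer; only scalar extraction strips them, and `tobytes()` bypasses that.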
Re: [Pytables-users] Some experiences with PyTables
On Wed, Dec 7, 2011 at 5:51 AM, Josh Moore <josh.mo...@gmx.de> wrote:

> Anthony, do we have a place in the sphinx docs where cookbook-like
> examples could just be thrown in? If not, could you set one up? That
> way someone could push us a PR with just the modified file.

I don't think that there is anything like that right now, but I'll try to set something up along these lines.

Be Well
Anthony
Re: [Pytables-users] Some experiences with PyTables
Hello Edward, I'd like to respond point by point:

On Tue, Dec 6, 2011 at 2:54 PM, PyTables Org <pytab...@googlemail.com> wrote:

> 1. There seems to be an unpythonic design choice with the start, stop,
> step convention for PyTables. Anything that is unnatural to a Python
> programmer should be heavily documented.

Agreed in general. Do you have a specific example we could address?

> 2. There may be a bug in itersorted.

Yes, this looks like a bug. This deserves an issue on github...

> Here is code for (1) and (2):
>
>     #! /usr/bin/env python
>     import random, tables
>
>     h5file = tables.openFile('mytable.h5', mode='w')
>
>     class silly_class(tables.IsDescription):
>         num = tables.Int32Col(pos=0)
>
>     mytable = h5file.createTable(h5file.root, 'mytable', silly_class,
>                                  'a few ints', expectedrows=4)
>     row = mytable.row
>     for i in range(10):
>         row['num'] = random.randint(0, 99)
>         row.append()
>     mytable.flush()
>     mytable.cols.num.createCSIndex()
>
>     # Python's idiom for start, stop, step:
>     print 'Python:', range(9, -1, -1)
>     output = mytable.readSorted('num', start=0, stop=10, step=-1)
>     print 'readSorted:', 0, 10, -1, output
>
>     # copy supports a negative step. It seems that start and stop are
>     # applied _after_ the sort is done. Very unlike Python. Please
>     # document thoroughly.

We could certainly add some text to the docstring of Table.copy(). Still, I guess I am missing how this is 'wrong.' To the best of my knowledge, Python itself has no single function which both sorts and slices. (Please correct me if I am wrong ~_~.) When performing both operations, one needs to be done first. However, you are correct that this could be better documented.

>     print h5file.root.mytable[:]
>     h5file.root.mytable.copy(h5file.root, 'mytable2', sortby='num',
>                              start=0, stop=5, step=-1)
>     print h5file.root.mytable2[:]
>
>     # The following raises an OverflowError. The documentation (2.3.1)
>     # says negative steps are supported for itersorted. Documentation
>     # error or bug in itersorted?
>     output = [x['num'] for x in mytable.itersorted('num', start=0,
>                                                    stop=10, step=-1)]
>     print 'itersorted:', 0, 10, -1, output

> 3. Null bytes are stripped from the end of strings when they are
> stored in a table. Since a Python programmer does not expect this, it
> needs to be explicitly documented in all the relevant places. Here is
> some code:

This is a function of the underlying HDF5 storage mechanism and not explicitly PyTables. When storing fixed-length strings, the array of characters it is converted to *must* be exactly length N. When serializing a string of length M, HDF5 does the following:

1. M > N: truncate the string at N bytes (chop off the end).
2. M == N: do nothing.
3. M < N: pad the character array with N - M null characters to achieve length N.

Because of this technique, all trailing null characters are dropped when deserializing. This supports the much more common use case of storing shorter strings in a longer buffer but wanting to recover only the shorter version. If you wanted to append null bytes to the end of the string, you could always store the python length (M) in another column.

>     #! /usr/bin/env python
>     import tables
>
>     def hash2hex(stringin):
>         out = list()
>         for c in stringin:
>             s = hex(ord(c))[2:]
>             if len(s) == 1:
>                 s = '0' + s
>             out.append(s)
>         return ''.join(out)
>
>     h5file = tables.openFile('mytable.h5', mode='w')
>
>     class silly_class(tables.IsDescription):
>         astring = tables.StringCol(16, pos=0)
>
>     mytable = h5file.createTable(h5file.root, 'mytable', silly_class,
>                                  'a few strings', expectedrows=4)
>
>     # Problem when string ends with null bytes:
>     nasty = 'abdcef' + '\x00\x00'
>     print repr(nasty)
>     print hash2hex(nasty)
>
>     row = mytable.row
>     row['astring'] = nasty
>     row.append()
>     mytable.flush()
>     print repr(mytable[0][0])
>     print hash2hex(mytable[0][0])
>     h5file.close()

> 4. Has the 64K limit for attributes been lifted?

No, unfortunately. Once again, this is a compile-time parameter of HDF5.
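The "store the python length (M) in another column" suggestion for point 3 can be sketched with a NumPy structured array standing in for a PyTables table row (a minimal sketch; the field names are hypothetical):

```python
import numpy as np

# A 16-byte string column plus a column recording the original length M.
dt = np.dtype([('astring', 'S16'), ('length', 'i4')])
nasty = b'abdcef\x00\x00'
table = np.array([(nasty, len(nasty))], dtype=dt)

stored = table[0]['astring']     # b'abdcef' -- trailing nulls stripped
# Re-pad to the recorded length to recover the original string exactly:
restored = stored.ljust(int(table[0]['length']), b'\x00')
print(restored == nasty)         # True
```

Since only *trailing* nulls are stripped, re-padding to the recorded length is lossless for any string that fits in the column.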
You could change this value and recompile HDF5, but then any h5 file you create would not be portable to other versions of HDF5. Trust me, you are not the only one who wishes this were a run-time variable. (Still, there are good reasons for it being static, i.e. speed and size.)

> 5. The reference manual for numpy contains _many_ small examples. They
> partially compensate for any lack of precision or excessive precision
> in the documents. Also, many people learn best from examples.

If you would like to write up some additional examples or contribute to the docs in any way, *please* let me know. We would be ecstatic for your help!

> 6. Suppose that the records (key, data1) and (key, data2) are two rows
> in a table, with (key, data1) being an earlier row than (key, data2).
> Both records have the same value in the first column. If a CSIndex is
> created using the first column, will (key, data1) still be before
> (key, data2) in