Re: [Pytables-users] Some experiences with PyTables

2011-12-07 Thread Josh Moore

On Dec 6, 2011, at 11:06 PM, Anthony Scopatz wrote:

...snip...

 5. The reference manual for numpy contains _many_ small examples.  They
 partially compensate for any lack of precision or excessive precision in
 the documents.  Also many people learn best from examples.
 
 
 If you would like to write up some additional examples or contribute to the
 docs in any way, *please* let me know.  We would be ecstatic for your help!

Anthony,

do we have a place in the sphinx docs where cookbook-like examples could just 
be thrown in? If not, could you set one up? That way someone could push us a PR 
with just the modified file.

~J.

 



Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users


Re: [Pytables-users] Some experiences with PyTables

2011-12-07 Thread Francesc Alted
2011/12/6 Anthony Scopatz scop...@gmail.com

 This is a function of the underlying HDF5 storage mechanism and not
 explicitly PyTables.  When storing a fixed-length string, the array of
 characters it is converted to *must* be exactly length N.  When
 serializing a string of length M, HDF5 does the following:

 1. M > N: truncate the string at N bytes (chop off the end).
 2. M == N: do nothing.
 3. M < N: pad the character array with N - M null characters to achieve
    length N.

 Because of this technique, all trailing null characters are dropped when
 deserializing.  This supports the much more common use case of storing
 shorter strings in a longer buffer but wanting to recover only the shorter
 version.


What you are saying is correct, except that the culprit for dropping the
trailing null characters is NumPy, not HDF5.  Look at this:

In [27]: import numpy as np

In [28]: np.array(["aaa"])
Out[28]:
array(['aaa'],
      dtype='|S3')

In [29]: np.array(["aaa\x00\x00"])
Out[29]:
array(['aaa'],
      dtype='|S5')

Of course, this behaviour of NumPy was discussed a long time ago during its
introduction (around NumPy 1.0 or so, back in 2006), and people (especially
Travis) found this to be the most convenient for the majority of usages.
If you are interested in getting the trailing bytes, you can always do:

In [53]: a = np.array(["aaa\x00\x00"])

In [54]: a[0]
Out[54]: 'aaa'

In [55]: ''.join([chr(i) for i in a.view('b')])
Out[55]: 'aaa\x00\x00'



 If you wanted to keep null bytes at the end of the string, you could always
 store the Python length (M) in another column.


Sure, this is another option.


 6. Suppose that the records (key, data1) and (key, data2) are two rows in a
   table with (key, data1) being an earlier row than (key, data2).  Both
   records have the same value in the first column.  If a CSIndex is created
   using the first column, will (key, data1) still be before (key, data2) in
   the index?  This property is called stability.  Some sorting algorithms
   guarantee this and others don't.  Are the sorts in PyTables stable?

 I am unsure about the stability of the sorts.  I defer to Francesc here.


Nope, the internal sorts in PyTables use the 'quicksort' algorithm found in
NumPy (the sort() method), and that algorithm is not stable.  To achieve a
stable sort one should use 'mergesort' instead, but 1) it is slower and 2)
it uses considerably more memory.



 7. The table.append in PyTables behaves like extend in Python. Why?

 I am assuming you mean Python's list.extend() method.  This is likely true
 because the use case of appending a single row is uncommon and so having
 another method for it is unnecessary.  Also I think you can get
 list.append() behavior out of the row interface.


Yep.  I considered having two different 'append' and 'extend' methods to be
a bit overlapping, so I decided to stick with just one.
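For comparison, Python's own list API keeps the two operations separate; in
the PyTables analogy, a single-row append is simply a one-element batch.  A
trivial plain-Python sketch of that reading:

```python
# list.append adds one item; list.extend adds a batch.
# Table.append behaves like extend: a single row is a one-element batch.
buf = []
buf.extend([(1,), (2,)])  # batch of rows
buf.extend([(3,)])        # 'single append' as a one-element batch
print(buf)  # [(1,), (2,), (3,)]
```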



 8. I get a mysterious PerformanceWarning from the PyTables file
 table.py,
   line 2742. This message needs to be split into two messages.  In my
 case,
   after I appended to a table, 'row' in self.__dict__ was True and
   self.row._getUnsavedNrows() was 1.  To resolve the problem, I added a
   line that flushes the table after every append.  Does
   h5file.mytable.flush() do something that h5file.flush() doesn't?  Do I
   need to flush every table after every append or are there only certain
   situations when this is needed?  What does "preempted from alive nodes"
   mean?

 Flushing the file will flush the whole file and all nodes.  Flushing a
 node will simply flush that node.

 The rules for how often you need to flush depend on how much memory you
 have, how expensive
 communication to the processor is, etc.


Exactly.  What this warning is saying is "Hey, you are trying to write too
much data without flushing; please use flush() from time to time."  You
can, however, enlarge your cache limits (see
http://pytables.github.com/usersguide/parameter_files.html#cache-limits) if
you don't like these messages (but do that at your own risk!).
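The "flush from time to time" advice can be wrapped in a tiny helper.  This
is a generic sketch of the pattern only; the callbacks, class name, and the
batch size are assumptions for illustration, not PyTables API:

```python
class BatchedAppender:
    """Invoke flush_fn after every `batch` appends, instead of once per row.

    Mirrors the pattern of periodically calling table.flush() in PyTables.
    """
    def __init__(self, append_fn, flush_fn, batch=1000):
        self.append_fn = append_fn
        self.flush_fn = flush_fn
        self.batch = batch
        self.pending = 0

    def append(self, row):
        self.append_fn(row)
        self.pending += 1
        if self.pending >= self.batch:
            self.flush_fn()
            self.pending = 0

    def close(self):
        # Flush whatever is left before closing.
        if self.pending:
            self.flush_fn()
            self.pending = 0

# Demo with plain lists standing in for the table:
rows, flushes = [], []
w = BatchedAppender(rows.append, lambda: flushes.append(len(rows)), batch=3)
for i in range(7):
    w.append(i)
w.close()
print(flushes)  # [3, 6, 7]
```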



 9. Does the following code contain a bug in PyTables?


 Yes, there is a blosc error we are aware of.


Fortunately, it is not an error in Blosc itself, but in the Blosc HDF5
driver, which does not interact well with the 'fletcher32' filter.  See:

https://github.com/PyTables/PyTables/issues/21

Until this is fixed, the solution is to not use fletcher32 (do you need it
for something special?).



 Thanks a lot for your constructive feedback.  Seriously, if there are
 things which you would like to
 see changed in PyTables and want to help out, we try to be very receptive.
  Pull requests on github
 are a great way to get our attention!


Also, please take some time to subscribe to the list if you want to avoid
Josh resending your posts every time you write. Thanks.

-- 
Francesc Alted

Re: [Pytables-users] Some experiences with PyTables

2011-12-07 Thread Francesc Alted
2011/12/7 Francesc Alted fal...@pytables.org

 What you are saying is correct, except that the culprit for dropping the
 trailing null characters is NumPy, not HDF5.  Look at this:



 In [27]: import numpy as np

 In [28]: np.array(["aaa"])
 Out[28]:
 array(['aaa'],
       dtype='|S3')

 In [29]: np.array(["aaa\x00\x00"])
 Out[29]:
 array(['aaa'],
       dtype='|S5')

 Of course, this behaviour of NumPy was discussed a long time ago during its
 introduction (around NumPy 1.0 or so, back in 2006), and people (especially
 Travis) found this to be the most convenient for the majority of usages.
 If you are interested in getting the trailing bytes, you can always do:

 In [53]: a = np.array(["aaa\x00\x00"])

 In [54]: a[0]
 Out[54]: 'aaa'

 In [55]: ''.join([chr(i) for i in a.view('b')])
 Out[55]: 'aaa\x00\x00'


Hmm, in this case the element to convert is a np.string_ and not an
ndarray, but the solution to get the trailing nulls is even easier:

In [70]: a = np.string_('aaa\x00\x00')

In [71]: a
Out[71]: 'aaa'

In [72]: a.data[:]
Out[72]: 'aaa\x00\x00'

-- 
Francesc Alted


Re: [Pytables-users] Some experiences with PyTables

2011-12-07 Thread Anthony Scopatz
On Wed, Dec 7, 2011 at 5:51 AM, Josh Moore josh.mo...@gmx.de wrote:


 On Dec 6, 2011, at 11:06 PM, Anthony Scopatz wrote:

 ...snip...

  5. The reference manual for numpy contains _many_ small examples.  They
  partially compensate for any lack of precision or excessive precision in
  the documents.  Also many people learn best from examples.
 
 
  If you would like to write up some additional examples or contribute to
  the docs in any way, *please* let me know.  We would be ecstatic for your
  help!

 Anthony,

 do we have a place in the sphinx docs where cookbook-like examples could
 just be thrown in? If not, could you set one up? That way someone could
 push us a PR with just the modified file.


I don't think that there is anything like that right now, but I'll try to
set something up along these lines.

Be Well
Anthony



 ~J.

 







Re: [Pytables-users] Some experiences with PyTables

2011-12-06 Thread Anthony Scopatz
Hello Edward,

I'd like to respond point by point:

On Tue, Dec 6, 2011 at 2:54 PM, PyTables Org pytab...@googlemail.com wrote:

 1. There seems to be an unpythonic design choice with the start, stop, step
   convention for PyTables.  Anything that is unnatural to a Python
   programmer should be heavily documented.

 Agreed in general.  Do you have a specific example we could address?


 2. There may be a bug in itersorted.


Yes, this looks like a bug.  This deserves an issue on github...



 Here is code for (1) and (2):
 
 #! /usr/bin/env python

 import random, tables

 h5file = tables.openFile('mytable.h5', mode='w')

 class silly_class(tables.IsDescription):
num = tables.Int32Col(pos=0)

 mytable = h5file.createTable(h5file.root, 'mytable', silly_class,
 'a few ints', expectedrows=4)

 row = mytable.row
 for i in range(10):
row['num'] = random.randint(0, 99)
row.append()
 mytable.flush()
 mytable.cols.num.createCSIndex()

 # Python's idiom for start, stop, step:
 print 'Python:', range(9, -1, -1)

 output = mytable.readSorted('num', start=0, stop=10, step=-1)
 print 'readSorted:', 0, 10, -1, output

 # copy supports a negative step.  It seems that start and stop are applied
 # _after_ the sort is done.  Very unlike Python.  Please document
 # thoroughly.

We could certainly add some text to the docstring of Table.copy().  Still, I
guess I am missing how this is 'wrong.'  To the best of my knowledge, Python
itself has no single function which both sorts and slices.  (Please correct
me if I am wrong ~_~.)  When performing both operations, one needs to be
done first.  However, you are correct in that this could be better
documented.
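A plain-Python illustration of why the order of the two operations matters:
slicing after sorting and sorting after slicing give different answers.  The
data here is made up for the demo and says nothing about PyTables internals:

```python
data = [50, 20, 90, 10]
after = sorted(data)[0:3]   # slice applied after the sort
before = sorted(data[0:3])  # slice applied before the sort
print(after)   # [10, 20, 50]
print(before)  # [20, 50, 90]
```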


 print h5file.root.mytable[:]
 h5file.root.mytable.copy(h5file.root, 'mytable2', sortby='num',
 start=0, stop=5, step=-1)
 print h5file.root.mytable2[:]

 # The following raises an OverflowError.  The documentation (2.3.1) says
 # negative steps are supported for itersorted.  Documentation error or bug
 # in itersorted?
 output = [x['num'] for x in mytable.itersorted('num',
 start=0, stop=10, step=-1)]
 print 'itersorted:', 0, 10, -1, output
 
 3. Null bytes are stripped from the end of strings when they are stored in
   a table.  Since a Python programmer does not expect this, it needs to be
   explicitly documented in all the relevant places.  Here is some code:

This is a function of the underlying HDF5 storage mechanism and not
explicitly PyTables.  When storing a fixed-length string, the array of
characters it is converted to *must* be exactly length N.  When serializing
a string of length M, HDF5 does the following:

1. M > N: truncate the string at N bytes (chop off the end).
2. M == N: do nothing.
3. M < N: pad the character array with N - M null characters to achieve
   length N.

Because of this technique, all trailing null characters are dropped when
deserializing.  This supports the much more common use case of storing
shorter strings in a longer buffer but wanting to recover only the shorter
version.

If you wanted to keep null bytes at the end of the string, you could always
store the Python length (M) in another column.
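The three rules above, plus the store-the-length trick, can be sketched in
plain Python.  The function names are made up for the demo, and this only
mimics (does not call) the HDF5 behaviour:

```python
def pack_fixed(s, n):
    # M > N: truncate at n bytes; M < N: pad with NULs; M == N: unchanged.
    return s[:n].ljust(n, '\x00')

def unpack_fixed(buf, length=None):
    # Default read-back drops trailing NULs; if the original length M was
    # stored in another column, pass it to recover the string exactly.
    return buf[:length] if length is not None else buf.rstrip('\x00')

nasty = 'abcdef\x00\x00'
stored = pack_fixed(nasty, 16)
print(repr(unpack_fixed(stored)))              # 'abcdef'
print(repr(unpack_fixed(stored, len(nasty))))  # 'abcdef\x00\x00'
```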

  
 #! /usr/bin/env python

 import tables

 def hash2hex(stringin):
out = list()
for c in stringin:
s = hex(ord(c))[2:]
if len(s) == 1:
s = '0' + s
out.append(s)
return ''.join(out)

 h5file = tables.openFile('mytable.h5', mode='w')

 class silly_class(tables.IsDescription):
astring = tables.StringCol(16, pos=0)

 mytable = h5file.createTable(h5file.root, 'mytable', silly_class,
 'a few strings', expectedrows=4)

 # Problem when string ends with null bytes:
 nasty = 'abdcef' + '\x00\x00'
 print repr(nasty)
 print hash2hex(nasty)

 row = mytable.row
 row['astring'] = nasty
 row.append()
 mytable.flush()
 print repr(mytable[0][0])
 print hash2hex(mytable[0][0])
 h5file.close()
 
 4. Has the 64K limit for attributes been lifted?


No, unfortunately.  Once again, this is a compile-time parameter of HDF5.
You could change this value and recompile HDF5, but then any h5 file you
create would not be portable to other versions of HDF5.  Trust me, you are
not the only one who wishes this were a run-time variable.  (Still, there
are good reasons for it being static, i.e. speed and size.)
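One common workaround (sketched here as an assumption, not an official
PyTables feature) is to split a large value across several attributes and
rejoin it on read.  The chunk size below is an arbitrary safety margin:

```python
CHUNK = 64 * 1024 - 1024  # stay safely under HDF5's ~64 KB attribute limit

def split_attr(value, chunk=CHUNK):
    """Split a bytes value into pieces small enough for one attribute each."""
    return [value[i:i + chunk] for i in range(0, len(value), chunk)]

def join_attr(chunks):
    """Rejoin the pieces read back from the numbered attributes."""
    return b''.join(chunks)

big = b'x' * 200000
parts = split_attr(big)
print(len(parts), all(len(p) <= CHUNK for p in parts))  # 4 True
```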



 5. The reference manual for numpy contains _many_ small examples.  They
   partially compensate for any lack of precision or excessive precision in
   the documents.  Also many people learn best from examples.


If you would like to write up some additional examples or contribute to the
docs in any way, *please* let me know.  We would be ecstatic for your help!



 6. Suppose that the records (key, data1) and (key, data2) are two rows in a
   table with (key, data1) being an earlier row than (key, data2).  Both
   records have the same value in the first column.  If a CSIndex is created
   using the first column, will (key, data1) still be before (key, data2) in