Hello Giovanni,
Great to hear that everything is working much better for you now, and that
the result is much faster and smaller than NPY ;)
> Do you know how the default value is set btw?
This is computed via a magical heuristic algorithm written by Francesc (?)
called computechunksize() [1]. It is really optimized for dense data
(Tables), so it is not surprising that it performs poorly in your case. Any
updates you want to make to PyTables to also handle sparse data well out of
the box would be very welcome ;)
1. https://github.com/PyTables/PyTables/blob/develop/tables/idxutils.py#L54
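A quick way to see what that heuristic actually picks is to create a CArray
without an explicit chunkshape and inspect its .chunkshape attribute. A
minimal sketch (the file name and shape are just placeholders):

    import tables as tb

    # Create a CArray without specifying chunkshape and print the
    # default that PyTables computes for this shape/atom combination.
    with tb.open_file("default_chunks.h5", mode="w") as f:
        carr = f.create_carray(f.root, "mat", atom=tb.Float64Atom(),
                               shape=(10000, 10000))
        print(carr.chunkshape)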
On Mon, Jun 24, 2013 at 10:51 AM, Giovanni Luca Ciampaglia
<glciamp...@gmail.com> wrote:
Hi Anthony,
thanks for the explanation and the links, it's much clearer now. So without
compression a CArray is really a smarter type of sparse file, but you have
to set a sensible chunk shape. Do you know how the default value is set,
btw? I am asking because I didn't see any change in performance between
using the default value and using (1, N), where (N, N) is the shape of the
matrix. I guess that the write performance depends crucially on the size of
the I/O buffer, so the default must be choosing a similar setting.
Anyway, I have played a bit with other values of the chunk shape in
conjunction with the compression level, and using a shape of (1, 100) with
complevel=5 gives speeds that are only 10-15% slower than what I get with
shape=(1, 1) and complevel=0. The resulting file is 10 times smaller, and
something like 35 times smaller than a NPY sparse file, btw!
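For reference, the setup described above would look roughly like this with
the new API (the file name, node name, and N are placeholders):

    import tables as tb

    # Sketch of the settings described above: chunkshape=(1, 100)
    # with zlib compression at level 5.
    N = 10000
    with tb.open_file("matrix.h5", mode="w") as f:
        carr = f.create_carray(f.root, "mat", atom=tb.Float64Atom(),
                               shape=(N, N), chunkshape=(1, 100),
                               filters=tb.Filters(complevel=5,
                                                  complib="zlib"))
        # Writing a few nonzero entries touches only the chunks
        # they fall in.
        carr[0, 42] = 1.0
        carr[7, 123] = 1.0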
Thanks!
Giovanni
On 06/24/2013 05:25 AM, pytables-users-request@lists.sourceforge.net wrote:
Hi Giovanni!
I think that you may have some misunderstanding about how chunking works,
which is leading you to get terrible performance. In fact what you describe
is a great strategy (write all and zip) for using normal Arrays. However,
chunking and CArrays don't work like this. If a chunk contains no data, it
is not written at all! Also, all zipping takes place at the chunk level.
Thus for very small chunks you can actually increase the file size and
access time by using compression.
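A small sketch of that point: a CArray whose dense size would be tens of
gigabytes stays tiny on disk when only one chunk is ever written (the file
name is a placeholder):

    import os
    import tables as tb

    # Only chunks that actually receive data are allocated in the
    # file, so this nominally ~80 GB array produces a file of only
    # a few KiB.
    with tb.open_file("one_element.h5", mode="w") as f:
        carr = f.create_carray(f.root, "mat", atom=tb.Float64Atom(),
                               shape=(100000, 100000))
        carr[5, 5] = 1.0
    print(os.path.getsize("one_element.h5"), "bytes on disk")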
For sparse matrices and CArrays, you need to play around with the
chunkshape argument to create_carray() and with compression. Performance is
going to be affected by how dense the matrix is and how grouped it is. For
example, for a very dense and randomly distributed matrix, chunkshape=1 and
no compression is best. For block diagonal matrices, the chunkshape should
be the nominal block shape. Compression is only useful here if the blocks
all have similar values or the block shape is large. For example
1 1 0 0 0 0
1 1 0 0 0 0
0 0 1 1 0 0
0 0 1 1 0 0
0 0 0 0 1 1
0 0 0 0 1 1
is well suited to chunkshape=(2, 2).
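A sketch of that block-diagonal case (file and node names are
placeholders):

    import numpy as np
    import tables as tb

    # The chunks line up exactly with the 2x2 blocks, so every chunk
    # that gets written is fully dense and compresses well.
    block = np.ones((2, 2))
    with tb.open_file("blockdiag.h5", mode="w") as f:
        carr = f.create_carray(f.root, "mat", atom=tb.Float64Atom(),
                               shape=(6, 6), chunkshape=(2, 2))
        for i in range(0, 6, 2):
            carr[i:i + 2, i:i + 2] = block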
For more information on the HDF5 model, please see my talk slides and video
[1,2]. I hope this helps.
Be Well
Anthony
PS. Glad to see you using the new API
1. https://github.com/scopatz/hdf5-is-for-lovers
2. http://www.youtube.com/watch?v=Nzx0HAd3FiI
--
Giovanni Luca Ciampaglia
Postdoctoral fellow
Center for Complex Networks and Systems Research
Indiana University
✎ 910 E 10th St ∙ Bloomington ∙ IN 47408
☞ http://cnets.indiana.edu/
✉ gciam...@indiana.edu