Re: [Pytables-users] Speed of CArray writing sparse matrices

2013-06-24 Thread Anthony Scopatz
Hello Giovanni,

Great to hear that everything is working much better for you now and that
everything is much faster and smaller than NPY ;)

Do you know how the default value is set btw?


This is computed via a magical heuristic algorithm written by Francesc (?)
called computechunksize() [1].  This is really optimized for dense data
(Tables), so it is not surprising that it performs poorly in your case.  Any
updates you want to make to PyTables to also handle sparse data well out of
the box would be very welcome ;)
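
If you want to override that heuristic, you can pass chunkshape (and
optionally a Filters instance) to create_carray() yourself.  A minimal
sketch, with made-up file/node names and settings you would tune to your
data:

    import tables as tb

    with tb.open_file("adjacency.h5", "w") as f:        # illustrative file name
        ca = f.create_carray(f.root, "adj", atom=tb.UInt8Atom(),
                             shape=(3400000, 3400000),
                             chunkshape=(1, 100),        # overrides the computed default
                             filters=tb.Filters(complevel=5, complib="zlib"))
        print(ca.chunkshape)                             # the chunk shape actually in effect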

1. https://github.com/PyTables/PyTables/blob/develop/tables/idxutils.py#L54



On Mon, Jun 24, 2013 at 10:51 AM, Giovanni Luca Ciampaglia glciamp...@gmail.com wrote:

 Hi Anthony,

 thanks for the explanation and the links, it's much clearer now. So without
 compression a CArray is really a smarter type of sparse file, but you have to
 set a sensible chunk shape. Do you know how the default value is set, btw? I
 am asking because I didn't see any change in performance between using the
 default value and using (1, N), where (N, N) is the shape of the matrix. I
 guess that the write performance depends crucially on the size of the I/O
 buffer, so the default must be choosing a similar setting.

 Anyway, I have played a bit with other values of the chunk shape in
 conjunction with the compression level, and using a shape of (1, 100) and
 complevel=5 gives speeds that are only 10-15% slower than what I get with
 shape=(1, 1) and complevel=0. The resulting file is 10 times smaller, and
 something like 35 times smaller than an NPY sparse file, btw!
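
 A rough sketch of the kind of comparison I ran (illustrative only, not my
 actual script; it just writes random COO triples one element at a time and
 reports the elapsed time and resulting file size):

     import os
     import time
     import numpy as np
     import tables as tb

     def write_sparse(path, rows, cols, vals, n, chunkshape, complevel):
         # write the COO triples one element at a time into a CArray
         filters = tb.Filters(complevel=complevel, complib="zlib")
         with tb.open_file(path, "w") as f:
             ca = f.create_carray(f.root, "adj", atom=tb.UInt8Atom(),
                                  shape=(n, n), chunkshape=chunkshape,
                                  filters=filters)
             t0 = time.time()
             for r, c, v in zip(rows, cols, vals):
                 ca[r, c] = v
             elapsed = time.time() - t0
         return elapsed, os.path.getsize(path)

     N, NNZ = 6000, 17000
     np.random.seed(0)
     rows = np.random.randint(0, N, NNZ)
     cols = np.random.randint(0, N, NNZ)
     vals = np.ones(NNZ, dtype=np.uint8)

     print(write_sparse("plain.h5",  rows, cols, vals, N, (1, 1),   0))
     print(write_sparse("zipped.h5", rows, cols, vals, N, (1, 100), 5))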

 Thanks!

 Giovanni

 On 06/24/2013 05:25 AM, pytables-users-request@lists.sourceforge.net wrote:
  Hi Giovanni!
 
  I think that you may have some misunderstanding about how chunking works,
  which is leading you to get terrible performance.  In fact what you
  describe is a great strategy (write all and zip) for using normal Arrays.

  However, chunking and CArrays don't work like this.  If a chunk contains no
  data, it is not written at all!  Also, all zipping takes place on the chunk
  level.  Thus for very small chunks you can actually increase the file size
  and access time by using compression.

  For sparse matrices and CArrays, you need to play around with the
  chunkshape argument to create_carray() and compression.  Performance is
  going to be affected by how dense the matrix is and how grouped it is.  For
  example, for a very dense and randomly distributed matrix, chunkshape=1 and
  no compression is best.  For block diagonal matrices, the chunkshape should
  be the nominal block shape.  Compression is only useful here if the blocks
  all have similar values or the block shape is large.  For example
 
  1 1 0 0 0 0
  1 1 0 0 0 0
  0 0 1 1 0 0
  0 0 1 1 0 0
  0 0 0 0 1 1
  0 0 0 0 1 1
 
  is well suited to a chunkshape=(2, 2)
 
  For more information on the HDF model please see my talk slides and video
  [1,2]  I hope this helps.
 
  Be Well
  Anthony
 
  PS. Glad to see you using the new API
 
  1. https://github.com/scopatz/hdf5-is-for-lovers
  2. http://www.youtube.com/watch?v=Nzx0HAd3FiI


 --
 Giovanni Luca Ciampaglia

 Postdoctoral fellow
 Center for Complex Networks and Systems Research
 Indiana University

 ✎ 910 E 10th St ∙ Bloomington ∙ IN 47408
 ☞ http://cnets.indiana.edu/
 ✉ gciam...@indiana.edu






Re: [Pytables-users] Speed of CArray writing sparse matrices

2013-06-23 Thread Anthony Scopatz
Hi Giovanni!

I think that you may have some misunderstanding about how chunking works,
which is leading you to get terrible performance.  In fact what you
describe is a great strategy (write all and zip) for using normal Arrays.

However, chunking and CArrays don't work like this.  If a chunk contains no
data, it is not written at all!  Also, all zipping takes place on the chunk
level.  Thus for very small chunks you can actually increase the file size
and access time by using compression.

For sparse matrices and CArrays, you need to play around with the
chunkshape argument to create_carray() and compression.  Performance is
going to be affected by how dense the matrix is and how grouped it is.  For
example, for a very dense and randomly distributed matrix, chunkshape=1 and
no compression is best.  For block diagonal matrices, the chunkshape should
be the nominal block shape.  Compression is only useful here if the blocks
all have similar values or the block shape is large.  For example

1 1 0 0 0 0
1 1 0 0 0 0
0 0 1 1 0 0
0 0 1 1 0 0
0 0 0 0 1 1
0 0 0 0 1 1

is well suited to a chunkshape=(2, 2)
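
As a very rough sketch (file/node names and filter settings here are made up,
not a recommendation), writing that block diagonal matrix with matching chunks
could look like:

    import numpy as np
    import tables as tb

    block = np.ones((2, 2), dtype=np.uint8)

    with tb.open_file("block_diag.h5", "w") as f:
        ca = f.create_carray(f.root, "mat", atom=tb.UInt8Atom(),
                             shape=(6, 6), chunkshape=(2, 2),
                             filters=tb.Filters(complevel=1, complib="zlib"))
        # each 2x2 block lands exactly on one chunk, so the all-zero chunks
        # off the diagonal are never written to disk
        for i in range(0, 6, 2):
            ca[i:i + 2, i:i + 2] = block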

For more information on the HDF model please see my talk slides and video :)
[1,2]  I hope this helps.

Be Well
Anthony

PS. Glad to see you using the new API ;)

1. https://github.com/scopatz/hdf5-is-for-lovers
2. http://www.youtube.com/watch?v=Nzx0HAd3FiI


On Sat, Jun 22, 2013 at 6:34 PM, Giovanni Luca Ciampaglia glciamp...@gmail.com wrote:

 Hi all,

 I have a sparse 3.4M x 3.4M adjacency matrix with nnz = 23M and wanted
 to see if CArray was an appropriate solution for storing it. Right now I
 am using the NumPy binary format for storing the data in coordinate
 format and loading the matrix with Scipy's sparse coo_matrix class. As
 far as I understand, with CArray the matrix would be written in full
 (zeros included), but a) since it's chunked, accessing it does not take
 memory, and b) with compression enabled it would be possible to keep the
 size of the file reasonable.
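
 (Roughly, my current approach looks like the sketch below; the file names
 and exact layout are just illustrative, not my real script.)

     import numpy as np
     from scipy.sparse import coo_matrix

     N = 3400000
     # the three coordinate arrays; tiny stand-ins here
     rows = np.array([0, 1, 5])
     cols = np.array([2, 0, 5])
     data = np.ones(3, dtype=np.uint8)

     # store the COO triples in NumPy binary format ...
     np.savez("adjacency_coo.npz", rows=rows, cols=cols, data=data)

     # ... and rebuild the sparse matrix on load
     npz = np.load("adjacency_coo.npz")
     A = coo_matrix((npz["data"], (npz["rows"], npz["cols"])), shape=(N, N))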

 If my assumptions are correct, then here is my problem: I am running into
 trouble when writing the CArray to disk. I adapted the example
 from the documentation [1] and when I run the code on a 6000x6000 matrix
 with nnz = 17K I achieve a decent speed of roughly 4100 elements/s.
 However, when I try it on the full matrix the writing speed drops to 4
 elements/s. Am I doing something wrong? Any feedback would be greatly
 appreciated!

 Code: https://gist.github.com/junkieDolphin/5843064

 Cheers,

 Giovanni

 [1] http://pytables.github.io/usersguide/libref/homogenous_storage.html#the-carray-class

 --
 Giovanni Luca Ciampaglia

 ☞ http://www.inf.usi.ch/phd/ciampaglia/
 ✆ (812) 287-3471
 ✉ glciamp...@gmail.com



