[Pytables-users] Nested Iteration of HDF5 using PyTables

2013-01-03 Thread David Reed
I was hoping someone could help me out here.

This is from a post I put up on StackOverflow:

I have a fairly large dataset that I store in HDF5 and access using
PyTables. One operation I need to do on this dataset is a pairwise
comparison between each of the elements. This requires 2 loops, one to
iterate over each element, and an inner loop to iterate over every other
element. This operation thus looks at N(N-1)/2 comparisons.
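
As a rough illustration of the N(N-1)/2 count, here is a minimal sketch that uses a plain Python list as a stand-in for the dataset (the names `items` and `count_pairs` are made up for this example, and it is written for Python 3 rather than the Python 2 used in the rest of the thread). It checks that the `ii < jj` double loop visits the same pairs as `itertools.combinations`.

```python
from itertools import combinations


def count_pairs(n):
    """Number of unordered pairs among n elements: n(n-1)/2."""
    return n * (n - 1) // 2


# Hypothetical stand-in for the rows of the dataset.
items = ["a", "b", "c", "d"]
n = len(items)

# The double loop from the post: every ii paired with every later jj.
nested = [(items[ii], items[jj]) for ii in range(n) for jj in range(ii + 1, n)]

# The same pairs, produced by the standard library.
pairs = list(combinations(items, 2))

print(len(pairs), count_pairs(n))
```

Both lists contain the same pairs in the same order, so either form can drive the comparison loop.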

For fairly small sets I found it to be faster to dump the contents into a
multidimensional numpy array and then do my iteration. I run into problems
with large sets because of memory issues and need to access each element of
the dataset at run time.

Putting the elements into an array gives me about 600 comparisons per
second, while operating on hdf5 data itself gives me about 300 comparisons
per second.

Is there a way to speed this process up?

Example follows (this is not my real code, just an example):

*Small Set*:

with tb.openFile(h5_file, 'r') as f:
    data = f.root.data

    N_elements = len(data)
    elements = np.empty((N_elements, int(1e5)))

    # copy the 'element' field of each row into memory
    for ii, d in enumerate(data):
        elements[ii] = d['element']

D = np.empty((N_elements, N_elements))
for ii in xrange(N_elements):
    for jj in xrange(ii+1, N_elements):
        D[ii, jj] = compare(elements[ii], elements[jj])

*Large Set*:

with tb.openFile(h5_file, 'r') as f:
    data = f.root.data

    N_elements = len(data)

    D = np.empty((N_elements, N_elements))
    for ii in xrange(N_elements):
        for jj in xrange(ii+1, N_elements):
            D[ii, jj] = compare(data['element'][ii], data['element'][jj])
--
Master Visual Studio, SharePoint, SQL, ASP.NET, C# 2012, HTML5, CSS,
MVC, Windows 8 Apps, JavaScript and much more. Keep your skills current
with LearnDevNow - 3,200 step-by-step video tutorials by Microsoft
MVPs and experts. ON SALE this month only -- learn more at:
http://p.sf.net/sfu/learnmore_122712
___
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users


Re: [Pytables-users] Nested Iteration of HDF5 using PyTables

2013-01-03 Thread Anthony Scopatz
Hi David,

Tables and table column iteration have been overhauled fairly recently [1].
So you might try creating two iterators, offset by one, and then doing the
comparison.  I am hacking this out super quick so please forgive me:

from itertools import izip

with tb.openFile(...) as f:
    data = f.root.data
    data_i = iter(data)
    data_j = iter(data)
    data_i.next() # throw the first value away
    for i, j in izip(data_i, data_j):
        compare(i, j)

You get the idea ;)

Be Well
Anthony

1. https://github.com/PyTables/PyTables/issues/27
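
To see what the two offset iterators yield, here is a small sketch on a plain list rather than a PyTables table (the list `values` is a made-up stand-in, and Python 3's `zip` and `next()` take the place of `izip` and `.next()`). On this stand-in the pattern pairs each element with its predecessor, giving N-1 adjacent pairs, so covering all N(N-1)/2 pairs would presumably still need an outer loop around it.

```python
# Hypothetical stand-in for the rows of a table.
values = [10, 20, 30, 40]

it_i = iter(values)
it_j = iter(values)
next(it_i)  # discard the first value so it_i runs one step ahead of it_j

# zip stops as soon as the shorter iterator (it_i) is exhausted.
adjacent = list(zip(it_i, it_j))

print(adjacent)
```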




Re: [Pytables-users] Nested Iteration of HDF5 using PyTables

2013-01-03 Thread Josh Ayers
David,

The change in issue 27 was only for iteration over a tables.Column
instance.  To use it, tweak Anthony's code as follows.  This will iterate
over the element column, as in your original example.

Note also that this will only work with the development version of PyTables
available on github.  It will be very slow using the released v2.4.0.


from itertools import izip

with tb.openFile(...) as f:
    data = f.root.data.cols.element
    data_i = iter(data)
    data_j = iter(data)
    data_i.next() # throw the first value away
    for i, j in izip(data_i, data_j):
        compare(i, j)


Hope that helps,
Josh
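
A different technique from the iterator approach above, sketched here only as one possible option for the memory-limited case: read the column in blocks and compare block against block, so that at most two blocks are in memory at a time. This is not something proposed in the thread. The helper names `iter_blocks` and `pairwise_blocked` are invented for this example, the code is Python 3, and a plain list stands in for the column; the only assumption about the real column object is that it supports `len()` and slicing.

```python
def iter_blocks(n, block_size):
    """Yield (start, stop) index ranges that cover 0..n in order."""
    for start in range(0, n, block_size):
        yield start, min(start + block_size, n)


def pairwise_blocked(column, compare, block_size):
    """Compare every pair ii < jj, reading `column` one slice at a time.

    `column` only needs len() and slicing. Results are returned as a dict
    keyed by (ii, jj), standing in for the upper triangle of D.
    """
    n = len(column)
    results = {}
    for i0, i1 in iter_blocks(n, block_size):
        block_i = column[i0:i1]  # one read for the outer block
        for j0 in range(i0, n, block_size):
            j1 = min(j0 + block_size, n)
            # Reuse the outer block on the diagonal, otherwise read once.
            block_j = block_i if j0 == i0 else column[j0:j1]
            for ii in range(i0, i1):
                # Start past ii so each unordered pair is visited once.
                for jj in range(max(j0, ii + 1), j1):
                    results[(ii, jj)] = compare(block_i[ii - i0],
                                                block_j[jj - j0])
    return results


# Usage with a made-up column and comparison function.
column = list(range(7))
result = pairwise_blocked(column, lambda a, b: abs(a - b), block_size=3)
print(len(result))
```

The block size trades memory for the number of reads: larger blocks mean fewer slices are read but more data is held at once.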





Re: [Pytables-users] Nested Iteration of HDF5 using PyTables

2013-01-03 Thread Anthony Scopatz
Yup, that is right, thanks Josh!

