[Pytables-users] Nested Iteration of HDF5 using PyTables
I was hoping someone could help me out here. This is from a question I posted on StackOverflow.

I have a fairly large dataset that I store in HDF5 and access using PyTables. One operation I need to perform on this dataset is a pairwise comparison between each of the elements. This requires two loops: an outer loop over each element, and an inner loop over every other element. The operation thus performs N(N-1)/2 comparisons.

For fairly small sets I found it faster to dump the contents into a multidimensional numpy array and then iterate over that. With large sets I run into memory problems and need to access each element of the dataset at run time.

Putting the elements into an array gives me about 600 comparisons per second, while operating on the HDF5 data itself gives me about 300 comparisons per second.

Is there a way to speed this process up?

Example follows (this is not my real code, just an example):

*Small Set*:

    import numpy as np
    import tables as tb

    with tb.openFile(h5_file, 'r') as f:
        data = f.root.data
        N_elements = len(data)
        # Copy every row's 'element' field into memory first.
        elements = np.empty((N_elements, 100000))
        for ii, d in enumerate(data):
            elements[ii] = d['element']

    D = np.empty((N_elements, N_elements))
    for ii in xrange(N_elements):
        for jj in xrange(ii+1, N_elements):
            D[ii, jj] = compare(elements[ii], elements[jj])

*Large Set*:

    with tb.openFile(h5_file, 'r') as f:
        data = f.root.data
        N_elements = len(data)
        D = np.empty((N_elements, N_elements))
        # Read both elements from the file on every comparison.
        for ii in xrange(N_elements):
            for jj in xrange(ii+1, N_elements):
                D[ii, jj] = compare(data['element'][ii], data['element'][jj])

___
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
Re: [Pytables-users] Nested Iteration of HDF5 using PyTables
Hi David,

Tables and table column iteration have been overhauled fairly recently [1]. So you might try creating two iterators, offset by one, and then doing the comparison. I am hacking this out super quick, so please forgive me:

    from itertools import izip

    with tb.openFile(...) as f:
        data = f.root.data
        data_i = iter(data)
        data_j = iter(data)
        data_i.next()  # throw the first value away
        for i, j in izip(data_i, data_j):
            compare(i, j)

You get the idea ;)

Be Well
Anthony

1. https://github.com/PyTables/PyTables/issues/27

On Thu, Jan 3, 2013 at 9:25 AM, David Reed david.ree...@gmail.com wrote:
[snip]
Re: [Pytables-users] Nested Iteration of HDF5 using PyTables
David,

The change in issue 27 was only for iteration over a tables.Column instance. To use it, tweak Anthony's code as follows. This will iterate over the element column, as in your original example.

Note also that this will only work with the development version of PyTables available on GitHub. It will be very slow using the released v2.4.0.

    from itertools import izip

    with tb.openFile(...) as f:
        # Iterate over the 'element' column rather than the whole table.
        data = f.root.data.cols.element
        data_i = iter(data)
        data_j = iter(data)
        data_i.next()  # throw the first value away
        for i, j in izip(data_i, data_j):
            compare(i, j)

Hope that helps,
Josh

On Thu, Jan 3, 2013 at 9:11 AM, Anthony Scopatz scop...@gmail.com wrote:
[snip]
Re: [Pytables-users] Pytables-users Digest, Vol 80, Issue 2
Thanks Anthony, but unless I'm missing something, I don't think that method will work, since it only compares the i-th element with the (i+1)-th element. I still need two for loops, right?

Using itertools might speed things up though. I've never used it, so I will give it a shot and let you know how it goes. Looks like I need to download the latest release before I do that too. Thanks for the help.

-Dave
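The point about adjacent pairs can be checked on a toy list. This is only a sketch: `rows` is a made-up stand-in for the table, and it is written for Python 3, where `zip` and `range` take the place of `izip` and `xrange`.

```python
# Hypothetical stand-in for the rows of the table.
rows = ["a", "b", "c", "d"]

# Two iterators over the same data, the first advanced by one element.
it_i = iter(rows)
it_j = iter(rows)
next(it_i)  # throw the first value away
adjacent_pairs = list(zip(it_i, it_j))

# The nested loops from the original post, for comparison.
n = len(rows)
all_pairs = [(rows[ii], rows[jj])
             for ii in range(n)
             for jj in range(ii + 1, n)]

print(len(adjacent_pairs))  # N - 1 pairs
print(len(all_pairs))       # N(N-1)/2 pairs
```

With four rows the offset iterators produce 3 pairs, while the nested loops produce all 6.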
Re: [Pytables-users] Pytables-users Digest, Vol 80, Issue 3
Thanks a lot for the help so far guys!

Looking at itertools, I found what I believe to be the perfect function for what I need: itertools.combinations. This appears to be a valid replacement for the method proposed.

There is a small problem I didn't mention: my compare function actually takes two columns from the table as inputs. Like so:

    D = np.empty((N_elements, N_elements))
    for ii in xrange(N_elements):
        for jj in xrange(ii+1, N_elements):
            D[ii, jj] = compare(data['element1'][ii], data['element1'][jj],
                                data['element2'][ii], data['element2'][jj])

Is there an efficient way of using itertools with this structure?
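As a quick sanity check on the claim that itertools.combinations can replace the nested loops, the sketch below compares the index pairs each approach visits. `n_elements` is an arbitrary illustrative value, and the code is Python 3 (`range` instead of `xrange`).

```python
from itertools import combinations

n_elements = 5  # hypothetical table length, chosen only for illustration

# Index pairs visited by the nested loops from the original post.
nested = [(ii, jj)
          for ii in range(n_elements)
          for jj in range(ii + 1, n_elements)]

# The same pairs from a single call to combinations.
single = list(combinations(range(n_elements), 2))

print(nested == single)
```

Both produce the same (ii, jj) pairs in the same order, so `D[ii, jj]` can be filled from one `for ii, jj in combinations(...)` loop.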
Re: [Pytables-users] Pytables-users Digest, Vol 80, Issue 4
I apologize if I'm starting to sound helpless, but I'm forced to work on Windows 7 at work and have never had any luck compiling Python source successfully. I have had to rely on precompiled binaries, and now it's biting me in the butt. Is there any quick fix I can use to improve this iteration with v2.4.0?
Re: [Pytables-users] Pytables-users Digest, Vol 80, Issue 4
The change was in pure Python code, so you should be able to just paste the changes into your local copy. Start with the table.Column.__iter__ method (lines 3296-3310) here.

https://github.com/PyTables/PyTables/blob/b479ed025f4636f7f4744ac83a89bc947808907c/tables/table.py

It needs to be modified slightly because it uses some additional features that aren't available in the released version (the out=buf_slice argument to table.read). The following should work.

    def __iter__(self):
        table = self.table
        itemsize = self.dtype.itemsize
        # Number of column values that fit in one I/O buffer.
        nrowsinbuf = table._v_file.params['IO_BUFFER_SIZE'] // itemsize
        max_row = len(self)
        for start_row in xrange(0, len(self), nrowsinbuf):
            end_row = min([start_row + nrowsinbuf, max_row])
            # Read one buffer's worth of this column, then yield its values.
            buf = table.read(start_row, end_row, 1, field=self.pathname)
            for row in buf:
                yield row

I haven't tested this, but I think it will work.

Josh

On Thu, Jan 3, 2013 at 1:25 PM, David Reed david.ree...@gmail.com wrote:
[snip]
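The buffering idea behind that `__iter__` can be tried without PyTables. The sketch below is an assumption-laden stand-in, not the library's code: `iter_buffered`, `fake_read` and `storage` are hypothetical names, and `fake_read(start, stop)` merely imitates a block read such as `table.read(start, stop, ...)`. It is written for Python 3.

```python
def iter_buffered(read, n_rows, rows_per_buffer):
    """Yield rows one at a time while fetching them in blocks.

    `read(start, stop)` is a hypothetical callable returning rows
    [start, stop); it stands in for a block read from the file.
    """
    for start in range(0, n_rows, rows_per_buffer):
        stop = min(start + rows_per_buffer, n_rows)
        for row in read(start, stop):
            yield row


storage = list(range(10))  # pretend on-disk column
calls = []                 # records each block read, for inspection


def fake_read(start, stop):
    calls.append((start, stop))
    return storage[start:stop]


rows = list(iter_buffered(fake_read, len(storage), 4))
print(rows)
print(calls)
```

Ten rows with a buffer of four rows are fetched in three reads rather than ten, which is where the speedup over row-at-a-time access would come from.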
Re: [Pytables-users] Nested Iteration of HDF5 using PyTables
Yup, that is right, thanks Josh!

On Thu, Jan 3, 2013 at 12:29 PM, Josh Ayers josh.ay...@gmail.com wrote:
[snip]
Re: [Pytables-users] Pytables-users Digest, Vol 80, Issue 3
On Thu, Jan 3, 2013 at 2:17 PM, David Reed david.ree...@gmail.com wrote: Thanks a lot for the help so far guys! Looking at itertools, I found what I believe to be the perfect function for what I need, itertools.combinations. This appears to be a valid replacement to the method proposed. Yes, combinations is awesome! There is a small problem that I didn't mention is that my compare function actually takes as inputs 2 columns from the table. Like so: D = np.empty((N_irises, N_irises)) for ii in xrange(N_elements): for jj in xrange(ii+1, N_elements): D[ii, jj] = compare(data['element1'][ii], data['element1'][jj],data['element2'][ii], data['element2'][jj]) Is there an efficient way of using itertools with this structure? You can always make two other iterators for each column. Since you have two columns you would have 4 iterators. I am not sure how fast this is going to be but I am confident that there is definitely a way to do this in one for-loop, which is going to be way faster than nested loops. Be Well Anthony On Thu, Jan 3, 2013 at 1:29 PM, pytables-users-requ...@lists.sourceforge.net wrote: Send Pytables-users mailing list submissions to pytables-users@lists.sourceforge.net To subscribe or unsubscribe via the World Wide Web, visit https://lists.sourceforge.net/lists/listinfo/pytables-users or, via email, send a message with subject or body 'help' to pytables-users-requ...@lists.sourceforge.net You can reach the person managing the list at pytables-users-ow...@lists.sourceforge.net When replying, please edit your Subject line so it is more specific than Re: Contents of Pytables-users digest... Today's Topics: 1. 
Re: Nested Iteration of HDF5 using PyTables (Josh Ayers)

--

Message: 1
Date: Thu, 3 Jan 2013 10:29:33 -0800
From: Josh Ayers josh.ay...@gmail.com
Subject: Re: [Pytables-users] Nested Iteration of HDF5 using PyTables
To: Discussion list for PyTables pytables-users@lists.sourceforge.net
Message-ID: cacob4anozyd7dafos7sxs07mchzb8zbripbbrvbazrv4weq...@mail.gmail.com
Content-Type: text/plain; charset=iso-8859-1

David,

The change in issue 27 was only for iteration over a tables.Column instance. To use it, tweak Anthony's code as follows. This will iterate over the element column, as in your original example.

Note also that this will only work with the development version of PyTables available on github. It will be very slow using the released v2.4.0.

    from itertools import izip

    with tb.openFile(...) as f:
        data = f.root.data.cols.element
        data_i = iter(data)
        data_j = iter(data)
        data_i.next()  # throw the first value away
        for i, j in izip(data_i, data_j):
            compare(i, j)

Hope that helps,
Josh

On Thu, Jan 3, 2013 at 9:11 AM, Anthony Scopatz scop...@gmail.com wrote:

HI David,

Tables and table column iteration have been overhauled fairly recently [1]. So you might try creating two iterators, offset by one, and then doing the comparison. I am hacking this out super quick so please forgive me:

    from itertools import izip

    with tb.openFile(...) as f:
        data = f.root.data
        data_i = iter(data)
        data_j = iter(data)
        data_i.next()  # throw the first value away
        for i, j in izip(data_i, data_j):
            compare(i, j)

You get the idea ;)

Be Well
Anthony

1. https://github.com/PyTables/PyTables/issues/27

On Thu, Jan 3, 2013 at 9:25 AM, David Reed david.ree...@gmail.com wrote:

I was hoping someone could help me out here. This is from a post I put up on StackOverflow. I have a fairly large dataset that I store in HDF5 and access using PyTables. One operation I need to do on this dataset is pairwise comparisons between each of the elements.
This requires 2 loops, one to iterate over each element, and an inner loop to iterate over every other element. This operation thus looks at N(N-1)/2 comparisons.

For fairly small sets I found it to be faster to dump the contents into a multidimensional numpy array and then do my iteration. I run into problems with large sets because of memory issues and need to access each element of the dataset at run time.

Putting the elements into an array gives me about 600 comparisons per second, while operating on the hdf5 data itself gives me about 300 comparisons per second. Is there a way to speed this process up?

Example follows (this is not my real code, just an example):

*Small Set*:

    with tb.openFile(h5_file, 'r') as f:
        data = f.root.data
        N_elements = len(data)
        elements = np.empty((N_elements, 100000))
        for ii, d in enumerate(data):
            elements[ii] = d['element']

    D = np.empty((N_elements, N_elements))
    for ii in xrange(N_elements):
        for jj in xrange(ii+1, N_elements):
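
For the two-column question raised at the top of this message, one way to get every pair out of a single for-loop is to zip the two columns together and hand the result to itertools.combinations. The following is only a sketch in Python 3 on plain lists, not tested against PyTables: pairwise_distances and the placeholder compare are hypothetical names, and the lists stand in for the element1 and element2 columns.

```python
from itertools import combinations


def pairwise_distances(col1, col2, compare):
    """Fill the upper triangle of an N x N matrix using a single for-loop.

    col1 and col2 are equal-length sequences standing in for the two table
    columns; compare is a hypothetical 4-argument comparison function.
    """
    # Pair each row index with its (element1, element2) values.
    rows = list(enumerate(zip(col1, col2)))
    n = len(rows)
    D = [[0.0] * n for _ in range(n)]
    # combinations yields each unordered pair once, always with ii < jj.
    for (ii, (a1, a2)), (jj, (b1, b2)) in combinations(rows, 2):
        D[ii][jj] = compare(a1, b1, a2, b2)
    return D


def compare(a1, b1, a2, b2):
    # Placeholder metric, not the poster's real comparison.
    return abs(a1 - b1) + abs(a2 - b2)


D = pairwise_distances([1, 4, 9], [10, 20, 40], compare)
```

Unlike the two offset iterators, which only visit adjacent rows, this visits all N(N-1)/2 pairs. One caveat: itertools.combinations copies its whole input into a tuple before yielding anything, so every row is held in memory at once, which may bring back the memory problem for the large set.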
Re: [Pytables-users] Pytables-users Digest, Vol 80, Issue 4
Josh is right that you can just edit the code by hand (which works but sucks).

However, on Windows -- on the rare occasion when I also have to develop on it -- I typically use a distribution that includes a compiler, cython, hdf5, and pytables already, and then I install my development version from github OVER this. I recommend either EPD or Anaconda, though other distributions listed here [1] might also work.

Be well
Anthony

1. http://numfocus.org/projects-2/software-distributions/

On Thu, Jan 3, 2013 at 3:46 PM, Josh Ayers josh.ay...@gmail.com wrote:

The change was in pure Python code, so you should be able to just paste in the changes to your local copy. Start with the table.Column.__iter__ method (lines 3296-3310) here.

https://github.com/PyTables/PyTables/blob/b479ed025f4636f7f4744ac83a89bc947808907c/tables/table.py

It needs to be modified slightly because it uses some additional features that aren't available in the released version (the out=buf_slice argument to table.read). The following should work.

    def __iter__(self):
        table = self.table
        itemsize = self.dtype.itemsize
        nrowsinbuf = table._v_file.params['IO_BUFFER_SIZE'] // itemsize
        max_row = len(self)
        for start_row in xrange(0, len(self), nrowsinbuf):
            end_row = min([start_row + nrowsinbuf, max_row])
            buf = table.read(start_row, end_row, 1, field=self.pathname)
            for row in buf:
                yield row

I haven't tested this, but I think it will work.

Josh

On Thu, Jan 3, 2013 at 1:25 PM, David Reed david.ree...@gmail.com wrote:

I apologize if I'm starting to sound helpless, but I'm forced to work on Windows 7 at work and have never had luck compiling python source successfully. I have had to rely on precompiled binaries and now it's biting me in the butt. Is there any quick fix I can do to improve this iteration using v2.4.0?
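
The __iter__ patch quoted above comes down to one pattern: fetch rows in fixed-size blocks, then yield them one at a time. A stand-alone sketch of that pattern, in Python 3 and without PyTables, might look like the following. The names iter_buffered and read_rows are hypothetical; read_rows(start, stop) stands in for a block read such as table.read(start, stop, 1, field=...), and the list is a stand-in for a column.

```python
def iter_buffered(read_rows, n_rows, rows_per_buffer):
    """Yield rows one at a time while fetching them in fixed-size blocks.

    read_rows(start, stop) is a hypothetical callable standing in for a
    block read against the underlying table.
    """
    if rows_per_buffer < 1:
        raise ValueError("rows_per_buffer must be at least 1")
    for start in range(0, n_rows, rows_per_buffer):
        # The last block may be shorter than rows_per_buffer.
        stop = min(start + rows_per_buffer, n_rows)
        for row in read_rows(start, stop):
            yield row


# Stand-in data source that records how many block reads were issued.
column = list(range(10))
reads = []


def read_rows(start, stop):
    reads.append((start, stop))
    return column[start:stop]


rows = list(iter_buffered(read_rows, len(column), 4))
```

Ten rows with a buffer of four rows cost three block reads instead of ten single-row reads, which is where the speedup would come from. The explicit check on rows_per_buffer is an addition of this sketch: in the quoted patch, a row larger than IO_BUFFER_SIZE would make nrowsinbuf zero, and a zero step makes xrange raise.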
On Thu, Jan 3, 2013 at 3:17 PM, pytables-users-requ...@lists.sourceforge.net wrote:

Today's Topics:

1. Re: Pytables-users Digest, Vol 80, Issue 2 (David Reed)
2. Re: Pytables-users Digest, Vol 80, Issue 3 (David Reed)

--

Message: 1
Date: Thu, 3 Jan 2013 13:44:29 -0500
From: David Reed david.ree...@gmail.com
Subject: Re: [Pytables-users] Pytables-users Digest, Vol 80, Issue 2
To: pytables-users@lists.sourceforge.net
Message-ID: CAM6XA7=8ocg5WPD4KLSvLhSw-3BCvq5u7MRxq3Ajd6ha= ev...@mail.gmail.com
Content-Type: text/plain; charset=iso-8859-1

Thanks Anthony, but unless I'm missing something I don't think that method will work, since it will only be comparing the ith element with the (i+1)th element. I still need 2 for loops, right?

Using itertools might speed things up though. I've never used them, so I will give it a shot and let you know how it goes. Looks like I need to download the latest release before I do that too.

Thanks for the help.
-Dave

On Thu, Jan 3, 2013 at 12:12 PM, pytables-users-requ...@lists.sourceforge.net wrote:

Today's Topics:

1. Re: Nested Iteration of HDF5 using PyTables (Anthony Scopatz)

--

Message: 1
Date: Thu, 3 Jan 2013 11:11:47 -0600
From: Anthony Scopatz scop...@gmail.com
Subject: Re: [Pytables-users] Nested Iteration of HDF5 using PyTables
To: Discussion list for PyTables pytables-users@lists.sourceforge.net
Message-ID: CAPk-6T5b= 1egagp4+jhjcd3_4fnvbxrob2jbhay45rwdqzy...@mail.gmail.com
Content-Type: text/plain; charset=iso-8859-1

HI David,

Tables and table column iteration have been overhauled fairly recently [1]. So you might try creating two iterators, offset by one, and then doing the comparison. I am hacking this out