Re: [Pytables-users] Question about Leaf.remove() method
Hello Premal,

This is just how HDF5 works. When you delete a Leaf, the reference to that node is removed and the space in the file becomes available for future use. However, HDF5 will never shrink a file; it will only grow it. New data can fill the freed space, but the space itself does not go away -- it just sits there empty. If you really want to get rid of this extraneous space, use the ptrepack or h5repack command-line utilities to create a clean copy of the file.

Hope this helps.

Be Well
Anthony

On Thu, Aug 29, 2013 at 10:40 AM, Forafo San ppv.g...@gmail.com wrote:

Hello All, I have some data in an HDF5 file that was created with PyTables. Occasionally I update the data by reading in one of the tables and adding or deleting rows. I then create a new table containing the updated data, give it a random name, and let it reside in the same group as the old table. I flush the new table, use the Table.remove() (i.e., Leaf.remove()) method to delete the old table, and use Table.rename() to give the new table the old table's name.

Problem: for a small table, the size of the HDF5 file doubles with the above process even when no new rows or other modifications are made (assume the file contains only this table). A ptdump shows no trace of the old table. For a medium-sized table, the file size still rises substantially (20% or 30%) even when no new rows or columns are added. Do I understand Table.remove() correctly as completely deleting the table? Does it leave some residue that I should be aware of?

All help is appreciated. Thanks, Premal
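A minimal sketch of the remove-then-repack workflow described above; the file, group, and node names are hypothetical, and ptrepack is assumed to be on the PATH:

    import os
    import subprocess
    import tables as tb

    # Removing a node frees space inside the file but does not shrink it.
    with tb.open_file('data.h5', mode='a') as f:
        f.remove_node('/mygroup', 'old_table')
    print(os.path.getsize('data.h5'))      # size unchanged

    # Repacking into a fresh file actually reclaims the space.
    subprocess.check_call(['ptrepack', 'data.h5:/', 'packed.h5:/'])
    print(os.path.getsize('packed.h5'))    # smaller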
Re: [Pytables-users] modifying a table column
Hey Sasha,

You probably want to look at the Expr class [1], where you can set the output to be the same as the original array.

Be Well
Anthony

1. http://pytables.github.io/usersguide/libref/expr_class.html

On Tue, Aug 27, 2013 at 11:44 AM, Oleksandr Huziy guziy.sa...@gmail.com wrote:

Hi All: I have a huge table imported from other binary files into HDF, and in one case I forgot to multiply the data by a factor. Is there an easy way to multiply a column by a constant factor using PyTables, modifying it in place? Thank you -- Sasha
Re: [Pytables-users] modifying a table column
On Tue, Aug 27, 2013 at 6:50 PM, Oleksandr Huziy guziy.sa...@gmail.com wrote:

Hi Again:

2013/8/27 Anthony Scopatz scop...@gmail.com: You probably want to look at the Expr class [1], where you can set the output to be the same as the original array.

I just wanted to make sure: is it possible to use an assignment in expressions? (This gives me a syntax error exception; it complains about the equal sign in the expression.)

Hi Sasha,

Assignment is a statement, not an expression, so it is not possible to use it here. This is why you are getting a syntax error.

    h = tb.open_file(path, mode='a')
    varTable = h.get_node('/', var_name)
    coef = 3 * 60 * 60  # output step
    expr = tb.Expr('c = c * m', uservars={'c': varTable.cols.field, 'm': coef})  # fails: assignment not allowed
    expr.eval()
    varTable.flush()
    h.close()

Is this an optimal way of multiplying a column? (This one works, but I think it loads all the data into memory... right?)

    expr = tb.Expr('c * m', uservars={'c': varTable.cols.field, 'm': coef})
    varTable.cols.field[:] = expr.eval()

You are right that this loads the entire computed array into memory and is therefore not optimal. I would do something like the following:

    h = tb.open_file(path, mode='a')
    varTable = h.get_node('/', var_name)
    coef = 3 * 60 * 60  # output step
    c = varTable.cols.field
    expr = tb.Expr('c * m', uservars={'c': c, 'm': coef})
    expr.set_output(c)
    expr.eval()
    varTable.flush()
    h.close()

Be Well
Anthony

Thank you
Cheers
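A self-contained sketch of the in-place pattern from this thread, under the assumption (borne out by the exchange above) that Expr.set_output() accepts a table column; the file, table, and field names are toy values:

    import numpy as np
    import tables as tb

    # Build a toy table with one float column named 'field'.
    with tb.open_file('toy.h5', 'w') as h:
        data = np.zeros(1000, dtype=[('field', np.float64)])
        data['field'] = np.arange(1000.0)
        h.create_table('/', 'var', data)

    # Multiply the column in place; Expr streams chunk by chunk, so the
    # whole column is never loaded into memory at once.
    with tb.open_file('toy.h5', 'a') as h:
        varTable = h.get_node('/', 'var')
        c = varTable.cols.field
        expr = tb.Expr('c * m', uservars={'c': c, 'm': 3 * 60 * 60})
        expr.set_output(c)      # write results back into the same column
        expr.eval()
        varTable.flush()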
Re: [Pytables-users] modifying a table column
Glad I could help!

On Tue, Aug 27, 2013 at 7:44 PM, Oleksandr Huziy guziy.sa...@gmail.com wrote:

2013/8/27 Anthony Scopatz scop...@gmail.com: You are right that this loads the entire computed array into memory and is therefore not optimal. I would do something like the following:

    h = tb.open_file(path, mode='a')
    varTable = h.get_node('/', var_name)
    coef = 3 * 60 * 60  # output step
    c = varTable.cols.field
    expr = tb.Expr('c * m', uservars={'c': c, 'm': coef})
    expr.set_output(c)
    expr.eval()
    varTable.flush()
    h.close()

Aha, this is cool. Thanks Anthony.

Cheers -- Sasha
Re: [Pytables-users] Numpy Arrays to Structure Array or Table
Hi David,

I think that you can do what you want in one rather long line:

    hfile.createTable(grp, 'signal',
                      description=np.array(zip(*some_func(t, v)),
                                           dtype=[('time', np.float64), ('value', np.float64)]))

Or two nicer lines:

    arr = np.array(zip(*some_func(t, v)), dtype=[('time', np.float64), ('value', np.float64)])
    hfile.createTable(grp, 'signal', description=arr)

zip() is your friend =). If zip is too slow, you could try something like this:

    temparr = np.ascontiguousarray(np.array(some_func(t, v)).T)  # view() needs C-contiguous data
    arr = temparr.view(dtype=[('time', np.float64), ('value', np.float64)]).ravel()

This really only works because both columns have the same dtype. Of course, you can always keep basically what you have and loop through the column names programmatically:

    for name, col in zip(A.dtype.names, some_func(t, v)):
        A[name] = col

I hope this helps!

Be Well
Anthony

On Wed, Aug 7, 2013 at 5:58 PM, David Reed david.ree...@gmail.com wrote:

Hi there, I have some generic functions that take time-series data as two numpy array arguments, time and value, and return two numpy arrays of time and value. I would like to place these arrays into a numpy structured array, or directly into a new PyTables table with fields time and value. I've found I can do this:

    t, v = some_func(t, v)
    A = np.empty(len(t), dtype=[('time', np.float64), ('value', np.float64)])
    A['time'] = t
    A['value'] = v
    hfile.createTable(grp, 'signal', description=A)
    hfile.flush()

But this seems rather clunky and inefficient. Any suggestions to make this repackaging a little smoother?
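Another option, not mentioned in the thread but offered here as a hedged alternative, is numpy's records helper, which assembles the structured array from the two columns directly; some_func, t, and v stand in for the thread's hypothetical names:

    import numpy as np

    def some_func(t, v):            # stand-in for the thread's function
        return t * 2.0, v + 1.0

    t = np.linspace(0.0, 1.0, 5)
    v = np.random.rand(5)
    time, value = some_func(t, v)

    # fromarrays builds the structured array column by column, no zip() needed.
    arr = np.rec.fromarrays([time, value],
                            dtype=[('time', np.float64), ('value', np.float64)])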
Re: [Pytables-users] suitable for storing data like k-v style?
Hi Jason,

A key-value store pattern is definitely supported. Be forewarned, however, that groups are implemented using B-trees, not hash tables, so lookup is not strictly O(1). With data of your size, though, most of the access time will be spent in the leaf nodes, not in finding the group. I'd say try it out and see.

Be Well
Anthony

On Wed, Aug 7, 2013 at 11:33 AM, Xianli Xu xiaolou.c...@gmail.com wrote:

Hi all, I'm developing a data processing service and evaluating PyTables. Since HDF5 supports hierarchical data like a tree of folders, can I use such a tree-like structure as a K-V store, possibly storing millions of tables or arrays under one group and randomly accessing any one of them in O(1) time? E.g.:

    root/
      user_log/
        uid1 - table / array (tens of thousands of rows/elements, ETL'ed user log info in int format)
        uid2 - table / array
        uid3 - table / array
        ... (perhaps millions of users)

Just wondering how the hierarchical structure is implemented and whether such a usage pattern is supported. If not, is there a better way to store this type of information? We adopted PyTables because the data is stored at higher density, loads faster, and has no ACID/concurrency overhead, so traditional and NoSQL databases are not an option for us.

Thanks, Jason
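A minimal sketch of the key-value pattern under discussion (file and node names are invented):

    import numpy as np
    import tables as tb

    with tb.open_file('kv.h5', 'w') as f:
        logs = f.create_group('/', 'user_log')
        for uid in range(1000):            # scale toward millions as needed
            f.create_array(logs, 'uid%d' % uid,
                           np.arange(10, dtype=np.int64))

    with tb.open_file('kv.h5', 'r') as f:
        # "Key lookup" is a child-node fetch, a B-tree walk inside HDF5.
        arr = f.get_node('/user_log', 'uid42')[:]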
Re: [Pytables-users] dates and space
On Mon, Aug 5, 2013 at 1:38 PM, Oleksandr Huziy guziy.sa...@gmail.com wrote:

Hi PyTables users and developers: I have a few questions to which I could not find answers in the documentation. Thank you in advance for any help.

1. If I store dates in PyTables, does that mean I could write queries like table.where('date.month == 5')? Is there a common way to convert between Python's datetime and PyTables' datetime?

Hello Sasha,

PyTables times are based on C time, not Python's datetimes, because they use the HDF5 time types. So unfortunately you can't write queries like the one above. (You'd need to talk to numexpr about getting that kind of query implemented ~_~.) Instead, I would suggest that you store your times as Float64Atoms and Float64Cols and then use arithmetic in the query:

    table.where('(x / 3600 / 24) % 12 == 5')

This is not perfect...

2. I have several variables stored in the same file, in a separate table for each variable, with separate columns year, month, day, hour, minute, second to mark the time of each record (the records are not necessarily ordered in time). I was thinking of putting all the variables in the same table and using missing values for variables that have no output at a given time step. Is it possible to use None as a default value in a table (so I could easily filter dummy rows)?

It is not possible to use None, since that is a Python object of a different type than the integers you are storing in the column. I would suggest using values with no actual meaning: if you are using normal ints, -1 can represent missing values; if you are using unsigned ints you have to pick other values, like 13 for the month.

But then again, the data comes in chunks. Does this mean I would have to check whether a row with the same date already exists for a different variable?

No, you wouldn't: you can store the same date multiple times in different rows.

I don't really like the ideas in 2, which are intended to save space, but maybe all I need is a good compression level? Can somebody advise me on this?

Compression would definitely help here, since the date numbers are all fairly similar; probably even a compression level of 1 would work. Keep in mind that using compression sometimes actually speeds things up (see the "starving CPU" problem). You might just need to experiment with a few different compression levels to see how things go; 0, 1, 5, and 9 give you a good spread.

Be Well
Anthony

Cheers -- Oleksandr (Sasha) Huziy
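A sketch of the float-seconds approach, storing POSIX timestamps and querying by range (the record layout and file name are made up for illustration):

    import time
    import tables as tb

    class Record(tb.IsDescription):
        t = tb.Float64Col()        # POSIX seconds since the epoch
        value = tb.Float64Col()

    with tb.open_file('times.h5', 'w') as f:
        tbl = f.create_table('/', 'records', Record)
        row = tbl.row
        for i in range(100):
            row['t'] = time.time() + i * 3600.0
            row['value'] = float(i)
            row.append()
        tbl.flush()

        # Range queries on the raw seconds work in-kernel.
        t0 = time.time() + 10 * 3600.0
        hits = [r['value'] for r in tbl.where('t > t0', condvars={'t0': t0})]
        # datetime.datetime.fromtimestamp(r['t']) recovers a Python datetime.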
Re: [Pytables-users] Clear chunks from CArray
Hello Giovanni,

I think you may need to del that slice and then possibly repack. Hope this helps.

Be Well
Anthony

On Mon, Aug 5, 2013 at 2:09 PM, Giovanni Luca Ciampaglia glciamp...@gmail.com wrote:

Hello all, is there a way to clear out a chunk from a CArray? I noticed that setting the data to zero actually takes disk space, i.e.:

    from tables import open_file, BoolAtom
    h5f = open_file('test.h5', 'w')
    ca = h5f.create_carray(h5f.root, 'carray', BoolAtom(),
                           shape=(1000, 1000), chunkshape=(1, 1000))
    ca[:, :] = False
    h5f.close()

The resulting file takes 249K...

Best, -- Giovanni Luca Ciampaglia, Postdoctoral fellow, Center for Complex Networks and Systems Research, Indiana University ☞ http://cnets.indiana.edu/ ✉ gciam...@indiana.edu
Re: [Pytables-users] Clear chunks from CArray
On Mon, Aug 5, 2013 at 3:14 PM, Giovanni Luca Ciampaglia glciamp...@gmail.com wrote:

Hi Anthony, what do you mean precisely? I tried del ca[:, :], but CArray does not support __delitem__. Looking at the documentation I could only find a method called remove_rows, but it's in Table, not CArray. Maybe I am missing something?

Thanks, Giovanni

Huh, it should... This is definitely an oversight on our part. If you could please open an issue for this -- or better yet, write a pull request that implements __delitem__ -- that'd be great! So I think you are right that there is no current way to delete rows from a CArray. Oops! (Of course, I may still be missing something as well.) It looks like EArray has this problem too, otherwise I would just tell you to use that.

Be Well
Anthony
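Not a replacement for true chunk deletion, but for the all-zero chunks in Giovanni's example a compression filter makes the written chunks nearly free; a sketch with the same shapes as above:

    import tables as tb

    h5f = tb.open_file('test_zlib.h5', 'w')
    ca = h5f.create_carray(h5f.root, 'carray', tb.BoolAtom(),
                           shape=(1000, 1000), chunkshape=(1, 1000),
                           filters=tb.Filters(complevel=1, complib='zlib'))
    ca[:, :] = False    # all-zero chunks now compress to a few bytes each
    h5f.close()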
Re: [Pytables-users] Tables vs Arrays
On Sun, Jul 28, 2013 at 8:38 PM, David Reed david.ree...@gmail.com wrote:

I'm really trying to become more productive using PyTables, but am struggling with what I should be using. What's the difference between a table and an array?

Hi David,

The difference between Arrays and Tables is, conceptually, the same as the difference between numpy arrays and numpy structured arrays. A plain old Array is a contiguous block of a single data type. Tables and structured arrays have a more complex data type that is composed of a sequence of other data types (i.e., the fields/columns). Which data structure you use really depends on the type of problem you are trying to solve and what kinds of questions you want to answer with that data structure. That said, the implementation of Tables is far more similar to EArrays than to Arrays, so a lot of the performance trade-offs you see are similar. You should watch my "HDF5 is for Lovers" talk for more generic advice [1].

I hope this helps!

Be Well
Anthony

1. http://www.youtube.com/watch?v=Nzx0HAd3FiI
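A small sketch contrasting the two node types (file and node names are arbitrary):

    import numpy as np
    import tables as tb

    with tb.open_file('demo.h5', 'w') as f:
        # Array: one contiguous block of a single dtype.
        f.create_array('/', 'plain', np.arange(10.0))

        # Table: rows of named, typed fields, like a structured array.
        rows = np.zeros(10, dtype=[('id', np.int32), ('x', np.float64)])
        f.create_table('/', 'records', rows)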
Re: [Pytables-users] PyTables and Multiprocessing
On Fri, Jul 12, 2013 at 1:51 AM, Mathieu Dubois duboismathieu_g...@yahoo.fr wrote:

Hi Anthony, thank you very much for your answer (it works). I will try to remodel my code around this trick, but I'm not sure it's possible because I use a framework that needs arrays.

I think that this method still works: you can always send back to the main process a numpy array that you pull out in a subprocess.

Can somebody explain what is going on? I was thinking that PyTables keeps a weakref to the file for lazy loading, but I'm not sure how. In any case, the PyTables community is very helpful. Thanks, Mathieu

Glad to help!

Be Well
Anthony

Le 12/07/2013 00:44, Anthony Scopatz a écrit :

Hi Mathieu, I think you should try opening a new file handle per process. The following works for me on v3.0:

    import tables
    import random
    import multiprocessing

    # Use multiprocessing to perform a simple computation (column average).
    def f(filename):
        h5file = tables.openFile(filename, mode='r')
        name = multiprocessing.current_process().name
        column = random.randint(0, 10)
        print '%s use column %i' % (name, column)
        rtn = h5file.root.X[:, column].mean()
        h5file.close()
        return rtn

    p = multiprocessing.Pool(2)
    col_mean = p.map(f, ['test.hdf5', 'test.hdf5', 'test.hdf5'])
Re: [Pytables-users] HDF5/PyTables/NumPy Question
Hi Robert,

Glad these materials can be helpful. (Note: these questions really should be asked on the pytables-users mailing list -- CC'd here -- so please join that list: https://lists.sourceforge.net/lists/listinfo/pytables-users)

On Fri, Jul 12, 2013 at 12:48 PM, Robert Nelson rrnel...@atmos.colostate.edu wrote:

Dr. Scopatz, I came across your SciPy 2012 "HDF5 is for Lovers" video and thought you might be able to help me. I'm trying to read large (1 GB) HDF files and do multidimensional indexing (with repeated values) on them. I saw a post of yours from over a year ago (http://www.mail-archive.com/pytables-users@lists.sourceforge.net/msg02586.html) saying that the best solution would be to convert to a NumPy array, but this takes too long. Have there been any updates in PyTables that would make this possible? Thank you! Robert Nelson, Colorado State University, rob.r.nel...@gmail.com

I think the strategy is the same as before. The original asker (to the best of my recollection) did not open an issue, so no changes have been made to PyTables to handle this. Also, in this strategy you should only be loading the indices to start with; I doubt (though I could be wrong) that you have 1 GB of index data alone. The whole idea is to do a unique (set) and a sort operation on the much smaller index data AND THEN use fancy indexing to pull the actual data back out. As always, some sample code and a sample file would be extremely helpful; I don't think I can do much more for you without them.

Be Well
Anthony
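A sketch of the unique-then-fancy-index strategy described above; the dataset name and index values are invented, and only plain slicing is assumed on the PyTables side:

    import numpy as np
    import tables as tb

    with tb.open_file('big.h5', 'r') as f:
        data = f.root.data                     # large on-disk array

        idx = np.array([7, 3, 7, 7, 3, 42])    # repeated, unsorted indices
        uniq = np.unique(idx)                  # dedups AND sorts, in memory

        # One contiguous read covers every needed row; numpy fancy
        # indexing then picks the rows out of the in-memory block.
        lo, hi = int(uniq[0]), int(uniq[-1]) + 1
        block = data[lo:hi]
        rows = block[uniq - lo]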
Re: [Pytables-users] PyTables and Multiprocessing
Hi Mathieu,

I think you should try opening a new file handle per process. The following works for me on v3.0:

    import tables
    import random
    import multiprocessing

    # Use multiprocessing to perform a simple computation (column average).
    def f(filename):
        h5file = tables.openFile(filename, mode='r')
        name = multiprocessing.current_process().name
        column = random.randint(0, 10)
        print '%s use column %i' % (name, column)
        rtn = h5file.root.X[:, column].mean()
        h5file.close()
        return rtn

    p = multiprocessing.Pool(2)
    col_mean = p.map(f, ['test.hdf5', 'test.hdf5', 'test.hdf5'])

Be well
Anthony

On Thu, Jul 11, 2013 at 3:43 PM, Mathieu Dubois duboismathieu_g...@yahoo.fr wrote:

Le 11/07/2013 21:56, Anthony Scopatz a écrit :

On Thu, Jul 11, 2013 at 2:49 PM, Mathieu Dubois duboismathieu_g...@yahoo.fr wrote:

Hello, I wanted to use PyTables in conjunction with multiprocessing for some embarrassingly parallel tasks. However, it seems that it is not possible. In the following (very stupid) example, X is a CArray of size (100, 10) stored in the file test.hdf5:

    import random
    import tables
    import multiprocessing

    # Reload the data
    h5file = tables.openFile('test.hdf5', mode='r')
    X = h5file.root.X

    # Use multiprocessing to perform a simple computation (column average)
    def f(X):
        name = multiprocessing.current_process().name
        column = random.randint(0, n_features)
        print '%s use column %i' % (name, column)
        return X[:, column].mean()

    p = multiprocessing.Pool(2)
    col_mean = p.map(f, [X, X, X])

When executing it, I get the following error:

    Exception in thread Thread-2:
    Traceback (most recent call last):
      File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
        self.run()
      File "/usr/lib/python2.7/threading.py", line 504, in run
        self.__target(*self.__args, **self.__kwargs)
      File "/usr/lib/python2.7/multiprocessing/pool.py", line 319, in _handle_tasks
        put(task)
    PicklingError: Can't pickle type 'weakref': attribute lookup __builtin__.weakref failed

I have googled for weakref and pickle but can't find a solution. Any help?

Hello Mathieu, I have used multiprocessing and files opened in read mode many times, so I am not sure what is going on here.

Thanks for your answer. Maybe you can point me to a working example?

Could you provide the test.hdf5 file so that we could try to reproduce this?

Here is the script that I used to generate the data:

    import tables
    import numpy

    # Create data and store it
    n_features = 10
    n_obs = 100
    X = numpy.random.rand(n_obs, n_features)
    h5file = tables.openFile('test.hdf5', mode='w')
    Xatom = tables.Atom.from_dtype(X.dtype)
    Xhdf5 = h5file.createCArray(h5file.root, 'X', Xatom, X.shape)
    Xhdf5[:] = X
    h5file.close()

I hope it's not a stupid mistake. I am using PyTables 2.3.1 on Ubuntu 12.04 (libhdf5 1.8.4patch1). By the way, I have noticed that by slicing a CArray, I get a numpy array (I created the HDF5 file with numpy); therefore, everything is copied to memory. Is there a way to avoid that?

Only the slice that you ask for is brought into memory, and it is returned as a non-view numpy array.

OK. I will be careful about that.

Be Well
Anthony

Mathieu
Re: [Pytables-users] `__iter__` state and `itertools.islice` when
On Tue, Jul 9, 2013 at 8:57 AM, Tony Yu tsy...@gmail.com wrote:

On Tue, Jul 9, 2013 at 12:58 AM, Antonio Valentino antonio.valent...@tiscali.it wrote:

snip

Yes, this is a bug IMO. Thank you for reporting, and thank you for the small demonstration script. Can you please file a bug report on github [1]? Please also add info about the PyTables version you used for the test.

-- Antonio Valentino

Thanks for your quick reply. Ticket filed here: https://github.com/PyTables/PyTables/issues/267

Best, -Tony

Thanks Tony, I have made my comments on the issue, but the short version is that I don't think this is a bug, iteration needs a rewrite, and you should use iterrows().

Be Well
Anthony

PS: You should upgrade to 3.0 and use the new API :)

[1] https://github.com/PyTables/PyTables/issues
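A sketch of the iterrows() approach Anthony recommends, combined with itertools.islice; the file and table names are hypothetical:

    import itertools
    import tables as tb

    with tb.open_file('data.h5', 'r') as f:
        table = f.root.mytable

        # Each iterrows() call returns a fresh iterator, so slicing it
        # does not interact with shared iteration state on the table.
        for row in itertools.islice(table.iterrows(), 10):
            print(row.fetch_all_fields())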
Re: [Pytables-users] Storing large images in PyTable
On Fri, Jul 5, 2013 at 8:40 AM, Francesc Alted fal...@gmail.com wrote:

On 7/5/13 1:33 AM, Mathieu Dubois wrote:

    tables.tableExtension.Table._createTable (tables/tableExtension.c:2181)
    tables.exceptions.HDF5ExtError: Problems creating the table

I think that the size of the column is too large (if I remove the Image field, everything works perfectly).

Hi Mathieu, this shouldn't be the case. What is the value of IMAGE_SIZE?

IMAGE_SIZE is a tuple containing (121, 145, 121).

This is a bit large for a row in the Table object. My recommendation for these cases is to use an associated EArray with shape (0, 121, 145, 121) and then append the images there. You can always refer to an image by issuing a __getitem__() operation on the EArray object with the index of the row in the table. Easy as pie, and you will allow the compression library (in case you are using compression) to work much more efficiently than for the table.

HTH, -- Francesc Alted

Hi Francesc,

I disagree that this shape is too large for a table. Here is a minimal example that works for me:

    import tables as tb
    import numpy as np

    images = np.ones(100, dtype=[('id', np.uint16),
                                 ('image', np.float32, (121, 145, 121))])
    with tb.open_file('temp.h5', 'w') as f:
        f.create_table('/', 'images', images)

I think that there is something else going on with the initialization, but Mathieu hasn't given us enough information to figure it out =/. A minimal failing script would be super helpful here! (BTW Mathieu, Tables can also take advantage of compression, though Francesc's solution is nicer for a lot of reasons too.)

Be Well
Anthony
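A sketch of the split layout Francesc suggests: scalar subject info in a small table, with images appended in row order to an EArray (the names and the toy loop are illustrative):

    import numpy as np
    import tables as tb

    with tb.open_file('subjects.h5', 'w') as f:
        subjects = f.create_table('/', 'subjects', np.dtype([('Id', np.uint16)]))
        images = f.create_earray('/', 'images', tb.Float32Atom(),
                                 shape=(0, 121, 145, 121),
                                 filters=tb.Filters(complevel=1))
        for i in range(3):
            subjects.append([(i,)])   # one scalar row per subject
            images.append(np.ones((1, 121, 145, 121), dtype=np.float32))

        img0 = f.root.images[0]       # the image for table row 0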
Re: [Pytables-users] Storing large images in PyTable
Thanks Mathieu! I am glad this is working for you now. File this one under "Mysterious Errors of the Universe" :).

Be Well
Anthony

On Fri, Jul 5, 2013 at 6:51 PM, Mathieu Dubois duboismathieu_g...@yahoo.fr wrote:

Hi, sorry for the late response. First of all, I have managed to achieve what I wanted to do differently. The code Francesc sent works well (I had to adapt it because I use version 2.3.1 under Ubuntu 12.04). I was able to reproduce something similar with a class like this (copy-pasted from the tutorial):

    import tables as tb
    import numpy as np

    class Subject(tb.IsDescription):
        # Subject information
        Id = tb.UInt16Col()
        Image = tb.Float32Col(shape=(121, 145, 121))

    h5file = tb.openFile("tutorial1.h5", mode="w", title="Test file")
    group = h5file.createGroup("/", 'subject', 'Subject information')
    table = h5file.createTable(group, 'readout', Subject, "Readout example")
    subject = table.row
    for i in xrange(10):
        subject['Id'] = i
        subject['Image'] = np.ones((121, 145, 121))
        subject.append()
    table.flush()
    h5file.close()

This code works well too, so I don't really know why nothing was working yesterday: it was the same class and a very similar program. I will try to investigate later.

Thanks for everything, Mathieu
Re: [Pytables-users] Storing large images in PyTable
On Thu, Jul 4, 2013 at 4:13 PM, Mathieu Dubois duboismathieu_g...@yahoo.fr wrote:

Hello, I'm a beginner with PyTables. I wanted to store a database in an HDF5 file using PyTables. The DB is made of a CSV file (which contains the subject information) and a lot of images (I work on MRI, so the images are 3-dimensional float32 arrays of shape (121, 145, 121)). The relation is very simple: there are 3 images per subject. My first idea was to create a class Subject like this:

    class Subject(tables.IsDescription):
        # Subject information
        Id = tables.UInt16Col()
        ...
        Image = tables.Float32Col(shape=IMAGE_SIZE)

and then proceed as in the tutorial (open a file, create a group and a table associated with the Subject class, and then append data to this table). Unfortunately I got an error when creating the table (even before inserting data):

    HDF5-DIAG: Error detected in HDF5 (1.8.4-patch1) thread 140612945950464:
      #000: ../../../src/H5Ddeprec.c line 170 in H5Dcreate1(): unable to create dataset
        major: Dataset
        minor: Unable to initialize object
      #001: ../../../src/H5Dint.c line 428 in H5D_create_named(): unable to create and link to dataset
        major: Dataset
        minor: Unable to initialize object
      #002: ../../../src/H5L.c line 1639 in H5L_link_object(): unable to create new link to object
        major: Links
        minor: Unable to initialize object
      #003: ../../../src/H5L.c line 1862 in H5L_create_real(): can't insert link
        major: Symbol table
        minor: Unable to insert object
      #004: ../../../src/H5Gtraverse.c line 877 in H5G_traverse(): internal path traversal failed
        major: Symbol table
        minor: Object not found
      #005: ../../../src/H5Gtraverse.c line 703 in H5G_traverse_real(): traversal operator failed
        major: Symbol table
        minor: Callback failed
      #006: ../../../src/H5L.c line 1685 in H5L_link_cb(): unable to create object
        major: Object header
        minor: Unable to initialize object
      #007: ../../../src/H5O.c line 2677 in H5O_obj_create(): unable to open object
        major: Object header
        minor: Can't open object
      #008: ../../../src/H5Doh.c line 296 in H5O_dset_create(): unable to create dataset
        major: Dataset
        minor: Unable to initialize object
      #009: ../../../src/H5Dint.c line 1034 in H5D_create(): can't update the metadata cache
        major: Dataset
        minor: Unable to initialize object
      #010: ../../../src/H5Dint.c line 799 in H5D_update_oh_info(): unable to update new fill value header message
        major: Dataset
        minor: Unable to initialize object
      #011: ../../../src/H5Omessage.c line 188 in H5O_msg_append_oh(): unable to create new message in header
        major: Attribute
        minor: Unable to insert object
      #012: ../../../src/H5Omessage.c line 228 in H5O_msg_append_real(): unable to create new message
        major: Object header
        minor: No space available for allocation
      #013: ../../../src/H5Omessage.c line 1940 in H5O_msg_alloc(): unable to allocate space for message
        major: Object header
        minor: Unable to initialize object
      #014: ../../../src/H5Oalloc.c line 1032 in H5O_alloc(): object header message is too large
        major: Object header
        minor: Unable to initialize object
    Traceback (most recent call last):
      File "00_build_dataset.tmp.py", line 52, in <module>
        dump_in_hdf5(**vars(args))
      File "00_build_dataset.tmp.py", line 32, in dump_in_hdf5
        data_api.Subject)
      File "/usr/lib/python2.7/dist-packages/tables/file.py", line 770, in createTable
        chunkshape=chunkshape, byteorder=byteorder)
      File "/usr/lib/python2.7/dist-packages/tables/table.py", line 832, in __init__
        byteorder, _log)
      File "/usr/lib/python2.7/dist-packages/tables/leaf.py", line 291, in __init__
        super(Leaf, self).__init__(parentNode, name, _log)
      File "/usr/lib/python2.7/dist-packages/tables/node.py", line 296, in __init__
        self._v_objectID = self._g_create()
      File "/usr/lib/python2.7/dist-packages/tables/table.py", line 983, in _g_create
        self._v_new_title, self.filters.complib or '', obversion)
      File "tableExtension.pyx", line 195, in tables.tableExtension.Table._createTable (tables/tableExtension.c:2181)
    tables.exceptions.HDF5ExtError: Problems creating the table

I think that the size of the column is too large (if I remove the Image field, everything works perfectly).

Hi Mathieu, this shouldn't be the case. What is the value of IMAGE_SIZE?

Be Well
Anthony

Therefore, what is the best way to store the images (while keeping the relation)? I have read various posts about this subject on the web but could not find a definitive answer (the most helpful was http://stackoverflow.com/questions/8843062/python-how-to-store-a-numpy-multidimensional-array-in-pytables). I was thinking of creating an extensible array and storing each image in the same order as the subjects. However, I
Re: [Pytables-users] writing metadata
Also, depending on how much metadata you really need to store, you could just use attributes. That is what they are there for.

On Tue, Jun 25, 2013 at 10:06 AM, Josh Ayers josh.ay...@gmail.com wrote:

Another option is to create a Python object -- dict, list, or whatever works -- containing the metadata and then store a pickled version of it in a PyTables array. It's nice for this sort of thing because you have the full flexibility of Python's data containers. For example, if the Python object is called 'fit', then

    numpy.frombuffer(pickle.dumps(fit), 'u1')

will pickle it and convert the result to a NumPy array of unsigned bytes. It can be stored in a PyTables array using a UInt8Atom. To retrieve the Python object, just use

    pickle.loads(hdf5_file.root.data_1.fit[:])

It gets a little more complicated if you want to be able to modify the Python object, because the length of the pickle will change. In that case, you can use an EArray (for the case when the pickle grows) and store the number of bytes as an attribute. Storing the number of bytes handles the case when the pickle shrinks and doesn't use the full length of the on-disk array. To load it, use

    pickle.loads(hdf5_file.root.data_1.fit[:num_bytes])

where num_bytes is the previously stored attribute. To modify it, just overwrite the array with the new version, expanding if necessary, then update the num_bytes attribute. Using a PyTables VLArray with an 'object' atom uses a similar technique under the hood, so that may be easier. It doesn't allow resizing, though.

Hope that helps,
Josh

On Tue, Jun 25, 2013 at 1:33 AM, Andreas Hilboll li...@hilboll.de wrote:

On 25.06.2013 10:26, Andre' Walker-Loud wrote:

Dear PyTables users, I am trying to figure out the best way to write some metadata into some files I have. The HDF5 file looks like

    /root/data_1/stat
    /root/data_1/sys

where stat and sys are Arrays containing statistical and systematic fluctuations of numerical fits to some data I have. What I would like to do is add another object

    /root/data_1/fit

where fit is just a metadata key that describes all the choices I made in performing the fit, such as the seed for the random number generator and the many fitting options (initial parameter guesses, fitting range, etc.). I began to follow the example in the PyTables manual, Section 1.2 "The Object Tree", where first a class is defined

    class Particle(tables.IsDescription):
        identity = tables.StringCol(itemsize=22, dflt=" ", pos=0)
        ...

and then this class is used to populate a table. In my case I won't have a table, but really just want a single object containing my metadata. I am wondering if there is a recommended way to do this? The Table does not seem optimal, but I don't see what else I would use.

For complex information I'd probably indeed use a table object. It doesn't matter if the table only has one row; you still have all the information there, nicely structured.

-- Andreas.
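A sketch of the attribute approach Anthony mentions; PyTables attribute sets transparently pickle arbitrary Python objects, so the fit options can hang directly off the group (the layout mirrors the thread's hypothetical file):

    import tables as tb

    with tb.open_file('fits.h5', 'a') as f:
        data_1 = f.root.data_1        # assumes this group already exists

        # Non-native values (dicts, lists, ...) are pickled automatically.
        data_1._v_attrs.fit = {
            'seed': 12345,
            'initial_guess': [1.0, 0.5],
            'fit_range': (10, 200),
        }

    with tb.open_file('fits.h5', 'r') as f:
        options = f.root.data_1._v_attrs.fit    # round-trips as a dict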
Re: [Pytables-users] Speed of CArray writing sparse matrices
Hello Giovanni,

Great to hear that everything is working much better for you now, and that everything is much faster and smaller than NPY ;)

Do you know how the default value is set, btw?

This is computed via a heuristic algorithm written by Francesc (?) called computechunksize() [1]. It is really optimized for dense data (Tables), so it is not surprising that it performs poorly in your case. Any updates you want to make to PyTables to also handle sparse data well out of the box would be very welcome ;)

1. https://github.com/PyTables/PyTables/blob/develop/tables/idxutils.py#L54

On Mon, Jun 24, 2013 at 10:51 AM, Giovanni Luca Ciampaglia glciamp...@gmail.com wrote:

Hi Anthony, thanks for the explanation and the links, it's much clearer now. So without compression a CArray is really a smarter type of sparse file, but you have to set a sensible chunk shape. Do you know how the default value is set, btw? I am asking because I didn't see any change in performance between using the default value and using (1, N), where (N, N) is the shape of the matrix. I guess the write performance depends crucially on the size of the I/O buffer, so the default must be choosing a similar setting. Anyway, I have played a bit with other values of the chunk shape in conjunction with the compression level, and using a shape of (1, 100) with complevel=5 gives speeds that are only 10-15% slower than what I get with shape=(1, 1) and complevel=0. The resulting file is 10 times smaller, and something like 35 times smaller than an NPY sparse file, btw!

Thanks! Giovanni
Re: [Pytables-users] Speed of in-kernel Full-Table Search
On Mon, Jun 24, 2013 at 4:25 AM, Wagner Sebastian sebastian.wagner...@ait.ac.at wrote:

Dear PyTables-Users,

For testing purposes I use a PyTables DB with 4 columns (1x Uint8 and 3x Float) with 750k rows, the total file size about 90MB. As the free version does not support indexing, I thought that a full-table search on this database would take at least one or two seconds, because the file has to be loaded first (the bottleneck is I/O), and then the search over ~20k rows can begin. But PyTables took only 0.05 seconds for a full-table search (in-kernel, so near C-speed, but nevertheless full table), while my bisecting algorithm with a precomputed sorted list wrapped around PyTables (but saved in there) took about 0.5 seconds.

So the thing I don't understand: How can PyTables be so fast without any indexing?

Hi Sebastian,

First, there is no longer a non-free version of PyTables, and v3.0 *does* have indexing capabilities. However, you have to enable them, so you probably weren't using them. PyTables is fast because HDF5 is a binary format, it uses pthreads under the covers to parallelize some tasks, and it uses numexpr (which is also parallel) to evaluate many expressions. All of these things help make PyTables great!

Be Well
Anthony

I'm using 3.0.0rc2 coming with WinPython.

Regards,
Sebastian
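[Editor's note: to make "you have to enable them" concrete, here is a small sketch of indexing a column with the PyTables 3.0 API and running the same query both ways; the file and column names are made up for illustration.]

    import tables

    with tables.open_file('data.h5', 'a') as h5f:
        tbl = h5f.root.mytable
        # in-kernel query: numexpr scans the whole table at near C speed
        hits = tbl.read_where('value > 0.5')
        # build an index on the column; later queries on it use the index
        tbl.cols.value.create_index()
        hits = tbl.read_where('value > 0.5')  # now an indexed query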
Re: [Pytables-users] Speed of CArray writing sparse matrices
Hi Giovanni!

I think that you may have some misunderstanding about how chunking works, which is leading you to get terrible performance. In fact, what you describe is a great strategy (write all and zip) for using normal Arrays. However, chunking and CArrays don't work like this. If a chunk contains no data, it is not written at all! Also, all zipping takes place at the chunk level. Thus, for very small chunks you can actually increase the file size and access time by using compression.

For sparse matrices and CArrays, you need to play around with the chunkshape argument to create_carray() and compression. Performance is going to be affected by how dense the matrix is and how grouped it is. For example, for a very dense and randomly distributed matrix, chunkshape=1 and no compression is best. For block diagonal matrices, the chunkshape should be the nominal block shape. Compression is only useful here if the blocks all have similar values or the block shape is large. For example

    1 1 0 0 0 0
    1 1 0 0 0 0
    0 0 1 1 0 0
    0 0 1 1 0 0
    0 0 0 0 1 1
    0 0 0 0 1 1

is well suited to a chunkshape=(2, 2).

For more information on the HDF model please see my talk slides and video :) [1,2] I hope this helps.

Be Well
Anthony

PS. Glad to see you using the new API ;)

1. https://github.com/scopatz/hdf5-is-for-lovers
2. http://www.youtube.com/watch?v=Nzx0HAd3FiI

On Sat, Jun 22, 2013 at 6:34 PM, Giovanni Luca Ciampaglia glciamp...@gmail.com wrote:

Hi all, I have a sparse 3.4M x 3.4M adjacency matrix with nnz = 23M and wanted to see if CArray was an appropriate solution for storing it. Right now I am using the NumPy binary format for storing the data in coordinate format and loading the matrix with Scipy's sparse coo_matrix class. As far as I understand, with CArray the matrix would be written in full (zeros included), but a) since it's chunked, accessing it does not take much memory, and b) with compression enabled it would be possible to keep the size of the file reasonable. If my assumptions are correct, then here is my problem: I am running into problems when writing the CArray to disk. I adapted the example from the documentation [1], and when I run the code on a 6000x6000 matrix with nnz = 17K I achieve a decent speed of roughly 4100 elements/s. However, when I try it on the full matrix the writing speed drops to 4 elements/s. Am I doing something wrong? Any feedback would be greatly appreciated!

Code: https://gist.github.com/junkieDolphin/5843064

Cheers, Giovanni

[1] http://pytables.github.io/usersguide/libref/homogenous_storage.html#the-carray-class

--
Giovanni Luca Ciampaglia
☞ http://www.inf.usi.ch/phd/ciampaglia/
✆ (812) 287-3471
✉ glciamp...@gmail.com
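[Editor's note: one way to act on the advice above for data that is already in coordinate format is to convert to a row-oriented sparse type and write whole row blocks, so each chunk is written once rather than once per element. A rough sketch under those assumptions; the sizes, chunkshape, and block size are placeholders to tune.]

    import numpy as np
    import scipy.sparse as sp
    import tables

    # stand-in sparse data in coordinate format
    nnz, n = 17000, 6000
    rows = np.random.randint(0, n, size=nnz)
    cols = np.random.randint(0, n, size=nnz)
    vals = np.ones(nnz, dtype=np.float32)
    csr = sp.coo_matrix((vals, (rows, cols)), shape=(n, n)).tocsr()  # row slices are cheap

    with tables.open_file('adj.h5', 'w') as h5f:
        carr = h5f.create_carray(h5f.root, 'adj', tables.Float32Atom(dflt=0.0),
                                 shape=csr.shape, chunkshape=(1, 100),
                                 filters=tables.Filters(complevel=5, complib='blosc'))
        blk = 256  # rows per write; tune together with chunkshape
        for i0 in range(0, csr.shape[0], blk):
            i1 = min(i0 + blk, csr.shape[0])
            carr[i0:i1, :] = csr[i0:i1, :].toarray()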
Re: [Pytables-users] append to multiple tables
Hi Ed,

Are you inside of a nested loop? You probably just need to flush after the innermost loop. Do you have some sample code you can share?

Be Well
Anthony

On Mon, Jun 10, 2013 at 1:44 PM, Edward Vogel edwardvog...@gmail.com wrote:

I have a dataset that I want to split between two tables. But when I iterate over the data and append to both tables, I get a warning:

/usr/local/lib/python2.7/site-packages/tables/table.py:2967: PerformanceWarning: table ``/cv2`` is being preempted from alive nodes without its buffers being flushed or with some index being dirty. This may lead to very ineficient use of resources and even to fatal errors in certain situations. Please do a call to the .flush() or .reindex_dirty() methods on this table before start using other nodes.

However, if I flush after every append, I get awful performance. Is there a correct way to append to two tables without doing a flush? Note, I don't have any indices defined, so it seems reindex_dirty() doesn't apply.

Thanks,
Ed
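[Editor's note: a sketch of the pattern Anthony suggests, keeping both row buffers live inside the loop and flushing once after it rather than after every append. The table and column names are hypothetical.]

    import tables

    class Rec(tables.IsDescription):
        x = tables.Float64Col()

    with tables.open_file('split.h5', 'w') as h5f:
        t1 = h5f.create_table('/', 'cv1', Rec)
        t2 = h5f.create_table('/', 'cv2', Rec)
        r1, r2 = t1.row, t2.row
        for i in range(100000):
            r1['x'] = i * 0.5
            r1.append()
            r2['x'] = i * 2.0
            r2.append()
        # one flush per table after the loop, not one per append
        t1.flush()
        t2.flush()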
Re: [Pytables-users] Chunk selection for optimized data access
Thanks Antonio and Tim! These are great. I think that one of these should definitely make it into the examples/ dir.

Be Well
Anthony

On Wed, Jun 5, 2013 at 8:10 AM, Francesc Alted fal...@gmail.com wrote:

On 6/5/13 11:45 AM, Andreas Hilboll wrote:

On 05.06.2013 10:31, Andreas Hilboll wrote:

On 05.06.2013 03:29, Tim Burgess wrote:

I was playing around with in-memory HDF5 prior to the 3.0 release. Here's an example based on what I was doing. I looked over the docs and it does mention that there is an option to throw away the 'file' rather than write it to disk. Not sure how to do that and can't actually think of a use case where I would want to :-) And be wary, it is H5FD_CORE.

On Jun 05, 2013, at 08:38 AM, Anthony Scopatz scop...@gmail.com wrote:

I think that you want to set parameters.DRIVER to H5FD_CORE [1]. I haven't ever used this personally, but it would be great to have an example script, if someone wants to write one ;)

    import numpy as np
    import tables

    CHUNKY = 30
    CHUNKX = 8640

    if __name__ == '__main__':
        # create dataset and add global attrs
        file_path = 'demofile_chunk%sx%d.h5' % (CHUNKY, CHUNKX)

        with tables.open_file(file_path, 'w',
                              title='PyTables HDF5 In-memory example',
                              driver='H5FD_CORE') as h5f:
            # dummy some data
            lats = np.empty([4320])
            lons = np.empty([8640])

            # create some simple arrays
            lat_node = h5f.create_array('/', 'lat', lats, title='latitude')
            lon_node = h5f.create_array('/', 'lon', lons, title='longitude')

            # create a 365 x 4320 x 8640 CArray of 32bit float
            shape = (365, 4320, 8640)
            atom = tables.Float32Atom(dflt=np.nan)

            # chunk into daily slices and then further chunk days
            sst_node = h5f.create_carray(h5f.root, 'sst', atom, shape,
                                         chunkshape=(1, CHUNKY, CHUNKX))

            # dummy up an ndarray
            sst = np.empty([4320, 8640], dtype=np.float32)
            sst.fill(30.0)

            # write ndarray to a 2D plane in the HDF5
            sst_node[0] = sst

Thanks Tim, I adapted your example for my use case (I'm using the EArray class, because I need to continuously update my database), and it works well. However, when I use this with my own data (but also creating the arrays like you did), I'm running into errors like "Could not wait on barrier". It seems like the HDF library is spawning several threads. Any idea what's going wrong? Can I somehow avoid HDF5 multithreading at runtime?

Update: When setting max_blosc_threads=2 and max_numexpr_threads=2, everything seems to work as expected (but a bit on the slow side ...).

BTW, can you really notice the difference between using 1, 2 or 4 threads? Can you show some figures? Just curious.

-- Francesc Alted
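[Editor's note: for reference, the two thread caps Andreas mentions live in tables.parameters and can be set before the file is opened; a minimal sketch, where the value 2 is simply the setting from his test.]

    import tables

    tables.parameters.MAX_BLOSC_THREADS = 2
    tables.parameters.MAX_NUMEXPR_THREADS = 2

    h5f = tables.open_file('data.h5', mode='r')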
Re: [Pytables-users] pytable 30 - encoding
Hi Jeff,

I have made some comments in the issue. Thanks for investigating this so thoroughly.

Be Well
Anthony

On Tue, Jun 4, 2013 at 8:16 PM, Jeff Reback jreb...@yahoo.com wrote:

Anthony, I created an issue with more info. I am not sure if this is a bug, or just the way both ne/pytables treat strings that need to touch an encoded value; I found a workaround by specifying the condvars to readWhere. Any more thoughts on this? thanks, Jeff

https://github.com/PyTables/PyTables/issues/265

I can be reached on my cell (917)971-6387

From: Anthony Scopatz scop...@gmail.com
To: Jeff Reback j...@reback.net
Cc: Discussion list for PyTables pytables-users@lists.sourceforge.net
Sent: Tuesday, June 4, 2013 6:39 PM
Subject: Re: [Pytables-users] pytable 30 - encoding

Hi Jeff,

Hmmm, could you try doing the same thing on just an in-memory numpy array using numexpr? If this succeeds, it tells us that the problem is in PyTables, not numexpr.

Be Well
Anthony

On Tue, Jun 4, 2013 at 11:35 AM, Jeff Reback jreb...@yahoo.com wrote:

Anthony, I am using numexpr 2.1 (latest). This is puzzling; it doesn't matter what I pass (bytes or str), same result? (column == 'str-2')

    /mnt/code/arb/test/pytables-3.py(38)<module>()
    -> result = handle.root.test.table.readWhere(selector)
    (Pdb) handle.root.test.table.readWhere(selector)
    *** TypeError: string argument without an encoding
    (Pdb) handle.root.test.table.readWhere(selector.encode(encoding))
    *** TypeError: string argument without an encoding
    (Pdb)

From: Anthony Scopatz scop...@gmail.com
To: Jeff Reback j...@reback.net; Discussion list for PyTables pytables-users@lists.sourceforge.net
Sent: Tuesday, June 4, 2013 12:25 PM
Subject: Re: [Pytables-users] pytable 30 - encoding

Hi Jeff,

Have you also updated numexpr to the most recent version? The error is coming from numexpr not compiling the expression correctly. Also, you might try making selector a str, rather than bytes:

    selector = "(column == 'str-2')"

rather than

    selector = "(column == 'str-2')".encode(encoding)

Be Well
Anthony

On Tue, Jun 4, 2013 at 8:51 AM, Jeff Reback jreb...@yahoo.com wrote:

Anthony, where am I going wrong here?
    #!/usr/local/bin/python3
    import tables
    import numpy as np
    import datetime, time

    encoding = 'UTF-8'

    test_file = 'test_select.h5'
    handle = tables.openFile(test_file, "w")
    node = handle.createGroup(handle.root, 'test')
    table = handle.createTable(node, 'table', dict(
        index = tables.Int64Col(),
        column = tables.StringCol(25),
        values = tables.FloatCol(shape=(3)),
    ))

    # add data
    r = table.row
    for i in range(10):
        r['index'] = i
        r['column'] = ("str-%d" % (i % 5)).encode(encoding)
        r['values'] = np.arange(3)
        r.append()
    table.flush()
    handle.close()

    # read
    handle = tables.openFile(test_file, "r")
    result = handle.root.test.table.read()
    print("table data\n")
    print(result)

    # where
    print("\nselector\n")
    selector = "(column == 'str-2')".encode(encoding)
    print(selector)
    result = handle.root.test.table.readWhere(selector)
    print(result)

and the following out:

    [sheep-jreback-/code/arb/test] python3 pytables-3.py
    table data

    [(b'str-0', 0, [0.0, 1.0, 2.0]) (b'str-1', 1, [0.0, 1.0, 2.0])
     (b'str-2', 2, [0.0, 1.0, 2.0]) (b'str-3', 3, [0.0, 1.0, 2.0])
     (b'str-4', 4, [0.0, 1.0, 2.0]) (b'str-0', 5, [0.0, 1.0, 2.0])
     (b'str-1', 6, [0.0, 1.0, 2.0]) (b'str-2', 7, [0.0, 1.0, 2.0])
     (b'str-3', 8, [0.0, 1.0, 2.0]) (b'str-4', 9, [0.0, 1.0, 2.0])]

    selector

    b"(column == 'str-2')"

    Traceback (most recent call last):
      File "pytables-3.py", line 37, in <module>
        result = handle.root.test.table.readWhere(selector)
      File "/usr/local/lib/python3.3/site-packages/tables-3.0.0-py3.3-linux-x86_64.egg/tables/_past.py", line 35, in oldfunc
        return obj(*args, **kwargs)
      File "/usr/local/lib/python3.3/site-packages/tables-3.0.0-py3.3-linux-x86_64.egg/tables/table.py", line 1522, in read_where
        self._where(condition, condvars, start, stop, step)]
      File "/usr/local/lib/python3.3/site-packages/tables-3.0.0-py3.3-linux-x86_64.egg/tables/table.py", line 1484, in _where
        compiled = self._compile_condition(condition, condvars)
      File "/usr/local/lib/python3.3/site-packages/tables-3.0.0-py3.3-linux-x86_64.egg/tables/table.py", line 1358, in _compile_condition
        compiled = compile_condition(condition, typemap, indexedcols)
      File "/usr/local/lib/python3.3/site-packages/tables-3.0.0-py3.3-linux-x86_64.egg/tables/conditions.py", line 419, in compile_condition
        func = NumExpr(expr, signature)
      File "/usr/local/lib/python3.3/site-packages/numexpr-2.1-py3.3-linux-x86_64.egg/numexpr/necompiler.py", line 559, in NumExpr
        precompile(ex, signature, context)
      File "/usr/local/lib/python3.3/site-packages/numexpr-2.1-py3.3-linux-x86_64.egg/numexpr/necompiler.py", line 511, in precompile
        constants_order, constants = getConstants(ast)
      File "/usr/local/lib/python3.3/site-packages/numexpr-2.1-py3.3-linux-x86_64.egg/numexpr/necompiler.py", line 294, in getConstants
Re: [Pytables-users] pytable 30 - encoding
Hi Jeff,

Have you also updated numexpr to the most recent version? The error is coming from numexpr not compiling the expression correctly. Also, you might try making selector a str, rather than bytes:

    selector = "(column == 'str-2')"

rather than

    selector = "(column == 'str-2')".encode(encoding)

Be Well
Anthony

On Tue, Jun 4, 2013 at 8:51 AM, Jeff Reback jreb...@yahoo.com wrote:

Anthony, where am I going wrong here?

    #!/usr/local/bin/python3
    import tables
    import numpy as np
    import datetime, time

    encoding = 'UTF-8'

    test_file = 'test_select.h5'
    handle = tables.openFile(test_file, "w")
    node = handle.createGroup(handle.root, 'test')
    table = handle.createTable(node, 'table', dict(
        index = tables.Int64Col(),
        column = tables.StringCol(25),
        values = tables.FloatCol(shape=(3)),
    ))

    # add data
    r = table.row
    for i in range(10):
        r['index'] = i
        r['column'] = ("str-%d" % (i % 5)).encode(encoding)
        r['values'] = np.arange(3)
        r.append()
    table.flush()
    handle.close()

    # read
    handle = tables.openFile(test_file, "r")
    result = handle.root.test.table.read()
    print("table data\n")
    print(result)

    # where
    print("\nselector\n")
    selector = "(column == 'str-2')".encode(encoding)
    print(selector)
    result = handle.root.test.table.readWhere(selector)
    print(result)

and the following out:

    [sheep-jreback-/code/arb/test] python3 pytables-3.py
    table data

    [(b'str-0', 0, [0.0, 1.0, 2.0]) (b'str-1', 1, [0.0, 1.0, 2.0])
     (b'str-2', 2, [0.0, 1.0, 2.0]) (b'str-3', 3, [0.0, 1.0, 2.0])
     (b'str-4', 4, [0.0, 1.0, 2.0]) (b'str-0', 5, [0.0, 1.0, 2.0])
     (b'str-1', 6, [0.0, 1.0, 2.0]) (b'str-2', 7, [0.0, 1.0, 2.0])
     (b'str-3', 8, [0.0, 1.0, 2.0]) (b'str-4', 9, [0.0, 1.0, 2.0])]

    selector

    b"(column == 'str-2')"

    Traceback (most recent call last):
      File "pytables-3.py", line 37, in <module>
        result = handle.root.test.table.readWhere(selector)
      File "/usr/local/lib/python3.3/site-packages/tables-3.0.0-py3.3-linux-x86_64.egg/tables/_past.py", line 35, in oldfunc
        return obj(*args, **kwargs)
      File "/usr/local/lib/python3.3/site-packages/tables-3.0.0-py3.3-linux-x86_64.egg/tables/table.py", line 1522, in read_where
        self._where(condition, condvars, start, stop, step)]
      File "/usr/local/lib/python3.3/site-packages/tables-3.0.0-py3.3-linux-x86_64.egg/tables/table.py", line 1484, in _where
        compiled = self._compile_condition(condition, condvars)
      File "/usr/local/lib/python3.3/site-packages/tables-3.0.0-py3.3-linux-x86_64.egg/tables/table.py", line 1358, in _compile_condition
        compiled = compile_condition(condition, typemap, indexedcols)
      File "/usr/local/lib/python3.3/site-packages/tables-3.0.0-py3.3-linux-x86_64.egg/tables/conditions.py", line 419, in compile_condition
        func = NumExpr(expr, signature)
      File "/usr/local/lib/python3.3/site-packages/numexpr-2.1-py3.3-linux-x86_64.egg/numexpr/necompiler.py", line 559, in NumExpr
        precompile(ex, signature, context)
      File "/usr/local/lib/python3.3/site-packages/numexpr-2.1-py3.3-linux-x86_64.egg/numexpr/necompiler.py", line 511, in precompile
        constants_order, constants = getConstants(ast)
      File "/usr/local/lib/python3.3/site-packages/numexpr-2.1-py3.3-linux-x86_64.egg/numexpr/necompiler.py", line 294, in getConstants
        for a in constants_order]
      File "/usr/local/lib/python3.3/site-packages/numexpr-2.1-py3.3-linux-x86_64.egg/numexpr/necompiler.py", line 294, in <listcomp>
        for a in constants_order]
      File "/usr/local/lib/python3.3/site-packages/numexpr-2.1-py3.3-linux-x86_64.egg/numexpr/necompiler.py", line 284, in convertConstantToKind
        return kind_to_type[kind](x)
    TypeError: string argument without an encoding
    Closing remaining open files: test_select.h5...
    done
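[Editor's note: Jeff's condvars workaround is not spelled out in the thread; under the guess that it passes the encoded value in through condvars instead of embedding it in the condition string, it would look something like this.]

    # hypothetical reconstruction of the workaround mentioned above
    val = 'str-2'.encode(encoding)
    result = handle.root.test.table.readWhere('column == val',
                                              condvars={'val': val})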
Re: [Pytables-users] Chunk selection for optimized data access
Hi Andreas,

First off, nothing should be this bad, but what is the data type of the array? Also, are you selecting chunksize manually or letting PyTables figure it out? Here are some things that you can try:

1. Query with fancy indexing, once. That is, rather than using a list comprehension, just say _a[zip(*idx)].
2. Set _a.nrowsinbuf [1] to a much smaller value (1, 5, or 10), which is more appropriate for pulling out individual indexes.

Lastly, it is my opinion that the iteration mechanics are slower than they can / should be. I have a bunch of ideas about how to make them faster AND clean up the code base, but I won't have a ton of time to work on them in the near term. However, if this is something that you are interested in, that would be great! I'd love to help out anyone who is willing to take this on.

Be Well
Anthony

1. http://pytables.github.io/usersguide/libref/hierarchy_classes.html#tables.Leaf.nrowsinbuf

On Mon, Jun 3, 2013 at 7:45 AM, Andreas Hilboll li...@hilboll.de wrote:

On 03.06.2013 14:43, Andreas Hilboll wrote:

Hi, I'm storing large datasets (5760 x 2880 x ~150) in a compressed EArray (the last dimension represents time, and once per month there'll be one more 5760x2880 array to add to the end). Now, extracting timeseries at one index location is slow; e.g., for four indices, it takes several seconds:

    In [19]: idx = ((5000, 600, 800, 900), (1000, 2000, 500, 1))
    In [20]: %time AA = np.vstack([_a[i,j] for i,j in zip(*idx)])
    CPU times: user 4.31 s, sys: 0.07 s, total: 4.38 s
    Wall time: 7.17 s

I have the feeling that this performance could be improved, but I'm not sure about how to properly use the `chunkshape` parameter in my case. Any help is greatly appreciated :)

Cheers, Andreas.

PS: If I could get significant performance gains by not using an EArray and therefore re-creating the whole database each month, then this would also be an option.

--
Andreas.
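[Editor's note: a sketch of suggestion 2 above; the node name is hypothetical, and the buffer value 10 is one of the starting points Anthony lists to experiment with.]

    import numpy as np
    import tables

    h5f = tables.open_file('data.h5', mode='r')
    _a = h5f.root.myarray  # hypothetical EArray node

    # the default nrowsinbuf is tuned for long sequential scans;
    # shrink it when pulling out scattered individual indexes
    _a.nrowsinbuf = 10

    idx = ((5000, 600, 800, 900), (1000, 2000, 500, 1))
    AA = np.vstack([_a[i, j] for i, j in zip(*idx)])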
[Pytables-users] Anyone want to present at PyData Boston, July 27-28th
Hey everyone,

Leah Silen (CC'd) of NumFOCUS was wondering if anyone wanted to give a talk or tutorial about PyTables at PyData Boston [1]. I don't think that I'll be able to make it, but I highly encourage others to take her up on this. This sort of thing shouldn't be too hard to put together since I have already assembled a repo of slides and exercises for a 4 hour long tutorial [2]. Feel free to use them!

Be Well
Anthony

1. http://pydata.org/bos2013/
2. https://github.com/scopatz/hdf5-is-for-lovers
Re: [Pytables-users] Chunk selection for optimized data access
Oops! I forgot to mention CArray!

On Mon, Jun 3, 2013 at 10:35 PM, Tim Burgess timburg...@mac.com wrote:

My thoughts are:

- Try it without any compression. Assuming 32-bit floats, your monthly 5760 x 2880 slice is only about 65MB. Uncompressed data may perform well, and at the least it will give you a baseline to work from - and will help if you are investigating IO tuning.
- I have found with CArray that the auto chunksize works fairly well. Experiment with that chunksize and with some chunksizes that you think are more appropriate (maybe temporal rather than spatial in your case).

On Jun 03, 2013, at 10:45 PM, Andreas Hilboll li...@hilboll.de wrote:

On 03.06.2013 14:43, Andreas Hilboll wrote:

Hi, I'm storing large datasets (5760 x 2880 x ~150) in a compressed EArray (the last dimension represents time, and once per month there'll be one more 5760x2880 array to add to the end). Now, extracting timeseries at one index location is slow; e.g., for four indices, it takes several seconds:

    In [19]: idx = ((5000, 600, 800, 900), (1000, 2000, 500, 1))
    In [20]: %time AA = np.vstack([_a[i,j] for i,j in zip(*idx)])
    CPU times: user 4.31 s, sys: 0.07 s, total: 4.38 s
    Wall time: 7.17 s

I have the feeling that this performance could be improved, but I'm not sure about how to properly use the `chunkshape` parameter in my case. Any help is greatly appreciated :)

Cheers, Andreas.

PS: If I could get significant performance gains by not using an EArray and therefore re-creating the whole database each month, then this would also be an option.

--
Andreas.
Re: [Pytables-users] ANN: PyTables 3.0 final
Congratulations All! This is a huge and important milestone for PyTables, and I am glad to have been a part of it!

Be Well
Anthony

On Sat, Jun 1, 2013 at 6:33 AM, Antonio Valentino antonio.valent...@tiscali.it wrote:

===========================
Announcing PyTables 3.0.0
===========================

We are happy to announce PyTables 3.0.0.

PyTables 3.0.0 comes about 5 years after the last major release (2.0) and 7 months since the last stable release (2.4.0). This is a new major release and an important milestone for the PyTables project since it provides the long-awaited support for Python 3.x, which has been around for 4 years. Almost all of the core numeric/scientific packages for Python already support Python 3, so we are very happy that now PyTables can also provide this important feature.

What's new
==========

A short summary of the main new features:

- Since this release, PyTables now provides full support for Python 3.
- The entire code base is now more compliant with the coding style guidelines described in PEP 8.
- Basic support for HDF5 drivers. It is now possible to open/create an HDF5 file using one of the SEC2, DIRECT, LOG, WINDOWS, STDIO or CORE drivers.
- Basic support for in-memory image files. An HDF5 file can be set from or copied into a memory buffer.
- Implemented methods to get/set the user block size in an HDF5 file.
- All read methods now have an optional *out* argument that allows a pre-allocated array to be passed in to store data.
- Added support for floating point data types with extended precision (Float96, Float128, Complex192 and Complex256).
- Consistent ``create_xxx()`` signatures. Now it is possible to create all data sets (Array, CArray, EArray, VLArray, and Table) from existing Python objects.
- Complete rewrite of the `nodes.filenode` module. Now it is fully compliant with the interfaces defined in the standard `io` module. Only non-buffered binary I/O is supported currently.

Please refer to the RELEASE_NOTES document for a more detailed list of changes in this release. As always, a large number of bugs have been addressed and squashed as well. In case you want to know in more detail what has changed in this version, please refer to: http://pytables.github.io/release_notes.html

You can download a source package with generated PDF and HTML docs, as well as binaries for Windows, from: http://sourceforge.net/projects/pytables/files/pytables/3.0.0

For an online version of the manual, visit: http://pytables.github.io/usersguide/index.html

What it is?
===========

PyTables is a library for managing hierarchical datasets, designed to efficiently cope with extremely large amounts of data, with support for full 64-bit file addressing. PyTables runs on top of the HDF5 library and the NumPy package to achieve maximum throughput and convenient use. PyTables includes OPSI, a new indexing technology, allowing data lookups in tables exceeding 10 gigarows (10**10 rows) to be performed in less than a tenth of a second.

Resources
=========

About PyTables: http://www.pytables.org
About the HDF5 library: http://hdfgroup.org/HDF5/
About NumPy: http://numpy.scipy.org/

Acknowledgments
===============

Thanks to the many users who provided feature improvements, patches, bug reports, support and suggestions. See the ``THANKS`` file in the distribution package for an (incomplete) list of contributors. Most specially, a lot of kudos go to the HDF5 and NumPy makers. Without them, PyTables simply would not exist.

Share your experience
=====================

Let us know of any bugs, suggestions, gripes, kudos, etc. you may have.
**Enjoy data!**

-- The PyTables Developers
Re: [Pytables-users] How much extra metadata does PyTables insert?
On Sun, May 26, 2013 at 11:04 AM, Nolan Phillips ncphillips...@gmail.com wrote:

Hi, I have a question about the metadata that PyTables inserts into the HDF5 files. Is this data stored in the files themselves, but just not user defined? The important question is: does this metadata make the HDF5 files inaccessible by other means, such as the standard C library or h5py?

Hi Nolan,

The PyTables-specific metadata is for PyTables (and ViTables) consumption only and does not (or should not) interfere with other methods of HDF5 consumption. Since PyTables and h5py both link to the hdf5 library, I have never had any interoperability problems.

Be Well
Anthony

Thanks!
Nolan
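[Editor's note: a quick way to see this for yourself is to open a PyTables-written file with h5py and look at the extra attributes. The node name below is hypothetical; attribute names such as TITLE and CLASS are the kind of metadata PyTables typically attaches.]

    import h5py

    with h5py.File('made_by_pytables.h5', 'r') as f:
        dset = f['/mytable']     # hypothetical node name
        print(dict(dset.attrs))  # PyTables metadata, e.g. TITLE, CLASS
        data = dset[:]           # a plain HDF5 read; nothing special needed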
Re: [Pytables-users] Ideas for effective linear ND interpolation?
[dropping scipy-user]

Hello Andreas,

PyTables is a great option, and using compression (zlib, blosc, etc.) will probably help. Additionally, I would note that since your values are between [0, 100], you can probably get away with using 32-bit floats rather than 64-bit floats. This size reduction will speed things up, but you probably don't want to go down to 16-bit floats.

I would recommend that you store your dataset on disk and then use PyTables expressions [1,2] with the out argument to keep your results on disk as well. If this strategy fails because you need to simultaneously look at multiple indexes in the same array, then I would use partially offset iterators as described in this thread [3]. In both cases, since iterators are automatically chunked, you never read in the whole dataset at one time, and what you are interpolating can be as large as you want :)

Let us know if you have further specific questions.

Be Well
Anthony

1. http://pytables.github.io/usersguide/libref.html#the-expr-class-a-general-purpose-expression-evaluator
2. https://github.com/scopatz/hdf5-is-for-lovers/blob/master/hdf5-is-for-lovers.pdf?raw=true
3. Nested Iteration of HDF5 using PyTables http://blog.gmane.org/gmane.comp.python.pytables.user/month=20130101

On Fri, May 10, 2013 at 4:58 AM, Andreas Hilboll li...@hilboll.de wrote:

Hi, I'll have to code multilinear interpolation in n dimensions, n~7. My data space is quite large, ~10**9 points. The values are given on a rectangular (but not square) grid. The values are numbers in a range of approx. [0.0, 100.0]. The challenge is to do this efficiently, and it would be great if the whole thing were able to run fast on a machine with only 8G (or better, 4G) RAM. A common task will be to interpolate 10**6 points, which shouldn't take too long.

Any ideas on how to do this efficiently are welcome:

* which dtype to use?
* is using pytables/blosc an option? How can this be integrated in the interpolation?
* you name it ... ;)

Cheers, Andreas.
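[Editor's note: a minimal sketch of the disk-to-disk pattern described above, using tables.Expr with an on-disk output node; the node names and the expression itself are placeholders.]

    import tables

    with tables.open_file('interp.h5', 'a') as h5f:
        a = h5f.root.a  # large float32 arrays already on disk
        b = h5f.root.b
        out = h5f.create_carray(h5f.root, 'out', tables.Float32Atom(),
                                shape=a.shape)
        expr = tables.Expr('0.5 * a + 0.5 * b')  # picks up a and b from scope
        expr.set_output(out)  # results stream to disk chunk by chunk
        expr.eval()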
Re: [Pytables-users] Row.append()
On Fri, May 3, 2013 at 1:15 PM, Jim Knoll jim.kn...@spottradingllc.com wrote:

I am trying to make this better / faster... Data comes faster than I can store it on one box. So my thought was to have many boxes, each storing their own part in their own table. Later I would concatenate the tables together with something like this:

    dest_h5f = pt.openFile(path + 'big_mater.h5', 'a')
    for source_path in source_h5_path_list:
        h5f = pt.openFile(source_path, 'r')
        for node in h5f.root:
            dest_table = dest_h5f.getNode('/', name=node.name)
            print node.nrows
            # found I needed to limit the max size or I would crash
            if node.nrows > 0 and node.nrows < 100:
                dest_table.append(node.read())
                dest_table.flush()
        h5f.close()
    dest_h5f.close()

I could add the logic to iterate in chunks over the source data to overcome the crash, but I suspect there could be a better way.

Hi Jim,

You can just iterate over each row in the table (i.e., for row in node). This is slow, but would solve the problem.

Take a table in one h5 file and append it to a table in another h5 file. Looked like Table.copy() would do the trick, but I don't see how to get it to append to an existing table.

You could append directly by using the append_where() method with the condition 'True' to append the whole table. This will automatically do the chunking for you.

Be Well
Anthony

My h5 files have 4 rec arrays, all stored in root.

Any suggestions?

--
Jim Knoll
DBA/Developer II
Spot Trading L.L.C
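[Editor's note: a sketch of the chunked variant Jim alludes to, which never pulls a whole source table into memory. The paths and chunk size are hypothetical, and the 2.x-style method names follow his code.]

    import tables as pt

    chunk = 100000  # rows per append; tune to taste
    dest_h5f = pt.openFile('big_master.h5', 'a')  # hypothetical paths
    h5f = pt.openFile('source_part.h5', 'r')
    for node in h5f.root:
        dest_table = dest_h5f.getNode('/', name=node.name)
        for start in range(0, node.nrows, chunk):
            stop = min(start + chunk, node.nrows)
            dest_table.append(node.read(start, stop))
        dest_table.flush()
    h5f.close()
    dest_h5f.close()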
Re: [Pytables-users] ANN: PyTables 3.0 beta1
Whoo hoo! Thanks for all of your hard work Antonio!

PyTables users, we'd really appreciate it if you could try out this beta release and run the test suite:

    $ python -c "import tables as tb; tb.test()"

And let us know if there are any issues. Additionally, if you are feeling brave, any help you can give closing out the last remaining issues [1] would be great!

Be Well
Anthony

1. https://github.com/PyTables/PyTables/issues?milestone=4&state=open

On Sat, Apr 27, 2013 at 6:51 AM, Antonio Valentino antonio.valent...@tiscali.it wrote:

=============================
Announcing PyTables 3.0.0b1
=============================

We are happy to announce PyTables 3.0.0b1.

PyTables 3.0.0b1 comes about 5 years after the last major release (2.0) and 7 months since the last stable release (2.4.0). This is a new major release and an important milestone for the PyTables project since it provides the long-awaited support for Python 3.x, which has been around for 4 years now. Almost all the main numeric/scientific packages for Python already support Python 3, so we are very happy that now PyTables can also provide this important feature.

What's new
==========

A short summary of the main new features:

- Since this release, PyTables provides full support for Python 3.
- The entire code base is now more compliant with the coding style guidelines described in PEP 8.
- Basic support for HDF5 drivers. Now it is possible to open/create an HDF5 file using one of the SEC2, DIRECT, LOG, WINDOWS, STDIO or CORE drivers.
- Basic support for in-memory image files. An HDF5 file can be set from or copied into a memory buffer.
- Implemented methods to get/set the user block size in an HDF5 file.
- All read methods now have an optional *out* argument that allows a pre-allocated array to be passed in to store data.
- Added support for floating point data types with extended precision (Float96, Float128, Complex192 and Complex256).

Please refer to the RELEASE_NOTES document for a more detailed list of changes in this release. As always, a large number of bugs have been addressed and squashed as well. In case you want to know in more detail what has changed in this version, please refer to: http://pytables.github.io/release_notes.html

You can download a source package with generated PDF and HTML docs, as well as binaries for Windows, from: http://sourceforge.net/projects/pytables/files/pytables/3.0.0b1

For an online version of the manual, visit: http://pytables.github.io/usersguide/index.html

What it is?
===========

PyTables is a library for managing hierarchical datasets, designed to efficiently cope with extremely large amounts of data, with support for full 64-bit file addressing. PyTables runs on top of the HDF5 library and the NumPy package to achieve maximum throughput and convenient use. PyTables includes OPSI, a new indexing technology, allowing data lookups in tables exceeding 10 gigarows (10**10 rows) to be performed in less than a tenth of a second.

Resources
=========

About PyTables: http://www.pytables.org
About the HDF5 library: http://hdfgroup.org/HDF5/
About NumPy: http://numpy.scipy.org/

Acknowledgments
===============

Thanks to the many users who provided feature improvements, patches, bug reports, support and suggestions. See the ``THANKS`` file in the distribution package for an (incomplete) list of contributors. Most specially, a lot of kudos go to the HDF5 and NumPy makers. Without them, PyTables simply would not exist.

Share your experience
=====================

Let us know of any bugs, suggestions, gripes, kudos, etc. you may have.
**Enjoy data!**

-- The PyTables Team
Re: [Pytables-users] ANN: numexpr 2.1 (Python 3 support is here!)
Congrats Francesc!

On Sat, Apr 27, 2013 at 5:07 AM, Francesc Alted fal...@gmail.com wrote:

========================
Announcing Numexpr 2.1
========================

Numexpr is a fast numerical expression evaluator for NumPy. With it, expressions that operate on arrays (like "3*a+4*b") are accelerated and use less memory than doing the same calculation in Python.

It sports multi-threaded capabilities, as well as support for Intel's VML library (included in Intel MKL), which allows an extremely fast evaluation of transcendental functions (sin, cos, tan, exp, log...) while squeezing the last drop of performance out of your multi-core processors. Its only dependency is NumPy (MKL is optional), so it works well as an easy-to-deploy, easy-to-use computational kernel for projects that don't want to adopt other solutions that require heavier dependencies.

What's new
==========

The main feature of this version is that it adds much needed **compatibility with Python 3**. Many thanks to Antonio Valentino for his fine work on this. Also, Christoph Gohlke quickly provided feedback and binaries for Windows, and Mark Wiebe and Gaëtan de Menten provided many small (but important!) fixes and improvements. All of you made numexpr 2.1 the best release ever. Thanks!

In case you want to know in more detail what has changed in this version, see: http://code.google.com/p/numexpr/wiki/ReleaseNotes or have a look at RELEASE_NOTES.txt in the tarball.

Where can I find Numexpr?
=========================

The project is hosted at Google Code: http://code.google.com/p/numexpr/

You can get the packages from PyPI as well: http://pypi.python.org/pypi/numexpr

Share your experience
=====================

Let us know of any bugs, suggestions, gripes, kudos, etc. you may have.

Enjoy data!

Francesc Alted
Re: [Pytables-users] ANN: PyTables 3.0 beta1
On Sat, Apr 27, 2013 at 2:26 PM, Andreas Hilboll li...@hilboll.de wrote:

On 27.04.2013 19:42, Anthony Scopatz wrote:

On Sat, Apr 27, 2013 at 12:35 PM, Andreas Hilboll li...@hilboll.de wrote:

On 27.04.2013 19:17, Anthony Scopatz wrote:

Whoo hoo! Thanks for all of your hard work Antonio! PyTables users, we'd really appreciate it if you could try out this beta release and run the test suite:

    $ python -c "import tables as tb; tb.test()"

And let us know if there are any issues. Additionally, if you are feeling brave, any help you can give closing out the last remaining issues [1] would be great!

    $ virtualenv --system-site-packages .virtualenvs/pytables-test
    (pytables-test) $ python -c "import tables; tables.test()"
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "tables/__init__.py", line 82, in <module>
        from tables.utilsextension import (get_pytables_version, get_hdf5_version,
    ImportError: No module named utilsextension

It seems like you didn't compile and install PyTables first. So to be more clear:

    ~ $ cd pytables
    ~/pytables $ python setup.py install
    ~/pytables $ cd ..
    ~ $ python -c "import tables; tables.test()"

Be Well
Anthony

--
Andreas.

Sorry, I didn't write down that line. Actually, I did compile and install pytables using python setup.py install from within the virtualenv. The problem was that I ran that command from within the installation directory, so that `import tables` didn't import the installed version. I keep making that mistake with every project at least twice :-/ When you try to do that in scipy, it gives a warning. Maybe it would be a good idea to do this in pytables as well?

That is a good idea!

The tests all ran well:

Glad they passed!

Be Well
Anthony

    $ python -c "import tables as tb; tb.test()"
    -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    PyTables version:  3.0.0b1
    HDF5 version:      1.8.4-patch1
    NumPy version:     1.6.1
    Numexpr version:   1.4.2 (not using Intel's VML/MKL)
    Zlib version:      1.2.3.4 (in Python interpreter)
    BZIP2 version:     1.0.6 (6-Sept-2010)
    Blosc version:     1.2.1-rc1 (2013-04-24)
    Cython version:    0.15.1
    Python version:    2.7.3 (default, Aug 1 2012, 05:14:39) [GCC 4.6.3]
    Platform:          linux2-x86_64
    Byte-ordering:     little
    Detected cores:    2
    Default encoding:  ascii
    -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    [...]
    Ran 5242 tests in 493.636s

    OK

and

    $ python -c "import tables as tb; tb.test()"
    -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    PyTables version:  3.0.0b1
    HDF5 version:      1.8.4-patch1
    NumPy version:     1.6.1
    Numexpr version:   2.1 (not using Intel's VML/MKL)
    Zlib version:      1.2.3.4 (in Python interpreter)
    BZIP2 version:     1.0.6 (6-Sept-2010)
    Blosc version:     1.2.1-rc1 (2013-04-24)
    Cython version:    0.19
    Python version:    3.2.3 (default, Oct 19 2012, 20:10:41) [GCC 4.6.3]
    Platform:          linux2-x86_64
    Byte-ordering:     little
    Detected cores:    2
    Default encoding:  utf-8
    -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    [...]
    Ran 5217 tests in 526.794s

    OK

--
Andreas.
Re: [Pytables-users] In-kernel searches not returning values?
Hello Giovanni!

This definitely seems like a bug. How was the column indexed? Could you send a sample script that reproduces the problem from start to finish? Thanks.

Be Well
Anthony

On Fri, Apr 26, 2013 at 6:14 PM, Giovanni Luca Ciampaglia glciamp...@gmail.com wrote:

Hi, I am new to PyTables and I like it very much, though there are still some problems I am trying to solve. The latest is that I am seeing strange behavior when using in-kernel searches. The search condition is a simple equality test on a single column. Basically, when the column is indexed, in-kernel searches don't return the expected result, that is:

    In [150]: [ row['visits'] for row in ap.where('rid == 665689') ]
    Out[150]: []

    In [151]: [ row['visits'] for row in ap if row['rid'] == 665689 ]
    Out[151]: [18L]

When I remove the index, it works again:

    In [153]: ap.cols.rid.removeIndex()

    In [154]: [ row['visits'] for row in ap.where('rid == 665689') ]
    Out[154]: [18L]

Am I doing something wrong? This is an excerpt of the contents of the file:

    % h5ls -ld test.h5 | head
    AllPages                 Dataset {529000/Inf}
        Data:
            (0) {year=2008, month=1, day=1, hour=0, minute=0, epoch=1199145600, rid=665689, visits=18},
            (1) {year=2008, month=1, day=1, hour=0, minute=0, epoch=1199145600, rid=2, visits=11},
            (2) {year=2008, month=1, day=1, hour=0, minute=0, epoch=1199145600, rid=12, visits=1},
            (3) {year=2008, month=1, day=1, hour=0, minute=0, epoch=1199145600, rid=612075, visits=8},

And this is the table description:

    Out[152]:
    /AllPages (Table(529000,), shuffle, zlib(5)) ''
      description := {
      "year": UInt16Col(shape=(), dflt=0, pos=0),
      "month": UInt8Col(shape=(), dflt=0, pos=1),
      "day": UInt8Col(shape=(), dflt=0, pos=2),
      "hour": UInt8Col(shape=(), dflt=0, pos=3),
      "minute": UInt8Col(shape=(), dflt=0, pos=4),
      "epoch": UInt32Col(shape=(), dflt=0, pos=5),
      "rid": UInt32Col(shape=(), dflt=0, pos=6),
      "visits": UInt32Col(shape=(), dflt=0, pos=7)}
      byteorder := 'little'
      chunkshape := (233016,)
      autoIndex := True
      colindexes := {
        "rid": Index(1, light, shuffle, zlib(1)).is_CSI=False}

Thanks!

--
Giovanni Luca Ciampaglia
Postdoctoral fellow
Center for Complex Networks and Systems Research
Indiana University
✎ 910 E 10th St ∙ Bloomington ∙ IN 47408
☞ http://cnets.indiana.edu/
✉ gciam...@indiana.edu
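[Editor's note: while the root cause gets investigated, a common sanity check is to drop and rebuild the index and re-run the query; a sketch in the 2.x-style API the thread uses, assuming `ap` is the table above.]

    ap.cols.rid.removeIndex()
    ap.cols.rid.createIndex()  # or createCSIndex() for a completely sorted index
    ap.flushRowsToIndex()
    rows = [r['visits'] for r in ap.where('rid == 665689')]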
Re: [Pytables-users] sourceforge downloads corrupted?
Hey Matt, is this related? https://github.com/PyTables/PyTables/issues/223

Be Well
Anthony

On Wed, Apr 24, 2013 at 3:09 PM, Matt Terry matt.te...@gmail.com wrote:

Hello, The source tarball for pytables 2.4 on sourceforge appears to be broken. The file size is suspiciously small (800 kB vs 8.5 MB on PyPI), the tarball doesn't untar, and the md5 doesn't match.

-matt
Re: [Pytables-users] Documentation for stable releases?
Hello Gaëtan,

Thanks for bringing this up, and I think that older versions of the docs are a fairly important thing to have. I have opened an issue for this on github [1]. However, I doubt that I will have an opportunity to take care of this in the short term. So if you want to take care of this issue for the benefit of yourself and all, I would love to see a pull request ;)

Be Well
Anthony

1. https://github.com/PyTables/PyTables/issues/236

On Mon, Apr 22, 2013 at 6:09 AM, Gaëtan de Menten gdemen...@gmail.com wrote:

Hello all,

TL;DR: It would be nice to have online documentation for stable versions and have pytables.github.io point to the doc for the latest stable release by default.

I just tried to use the new out= argument to table.read, only to find out it did not work in my version (2.3.1). Then I tried to update my version to 2.4, since I thought the new argument was implemented in that version because of the "2.4.0+1.dev" name at the top of the page, which I took to mean a dev version leading to 2.4, or maybe to 2.4.1, but certainly not the next major release. I got even more confused because, after the initial failure with my 2.3.1 release, I checked the release notes... which I thought were for 2.4 because the title of the release notes page is "Release notes for PyTables 2.4 series" when it is in fact for the next major version...

Here are a couple of suggestions:

* doc for stable releases (default to latest stable); bonus points to be able to switch easily from one version to another, a-la the Python stdlib.
* change "2.4.0+1.dev" to "3.0-dev" or "3.0-pre", and all mentions of 2.4.x.
* have new arguments to functions documented in the docstring for the functions (like in the Python stdlib): "new in pytables 3.0" in the docstring for table.read() would have worked wonders.

Thanks in advance,

--
Gaëtan de Menten
Re: [Pytables-users] Call for help: PyTables 3.0 release (w/ Python 3.x support)
Hello Thadeus,

Thanks for posting this PR! Once it is fixed for Python 3, we'd love to see it merged in.

Be Well
Anthony

On Mon, Apr 22, 2013 at 2:52 PM, Thadeus Burgess thade...@thadeusb.com wrote:

Hopefully this pull request can be included in the next version? It is keeping us from using the CSI functionality of PyTables.

https://github.com/PyTables/PyTables/pull/238

--
Thadeus

On Tue, Apr 16, 2013 at 4:10 PM, Anthony Scopatz scop...@gmail.com wrote:

Hello PyTables Users,

To let you know, we are hoping to do a PyTables 3.0-beta release here in the next week or two. This will include the long-awaited Python 3 support, thanks to the heroic efforts of Antonio Valentino, who did the lion's share of the porting work for both PyTables AND one of our dependencies, numexpr.

However, to really make this release the best possible, we are asking for your help in cleaning up and closing some of the remaining issues. You can see our list of open issues for this release here [1]. You can also see our todo list for this release here [2].

*If you have a feature that you'd really love to see make it into the code base, now is the time to implement it.*

If you have always wanted to contribute but weren't sure how to get going, please fork the repo on github and then issue a pull request. If you have any questions about this process, feel free to ask in this thread.

Here is to a great next release!

The PyTables Developers

1. https://github.com/PyTables/PyTables/issues?milestone=4&state=open
2. https://github.com/PyTables/PyTables/wiki/NextReleaseTodo
Re: [Pytables-users] Row.append() performance
Hello Shyam, Can you please post the full traceback? In any event, I am fairly certain that this error is coming from the np.fromiter step. The problem here is that you are trying to read your entire SQL query into a single numpy array in memory. This is impossible because you don't have enough RAM. Therefore, you are going to need to read and write in chunks. Something like the following:

def getDataAndWriteHDF5(table):
    databaseConn = pyodbc.connect('<connection string>', password)
    cursor = databaseConn.cursor()
    cursor.execute('<SQL query>')
    dt = np.dtype([('name', np.str_, 180), ('address', np.str_, 4200),
                   ('email', np.str_, 180), ('phone', np.str_, 256)])
    citer = iter(cursor)
    chunksize = 4096  # This is just a guess, other values might work better
    crange = range(chunksize)
    while True:
        resultSet = np.fromiter((tuple(row) for i, row in zip(crange, citer)), dtype=dt)
        table.append(resultSet)
        if len(resultSet) < chunksize:
            break

You may want to tweak some things, but that is the basic strategy. Be Well Anthony On Mon, Apr 15, 2013 at 10:16 PM, Shyam Parimal Katti spk...@nyu.edu wrote: Hello Anthony, Thank you for your suggestions. When I mentioned that I am reading the data from database, I meant a DB2 database, not an HDF5 database/file. I followed your suggestions, so the code looks as follows:

def createHDF5File():
    h5File = tables.openFile('<file name>', mode='a')
    h5File.createTable(h5File.root, 'Contact', Contact, 'Contact', expectedrows=700)
    ...

def getDataAndWriteHDF5(table):
    databaseConn = pyodbc.connect('<connection string>', password)
    cursor = databaseConn.cursor()
    cursor.execute('<SQL query>')
    resultSet = np.fromiter((tuple(row) for row in cursor),
                            dtype=[('name', np.str_, 180), ('address', np.str_, 4200),
                                   ('email', np.str_, 180), ('phone', np.str_, 256)])
    table.append(resultSet)

Error message: MemoryError: cannot allocate array memory. I am setting the `expectedrows` parameter when creating the table in the HDF5 file, and yet encounter the error above. Looking forward to suggestions. Hello Anthony, Thank you for replying back with suggestions. In response to your suggestions, I am *not reading the data from a file in the first step, but instead a database*. Hello Shyam, To put too fine a point on it, hdf5 databases are files. And reading from any kind of file incurs the same disk read overhead. I did try out your 1st suggestion of doing a table.append(list of tuples), which took a little more than the execution time I got with the original code. Can you please guide me in how to chunk the data (that I got from the database and stored as a list of tuples in Python)? Ahh, so you should not be using lists of tuples. These are Pythonic types, and conversion between HDF5 types and Python types is what is slowing you down. You should be passing a numpy structured array into append(). Numpy types are very similar to (and often exactly the same as) HDF5 types. For large, continuous, structured data you want to avoid the Python interpreter as much as possible. Use Python here as the glue code to compose a series of fast operations using the APIs exposed by numpy, pytables, etc. Be Well Anthony On Thu, Apr 11, 2013 at 6:16 PM, Shyam Parimal Katti spk...@nyu.edu wrote: Hello Anthony, Thank you for replying back with suggestions. In response to your suggestions, I am *not reading the data from a file in the first step, but instead a database*. I did try out your 1st suggestion of doing a table.append(list of tuples), which took a little more than the execution time I got with the original code.
Can you please guide me in how to chunk the data (that I got from the database and stored as a list of tuples in Python)? Thanks, Shyam Hi Shyam, The pattern that you are using to write to a table is basically one for writing Python data to HDF5. However, your data is already in a machine / HDF5 native format. Thus what you are doing here is an excessive amount of work: read data from file - convert to Python data structures - convert back to HDF5 data structures - write to file. When reading from a table you get back a numpy structured array (look them up on the numpy website). Then instead of using rows to write back the data, just use Table.append() [1], which lets you pass in a bunch of rows simultaneously. (Note that your data in this case is too large to fit into memory, so you may have to split it up into chunks or use the new iterators which are in the development branch.) Additionally, if all you are doing is copying a table wholesale, you should use Table.copy() [2]. Or if you only want to copy some subset based on a conditional you provide, use whereAppend() [3]. Finally, if you want to do math or evaluate expressions on the table data, take a look at the Expr class.
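Putting the advice above together, a minimal sketch of the structured-array append pattern (the file name, table name, and dtype below are invented for illustration):

    import numpy as np
    import tables as tb

    # the dtype doubles as the table description; it must match the rows exactly
    dt = np.dtype([('name', 'S16'), ('value', np.float64)])
    rows = np.array([('a', 1.0), ('b', 2.0)], dtype=dt)

    with tb.openFile('example.h5', 'w') as f:
        table = f.createTable('/', 'data', dt)  # a numpy dtype is accepted as a description
        table.append(rows)  # many rows in one call, no per-row Python conversion
        table.flush()

Appending one structured-array chunk at a time keeps memory bounded while still avoiding the per-row Python overhead discussed above.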
Re: [Pytables-users] Some method like a table.readWhereSorted
Thanks for bringing this up, Julio. Hmm, I don't think that this exists currently, but since there are readWhere() and readSorted() it shouldn't be too hard to implement. I have opened issue #225 to this effect. Pull requests welcome! https://github.com/PyTables/PyTables/issues/225 Be Well Anthony On Wed, Apr 10, 2013 at 1:02 PM, Dr. Louis Wicker louis.wic...@noaa.gov wrote: I am also interested in this capability, if it exists in some way... Lou On Apr 10, 2013, at 12:35 PM, Julio Trevisan juliotrevi...@gmail.com wrote: Hi, Is there a way that I could have the ability of readWhere (i.e., specify a condition, and get a fast result) but also use a CSIndex so that the rows come back sorted in a particular order? I checked readSorted() but it is iterative and does not allow specifying a condition. Julio | Dr. Louis J. Wicker | NSSL/WRDD Rm 4366 | National Weather Center | 120 David L. Boren Boulevard, Norman, OK 73072 | E-mail: louis.wic...@noaa.gov
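In the meantime, when the matching subset fits in memory, one workaround is to read with a condition and then sort the resulting structured array (a sketch; the table, condition, and column names here are hypothetical):

    rows = table.readWhere('price > 100.0')  # structured array of matching rows
    rows.sort(order='timestamp')             # in-memory sort on the desired column

For selections too large for memory, iterating with readSorted()/itersorted() and testing the condition per row would be the other direction to try.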
Re: [Pytables-users] ReadWhere() with a Time64Col in the condition
On Wed, Apr 10, 2013 at 7:44 AM, Julio Trevisan juliotrevi...@gmail.com wrote: Hi, I am using a Time64Col called timestamp in a condition, and I noticed that the condition does not work (i.e., no rows are selected) if I write something like: for row in node.where('timestamp == %f' % t): ... However, I had the idea of dividing the values by, say, 1000, and that does work: for row in node.where('timestamp/1000 == %f' % (t/1000)): ... However, this doesn't seem to be an elegant solution. Could someone please point out a better one? Hello Julio, While this may not be the most elegant solution, it is probably one of the most appropriate. The problem here likely stems from the fact that floating point numbers (which are how Time64Cols are stored) are not exact representations of the desired value. For example: In [1]: 1.1 + 2.2 Out[1]: 3.3000000000000003 So when you divide by some constant order of magnitude, you are chopping off the error associated with floating point precision. You are creating a bin of this constant's size around the target value that is close enough to count as equivalent. There are other mechanisms for alleviating this issue: dividing and multiplying back, (x/10)*10 == y; right shifting (platform dependent); or taking the difference and requiring it to be less than some tolerance, x - y <= t. You get the idea. You have to mitigate this effect somehow. For more information please refer to: http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html Could this be related to the fact that my column name is timestamp? I ask this because I use a program called HDFView to browse the HDF5 file. This program refuses to show the first column when it is called timestamp, but shows it when it is called id. I don't know if the facts are related or not. This is probably unrelated. Be Well Anthony I don't know if this is useful information, but the conversion of a typical t to string gives something like this: print '%f' % t 1365597435.000000
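For instance, a small tolerance window around the target time sidesteps exact float equality entirely (a sketch; eps is a placeholder you would tune to your data's precision):

    eps = 1e-3  # tolerance in seconds
    cond = '(timestamp > %f) & (timestamp < %f)' % (t - eps, t + eps)
    for row in node.where(cond):
        print row['timestamp']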
Re: [Pytables-users] ReadWhere() with a Time64Col in the condition
On Wed, Apr 10, 2013 at 11:40 AM, Julio Trevisan juliotrevi...@gmail.com wrote: Hi Anthony, Thanks again. If it is a problem related to floating-point precision, I might use an Int64Col instead, since I don't need the timestamp milliseconds. Another good plan, since integers are exact ;) Julio
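A sketch of that integer-timestamp approach (the class and column names here are made up, and it assumes whole-second UNIX timestamps are sufficient):

    import tables as tb

    class Quote(tb.IsDescription):
        timestamp = tb.Int64Col()  # whole seconds; integer equality is exact
        price = tb.Float64Col()

    # later, an exact match behaves as expected:
    rows = table.readWhere('timestamp == %d' % int(t))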
Re: [Pytables-users] Reading single column from table
On Fri, Mar 22, 2013 at 7:11 AM, Julio Trevisan juliotrevi...@gmail.com wrote: Hi, I just joined this list. I am using PyTables for my project and it works great and fast. I am just trying to optimize some parts of the program, and I noticed that zipping the tuples to get one tuple per column takes much longer than reading the data itself. The thing is that readWhere() returns one tuple per row, whereas I need one tuple per column, so I have to use the zip() function to achieve this. Is there a way to skip this zip() operation? Please see below:

def quote_GetData(self, period, name, dt1, dt2):
    """Returns timedata.Quotes object.

    Arguments:
    period -- value from within infogetter.QuotePeriod
    name -- quote symbol
    dt1, dt2 -- datetime.datetime or timestamp values
    """
    t = time.time()
    node = self.quote_GetNode(period, name)
    ts1 = misc.datetime2timestamp(dt1)
    ts2 = misc.datetime2timestamp(dt2)
    L = node.readWhere(
        '(timestamp/1000 >= %f) & (timestamp/1000 <= %f)' %
        (ts1/1000, ts2/1000))
    rowNum = len(L)
    Q = timedata.Quotes()
    print '%s: took %f seconds to do everything else' % (name, time.time()-t)
    t = time.time()
    if rowNum > 0:
        (Q.timestamp, Q.open, Q.close, Q.high, Q.low, Q.volume,
         Q.numTrades) = zip(*L)
        print '%s: took %f seconds to ZIP' % (name, time.time()-t)
    return Q

*And the printout:* BOVESPA.VISTA.PETR4: took 0.068788 seconds to do everything else BOVESPA.VISTA.PETR4: took 0.379910 seconds to ZIP Hi Julio, The problem here isn't zip (packing and un-packing are generally fast operations -- they happen *all* the time in Python). Nor is the problem specifically with PyTables. Rather, this is an issue with how you are using numpy structured arrays (look them up). Basically, this is slow because you are creating a list of column tuples where every element is a Python object of the corresponding type. For example, upcasting every 32-bit integer to a Python int is very expensive! What you *should* be doing is keeping the columns as numpy arrays, which keeps the memory layout small, continuous, and fast, and if done right does not require a copy (which you are doing now). The value of L here is a structured array. So say I have some other structured array x with 4 fields; the right way to do this is to pull out each field individually by indexing: a, b, c, d = x['a'], x['b'], x['c'], x['d'] or more generally (for all fields): a, b, c, d = map(lambda name: x[name], x.dtype.names) or for some list of fields: a, c, b = map(lambda name: x[name], ['a', 'c', 'b']) Timing both your original method and the new one gives: In [47]: timeit a, b, c, d = zip(*x) 1000 loops, best of 3: 1.3 ms per loop In [48]: timeit a, b, c, d = map(lambda name: x[name], x.dtype.names) 100000 loops, best of 3: 2.3 µs per loop So the method I propose is 500x-1000x faster. Using numpy idiomatically is very important! Be Well Anthony
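Applied to quote_GetData() above, the zip step reduces to per-field indexing on the structured array returned by readWhere() (a sketch; it assumes the Quotes attributes are named exactly like the table columns):

    if rowNum > 0:
        for colname in L.dtype.names:        # 'timestamp', 'open', 'close', ...
            setattr(Q, colname, L[colname])  # each field stays a numpy array; no Python-object upcasting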
Re: [Pytables-users] Writing to CArray
On Sun, Mar 10, 2013 at 8:47 PM, Tim Burgess timburg...@mac.com wrote: Thanks Anthony for being so responsive and touching on a number of points. The netCDF library gives me a masked array, so I have to explicitly transform that into a regular numpy array. Ahh interesting. Depending on the netCDF version the file was made with, you should be able to read the file directly from PyTables. You could thus directly get a normal numpy array. This *should* be possible, but I have never tried it ;) I've looked under the covers and have seen that the ma masked implementation is all pure Python, and so there is a performance drawback. I'm not up to speed yet on where the numpy.na masking implementation is (started a new job here). I tried to do an implementation in memory (except for the final write) and found that I have about 2GB of indices when I extract the quality indices. Simply using those indexes, memory usage grows to over 64GB, and I eventually run out of memory and start churning away in swap. For the moment, I have pulled down the latest git master and am using the new in-memory HDF feature. This seems to give better performance and is code-wise pretty simple, so for the moment it's good enough. Awesome! I am glad that this is working for you. Cheers and thanks again, Tim BTW I viewed your SciPy tutorial. Good stuff! Thanks!
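For reference, the in-memory feature Tim mentions is built on HDF5's core driver; in the development branch it is reachable roughly like this (a sketch; the exact keyword spelling may differ between versions):

    import tables
    # keep the whole file in RAM while working with it
    h5f = tables.openFile('inmem.h5', 'w', driver='H5FD_CORE')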
Re: [Pytables-users] Writing to CArray
Hey Tim, Awesome dataset! And neat image! As per your request, a couple of minor things I noticed were that you probably don't need to do the sanity check each time (great for debugging, but not needed always); you are using masked arrays, which, while sometimes convenient, are generally slower than creating an array, a mask, and applying the mask to the array; and you seem to be downcasting from float64 to float32 for some reason that I am not entirely clear on (size, speed?). To the more major question of write performance, one thing that you could try is compression (http://pytables.github.com/usersguide/optimization.html#compression-issues). You might want to do some timing studies to find the best compressor and level. Performance here can vary a lot based on how similar your data is (and how close similar data is to each other). If you have got a bunch of zeros and only a few real data points, even zlib level 1 is going to be blazing fast compared to writing all those zeros out explicitly. Another thing you could try is switching to EArray and using the append() method. This might save PyTables, numpy, hdf5, etc. from having to check that the shape of sst_node[qual_indices] is actually the same as the data you are giving it. Additionally, dumping a block of memory to the file directly (via append()) is generally faster than having to resolve fancy indexes (which are notoriously the slow part of even numpy). Lastly, as a general comment, you seem to be doing a lot of stuff in the innermost loop -- including writing to disk. I would look at how you could restructure this to move as much as possible out of this loop. Your data seems to be about 12 GB for a year, so this is probably too big to build up the full sst array completely in memory prior to writing. That is, unless you have a computer much bigger than my laptop ;). But issuing one fat write command is probably going to be faster than making 365 of them. Happy hacking! Be Well Anthony On Wed, Mar 6, 2013 at 11:25 PM, Tim Burgess timburg...@mac.com wrote: I'm producing a large chunked HDF5 file using CArray and want to clarify that the performance I'm getting is what would normally be expected. The source data is a large annual satellite dataset - 365 days x 4320 latitude by 8640 longitude of 32-bit floats. I'm only interested in pixels of a certain quality, so I am iterating over the source data (which is in daily files) and then determining the indices of all quality pixels in that day. There are usually about 2 million quality pixels in a day. I then set the equivalent CArray locations to the value of the quality pixels. As you can see in the code below, the source numpy array is 1 x 4320 x 8640, so for addressing the CArray I simply take the first index and set it to the current day to map indices to the 365 x 4320 x 8640 CArray. I've tried a couple of different chunkshapes. As I will be reading the HDF sequentially day by day, and as the data comes from a polar orbit, I'm using a 1 x 1080 x 240 chunk to try and optimize for chunks that will have no data (and therefore reduce the total filesize). You can see an image of an example day at http://data.nodc.noaa.gov/pathfinder/Version5.2/browse_images/2011/sea_surface_temperature/20110101001735-NODC-L3C_GHRSST-SSTskin-AVHRR_Pathfinder-PFV5.2_NOAA19_G_2011001_night-v02.0-fv01.0-sea_surface_temperature.png To produce a day takes about 2.5 minutes on a Linux (Ubuntu 12.04) machine with two SSDs in RAID 0. The system has 64GB of RAM, but I don't think memory is a constraint here.
Looking at a profile, most of that 2.5 minutes is spent in _g_writeCoords in tables.hdf5Extension.Array. Here's the pertinent code:

for year in range(2011, 2012):
    # create dataset and add global attrs
    annualfile_path = '%sPF4km/V5.2/hdf/annual/PF52-%d-c1080x240-test.h5' % (crwdir, year)
    print 'Creating ' + annualfile_path
    with tables.openFile(annualfile_path, 'w', title=('Pathfinder V5.2 %d' % year)) as h5f:
        # write lat lons
        lat_node = h5f.createArray('/', 'lat', lats, title='latitude')
        lon_node = h5f.createArray('/', 'lon', lons, title='longitude')
        # glob all the region summaries in a year
        files = [glob.glob('%sPF4km/V5.2/%d/*night*' % (crwdir, year))[0]]
        print 'Found %d days' % len(files)
        files.sort()
        # create a 365 x 4320 x 8640 array
        shape = (NUMDAYS, 4320, 8640)
        atom = tables.Float32Atom(dflt=np.nan)
        # we chunk into daily slices and then further chunk days
        sst_node = h5f.createCArray(h5f.root, 'sst', atom, shape, chunkshape=(1, 1080, 240))
        for filename in files:
            # get day
            day = int(filename[-25:-22])
            print 'Processing %d day %d' % (year, day)
            ds = Dataset(filename)
            kelvin64 =
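A minimal sketch combining the two suggestions from the reply above -- compression filters plus an extendable EArray written one day at a time (the file name and parameter values are illustrative, not tuned):

    import numpy as np
    import tables

    filters = tables.Filters(complevel=1, complib='zlib')  # compare blosc/lzo levels too
    with tables.openFile('sst_test.h5', 'w') as h5f:
        atom = tables.Float32Atom(dflt=np.nan)
        # a zero-length first axis makes the array extendable day by day
        sst = h5f.createEArray(h5f.root, 'sst', atom, (0, 4320, 8640),
                               chunkshape=(1, 1080, 240), filters=filters)
        day = np.empty((1, 4320, 8640), dtype=np.float32)
        day.fill(np.nan)
        sst.append(day)  # one contiguous append per day instead of fancy indexing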
Re: [Pytables-users] checksum always verified?
I think that the checksum is on the compressed data... On Wed, Feb 27, 2013 at 2:16 PM, Frédéric Bastien no...@nouiz.org wrote: Hi, we just had some problems with our file server, and this brings up the question of how to detect corrupted files. There is a way to specify a filter when creating a table that adds a checksum [1]. My question is: when a file is created with checksums, are they always verified when the chunks are uncompressed? Can we specify when we open the file whether we want it checked or not? The examples I found only talk about it when we create the file. thanks Frédéric Bastien [1] http://pytables.github.com/usersguide/libref/helper_classes.html
Re: [Pytables-users] checksum always verified?
Sorry, I don't know. I have never used this feature. Maybe someone who has can chime in. On Wed, Feb 27, 2013 at 2:26 PM, Frédéric Bastien no...@nouiz.org wrote: That is fine with me. I just want to detect if my data got corrupted by hardware problems. Does someone know if it always gets verified? Do you know if this causes a significant speed difference? thanks Frédéric On Wed, Feb 27, 2013 at 3:21 PM, Anthony Scopatz scop...@gmail.com wrote: I think that the checksum is on the compressed data...
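For context, the checksum under discussion is HDF5's Fletcher32 filter, which PyTables enables per node through the Filters class (a sketch; whether the checksum is verified on every chunk read is exactly the open question in this thread):

    import tables

    filters = tables.Filters(complevel=5, complib='zlib', fletcher32=True)
    h5f = tables.openFile('checked.h5', 'w')
    arr = h5f.createCArray('/', 'data', tables.Float64Atom(), (1000, 1000), filters=filters)
    h5f.close()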
Re: [Pytables-users] Problem using HDFStore in pandas on Windows 64-bit Anaconda CE
Hi Jon, Unfortunately, I have no way of testing this out. I will say that I have had problems with HDF5 and Anaconda on windows before since they only ship the static *.lib hdf5 libraries. So it may be the case that the pandas - pytables / hdf5 interface hasn't been properly linked. Barring someone on this list who can test things out for you, you might try grabbing the PyTables source from github and building it on top of your install of Anaconda. Sorry... Be Well Anthony On Fri, Feb 15, 2013 at 3:29 AM, Jon Rowland rowland@gmail.com wrote: Hi - apologies if this is a duplicate, I had an error sending the first time and wasn't sure if it made it through. I have an issue using pandas/HDFStore/pytables in the Anaconda CE distribution on Windows 64-bit. After a little troubleshooting with the Anaconda/pandas lists, it's been suggested that it might be a pytables issue (or at least some kind of package mismatch causing pytables not to work). I have a clean install of Anaconda 1.3.1 64-bit CE edition on a Windows 64-bit machine. Running the pytables self-test gives the following output: -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= PyTables version: 2.4.0 HDF5 version: 1.8.9 NumPy version: 1.6.2 Numexpr version: 2.0.1 (not using Intel's VML/MKL) Zlib version: 1.2.3 (in Python interpreter) Blosc version: 1.1.3 (2010-11-16) Cython version: 0.17.4 Python version: 2.7.3 |AnacondaCE 1.3.1 (64-bit)| (default, Jan 7 2013, 09:47:12) [MSC v.1500 64 bit (AMD64)] Byte-ordering: little Detected cores: 4 -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Then I get a *lot* of output to standard error - pages and pages of it - that looks something like this: C:\Anaconda\lib\site-packages\tables\filters.py:253: FiltersWarning: compression library ``bzip2`` is not available; using ``zlib`` instead % (complib, default_complib), FiltersWarning ) C:\Anaconda\lib\site-packages\tables\filters.py:253: FiltersWarning: compression library ``lzo`` is not available; using ``zlib`` instead % (complib, default_complib), FiltersWarning ) HDF5-DIAG: Error detected in HDF5 (1.8.9) thread 0: #000: ..\..\src\H5A.c line 241 in H5Acreate2(): not a type major: Invalid arguments to routine minor: Inappropriate type HDF5-DIAG: Error detected in HDF5 (1.8.9) thread 0: #000: ..\..\src\H5A.c line 920 in H5Awrite(): not an attribute major: Invalid arguments to routine minor: Inappropriate type EHDF5-DIAG: Error detected in HDF5 (1.8.9) thread 0: #000: ..\..\src\H5A.c line 241 in H5Acreate2(): not a type major: Invalid arguments to routine minor: Inappropriate type Is this something I'm doing wrong or is there something wrong with the package? Any help would be appreciated. Thanks, Jon
Re: [Pytables-users] Can't rename node with child node
Thanks for hunting this down, Michka. It was a pretty simple change, so I went ahead and merged it in. Be Well Anthony On Sun, Feb 10, 2013 at 9:29 AM, Michka Popoff michkapop...@gmail.com wrote: After some (long) code browsing I sent a pull request for my problem: https://github.com/PyTables/PyTables/pull/208 Thanks, I was not sure if it was a bug or an intended functionality preventing me from renaming nodes with children. Michka Le 10 févr. 2013 à 09:08, Anthony Scopatz a écrit : Hey Michka, This seems like a bug. Please open an issue on github or submit a pull request if you figure out a fix. Thanks! Be Well Anthony On Sat, Feb 9, 2013 at 4:44 AM, Michka Popoff michkapop...@gmail.com wrote: Hello, I am not able to rename a node which has child nodes. The doc doesn't specify any restriction on the usage of the renameNode method. Here is a small example script to show what I want to achieve:

import tables

# Create file and groups
file = tables.openFile('test.hdf5', 'w')
file.createGroup('/', 'data', 'Data')
file.createGroup('/data', 'id', 'Single Data')
file.createGroup('/data/id/', 'curves1', 'Curve 1')
file.createGroup('/data/id/', 'curves2', 'Curve 2')

# Rename (works)
file.renameNode('/data/id/curves1', 'newcurve1')

# Rename (doesn't work)
file.renameNode('/data/id', 'newid')

The first rename will work and rename /data/id/curves1 to /data/id/newcurve1. The second rename will fail with the following traceback:

Traceback (most recent call last):
  File "Rename.py", line 14, in <module>
    file.renameNode("/data/id", "newid")
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tables/file.py", line 1157, in renameNode
    obj._f_rename(newname, overwrite)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tables/node.py", line 590, in _f_rename
    self._f_move(newname=newname, overwrite=overwrite)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tables/node.py", line 674, in _f_move
    self._g_move(newparent, newname)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tables/group.py", line 565, in _g_move
    self._v_file._updateNodeLocations(oldPath, newPath)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tables/file.py", line 2368, in _updateNodeLocations
    descendentNode._g_updateLocation(newNodePPath)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tables/node.py", line 414, in _g_updateLocation
    file_._refNode(self, newPath)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tables/file.py", line 2287, in _refNode
    file already has a node with path ``%s`` % nodePath
AssertionError: file already has a node with path ``/data``

Closing remaining open files: test.hdf5... done Exception AttributeError: 'File' object has no attribute '_aliveNodes' in ignored Perhaps I cannot do what I want to do here, or is there another method I should use? Thanks in advance Michka Popoff
Re: [Pytables-users] Using built-in slice objects on CArray
Hi Andreas, I think that the problem here is that coord_slice is actually a list of slices, which you can't index by. (Though, you may be able to in numpy...) Try something like _ds[coord_slice[0]] instead. Be Well Anthony On Tue, Jan 22, 2013 at 8:44 AM, Andreas Hilboll li...@hilboll.de wrote: Hi, how can I use Python's built-in `slice` object on CArray? Currently, I'm trying In: coord_slice Out: [slice(0, 31, None), slice(0, 5760, None), slice(0, 2880, None)] In: _ds Out: /data/mydata (CArray(31, 5760, 2880), shuffle, blosc(5)) '' atom := Float32Atom(shape=(), dflt=0.0) maindim := 0 flavor := 'numpy' byteorder := 'little' chunkshape := (1, 45, 2880) In: _ds[coord_slice] Out: *** TypeError: long() argument must be a string or a number, not 'slice' The problem is that I want to write something generic, and I don't know beforehand how many dimensions the CArray has. My current plan is to create a tuple of slice objects programmatically (using a list comprehension), and then use this tuple as an index. But apparently it doesn't work with pytables 2.3.1. Any suggestions on how to accomplish my task are greatly appreciated :) Cheers, Andreas.
Re: [Pytables-users] Using built-in slice objects on CArray
yeah, indexing with a list (rather than a tuple) has a different meaning. The most notable place I have seen list-indexing used is with numpy structured arrays. In all other locations the tuple slicing is for drilling down different dimensions, as you say. On Wed, Jan 23, 2013 at 10:25 AM, Andreas Hilboll li...@hilboll.de wrote: Hi Anthony, thanks for your input. However, I need to slice in multiple dimensions simultaneously, because my array is very large and I don't want to clog memory. However, I found out that it works with a tuple of slice objects, so _ds[tuple(coord_slice)] works as expected. Cheers, Andreas.
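For the generic case, the tuple of slices can be built programmatically for any rank (a sketch):

    # one full-extent slice per dimension; override individual entries as needed
    coord_slice = tuple(slice(0, n) for n in _ds.shape)
    block = _ds[coord_slice]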
Re: [Pytables-users] Nested Iteration of HDF5 using PyTables
HI David, Tables and table column iteration have been overhauled fairly recently [1]. So you might try creating two iterators, offset by one, and then doing the comparison. I am hacking this out super quick so please forgive me:

from itertools import izip

with tb.openFile(...) as f:
    data = f.root.data
    data_i = iter(data)
    data_j = iter(data)
    data_i.next()  # throw the first value away
    for i, j in izip(data_i, data_j):
        compare(i, j)

You get the idea ;) Be Well Anthony 1. https://github.com/PyTables/PyTables/issues/27 On Thu, Jan 3, 2013 at 9:25 AM, David Reed david.ree...@gmail.com wrote: I was hoping someone could help me out here. This is from a post I put up on StackOverflow. I have a fairly large dataset that I store in HDF5 and access using PyTables. One operation I need to do on this dataset is pairwise comparisons between each of the elements. This requires 2 loops: one to iterate over each element, and an inner loop to iterate over every other element. This operation thus looks at N(N-1)/2 comparisons. For fairly small sets I found it to be faster to dump the contents into a multidimensional numpy array and then do my iteration. I run into problems with large sets because of memory issues and need to access each element of the dataset at run time. Putting the elements into an array gives me about 600 comparisons per second, while operating on the hdf5 data itself gives me about 300 comparisons per second. Is there a way to speed this process up? Example follows (this is not my real code, just an example):

*Small Set*:

with tb.openFile(h5_file, 'r') as f:
    data = f.root.data
    N_elements = len(data)
    elements = np.empty((N_elements, int(1e5)))
    for ii, d in enumerate(data):
        elements[ii] = d['element']
D = np.empty((N_elements, N_elements))
for ii in xrange(N_elements):
    for jj in xrange(ii+1, N_elements):
        D[ii, jj] = compare(elements[ii], elements[jj])

*Large Set*:

with tb.openFile(h5_file, 'r') as f:
    data = f.root.data
    N_elements = len(data)
    D = np.empty((N_elements, N_elements))
    for ii in xrange(N_elements):
        for jj in xrange(ii+1, N_elements):
            D[ii, jj] = compare(data['element'][ii], data['element'][jj])
Re: [Pytables-users] Nested Iteration of HDF5 using PyTables
Yup, that is right, thanks Josh! On Thu, Jan 3, 2013 at 12:29 PM, Josh Ayers josh.ay...@gmail.com wrote: David, The change in issue 27 was only for iteration over a tables.Column instance. To use it, tweak Anthony's code as follows. This will iterate over the element column, as in your original example. Note also that this will only work with the development version of PyTables available on github. It will be very slow using the released v2.4.0.

from itertools import izip

with tb.openFile(...) as f:
    data = f.root.data.cols.element
    data_i = iter(data)
    data_j = iter(data)
    data_i.next()  # throw the first value away
    for i, j in izip(data_i, data_j):
        compare(i, j)

Hope that helps, Josh
Re: [Pytables-users] Pytables-users Digest, Vol 80, Issue 3
On Thu, Jan 3, 2013 at 2:17 PM, David Reed david.ree...@gmail.com wrote: Thanks a lot for the help so far guys! Looking at itertools, I found what I believe to be the perfect function for what I need: itertools.combinations. This appears to be a valid replacement for the method proposed. Yes, combinations is awesome! There is a small problem that I didn't mention: my compare function actually takes as inputs 2 columns from the table, like so:

D = np.empty((N_elements, N_elements))
for ii in xrange(N_elements):
    for jj in xrange(ii+1, N_elements):
        D[ii, jj] = compare(data['element1'][ii], data['element1'][jj],
                            data['element2'][ii], data['element2'][jj])

Is there an efficient way of using itertools with this structure? You can always make two other iterators for each column. Since you have two columns you would have 4 iterators. I am not sure how fast this is going to be, but I am confident that there is definitely a way to do this in one for-loop, which is going to be way faster than nested loops. Be Well Anthony
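One way to collapse the two-column pairwise comparison into a single loop with combinations() (a sketch following the names in the example above; note that combinations() materializes its input as a tuple internally, so this trades memory for simplicity):

    from itertools import combinations, izip
    import tables as tb

    with tb.openFile(h5_file, 'r') as f:
        data = f.root.data
        # pair up both columns row by row, then take all unordered pairs of rows
        pairs = combinations(izip(data.cols.element1, data.cols.element2), 2)
        for (e1_i, e2_i), (e1_j, e2_j) in pairs:
            d = compare(e1_i, e1_j, e2_i, e2_j)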
Re: [Pytables-users] Pytables-users Digest, Vol 80, Issue 4
Josh is right that you can just edit the code by hand (which works but sucks). However, on Windows -- on the rare occasion when I also have to develop on it -- I typically use a distribution that includes a compiler, cython, hdf5, and pytables already, and then I install my development version from github OVER this. I recommend either EPD or Anaconda, though other distributions listed here [1] might also work. Be well Anthony 1. http://numfocus.org/projects-2/software-distributions/ On Thu, Jan 3, 2013 at 3:46 PM, Josh Ayers josh.ay...@gmail.com wrote: The change was in pure Python code, so you should be able to just paste the changes into your local copy. Start with the table.Column.__iter__ method (lines 3296-3310) here: https://github.com/PyTables/PyTables/blob/b479ed025f4636f7f4744ac83a89bc947808907c/tables/table.py It needs to be modified slightly because it uses some additional features that aren't available in the released version (the out=buf_slice argument to table.read). The following should work:

def __iter__(self):
    table = self.table
    itemsize = self.dtype.itemsize
    nrowsinbuf = table._v_file.params['IO_BUFFER_SIZE'] // itemsize
    max_row = len(self)
    for start_row in xrange(0, len(self), nrowsinbuf):
        end_row = min([start_row + nrowsinbuf, max_row])
        buf = table.read(start_row, end_row, 1, field=self.pathname)
        for row in buf:
            yield row

I haven't tested this, but I think it will work. Josh On Thu, Jan 3, 2013 at 1:25 PM, David Reed david.ree...@gmail.com wrote: I apologize if I'm starting to sound helpless, but I'm forced to work on Windows 7 at work and have never had luck compiling Python source successfully. I have had to rely on precompiled binaries and now it's biting me in the butt. Is there any quick fix I can do to improve this iteration using v2.4.0? On Thu, Jan 3, 2013 at 1:44 PM, David Reed david.ree...@gmail.com wrote: Thanks Anthony, but unless I'm missing something I don't think that method will work, since it will only be comparing the ith element with the ith+1 element. I still need 2 for loops, right? Using itertools might speed things up though; I've never used them, so I will give it a shot and let you know how it goes. Looks like I need to download the latest release before I do that too. Thanks for the help.
-Dave

Message: 1 Date: Thu, 3 Jan 2013 11:11:47 -0600 From: Anthony Scopatz scop...@gmail.com Subject: Re: [Pytables-users] Nested Iteration of HDF5 using PyTables Hi David, Tables and table column iteration have been overhauled fairly recently [1]. So you might try creating two iterators, offset by one, and then doing the comparison. I am hacking this out
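Anthony's message is truncated above, but a related, block-based variant of the idea -- reading the column in chunks so that only two blocks are in memory at once -- might look like the following sketch. The BLOCK size and the compare() function are made up for illustration; h5_file and the 'element' field come from the original example.

    import numpy as np
    import tables as tb

    BLOCK = 1000
    with tb.openFile(h5_file, 'r') as f:
        data = f.root.data
        n = data.nrows
        D = np.empty((n, n))
        for i0 in xrange(0, n, BLOCK):
            a = data.read(i0, min(i0 + BLOCK, n), field='element')
            for j0 in xrange(i0, n, BLOCK):
                b = data.read(j0, min(j0 + BLOCK, n), field='element')
                for ii in xrange(len(a)):
                    # within the same block, only visit pairs above the diagonal
                    start = ii + 1 if i0 == j0 else 0
                    for jj in xrange(start, len(b)):
                        D[i0 + ii, j0 + jj] = compare(a[ii], b[jj])

This still performs all N(N-1)/2 comparisons, but each element is read from disk in large chunks instead of one row at a time.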
Re: [Pytables-users] pytables: could not find the HDF5 runtime
Try leaving the pytables source dir and then running IPython. On Mon, Dec 10, 2012 at 9:20 AM, Jennifer Flegg jennifer.fl...@wwarn.org wrote: Hi, I'm trying to install pytables and it's proving difficult (using Mac OS X 10.6.4). I have installed HDF5 in /usr/local/hdf5 and set the environment variable $HDF5_DIR to /usr/local/hdf5. When I run setup, I get a warning about not being able to find the HDF5 runtime.

    ndmmac149:tables-2.4.0 jflegg$ sudo python setup.py install --hdf5=/usr/local/hdf5
    * Found numpy 1.6.1 package installed.
    * Found numexpr 2.0.1 package installed.
    * Found Cython 0.17.2 package installed.
    * Found HDF5 headers at ``/usr/local/hdf5/include``, library at ``/usr/local/hdf5/lib``.
    .. WARNING:: Could not find the HDF5 runtime. The HDF5 shared library was *not* found in the default library paths. In case of runtime problems, please remember to install it.
    ld: library not found for -llzo2
    collect2: ld returned 1 exit status
    ld: library not found for -llzo2
    collect2: ld returned 1 exit status
    * Could not find LZO 2 headers and library; disabling support for it.
    ld: library not found for -llzo
    collect2: ld returned 1 exit status
    ld: library not found for -llzo
    collect2: ld returned 1 exit status
    * Could not find LZO 1 headers and library; disabling support for it.
    * Found bzip2 headers at ``/usr/include``, library at ``/usr/lib``.
    running install
    running build
    running build_py
    creating build
    creating build/lib.macosx-10.5-i386-2.7
    creating build/lib.macosx-10.5-i386-2.7/tables
    copying tables/__init__.py -> build/lib.macosx-10.5-i386-2.7/tables
    copying tables/array.py -> build/lib.macosx-10.5-i386-2.7/tables

When I import pytables in python, I get the following error message:

    In [1]: import tables
    ImportError                               Traceback (most recent call last)
    /Users/jflegg/<ipython-input-1-389ecae14f10> in <module>()
    ----> 1 import tables
    /Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/tables/__init__.py in <module>()
         28
         29 # Necessary imports to get versions stored on the Pyrex extension
    ---> 30 from tables.utilsExtension import getPyTablesVersion, getHDF5Version
         31
         32
    ImportError: dlopen(/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/tables/utilsExtension.so, 2): Symbol not found: _H5E_CALLBACK_g
      Referenced from: /Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/tables/utilsExtension.so
      Expected in: flat namespace in /Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/tables/utilsExtension.so

Any help would be greatly appreciated. Jennifer
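For what it is worth, a "Symbol not found" error like this usually means the dynamic linker is picking up a different libhdf5 at runtime than the one PyTables was built against. One common, hedged fix on OS X -- assuming HDF5 really does live in /usr/local/hdf5 as described above -- is to point the linker at it explicitly:

    # assumption: HDF5 was installed under /usr/local/hdf5 as described above
    export DYLD_LIBRARY_PATH=/usr/local/hdf5/lib:$DYLD_LIBRARY_PATH
    python -c "import tables; print tables.getHDF5Version()"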
Re: [Pytables-users] pytables: could not find the HDF5 runtime
Hi Jennifer, Yeah, that is right, they are not in EPD Free. However, they are in Anaconda CE (http://continuum.io/downloads.html). Note the CE, rather than the full version. Be Well Anthony On Mon, Dec 10, 2012 at 4:07 PM, Jennifer Flegg jennifer.fl...@wwarn.org wrote: Hi Anthony, Thanks for your reply. I installed HDF5 also from source. The reason I'm building hdf5 and pytables myself is that they don't seem to be available through EPD any more (at least in the free version: http://www.enthought.com/products/epdlibraries.php). They used to both come bundled in EPD, but not anymore, which is a pain. Many thanks, Jennifer
Re: [Pytables-users] Error reading attribute with compound data type
On Wed, Nov 28, 2012 at 1:03 AM, Antonio Valentino antonio.valent...@tiscali.it wrote: Hi Anthony, hi dashesy, On 28 Nov 2012, at 00:57, Anthony Scopatz scop...@gmail.com wrote: This [1] seems to indicate that this kind of thing should be supported via numpy structured arrays. However, I bet that this data set did not start out as a numpy structured array. This might explain the problem if the flavor is wrong. I would think that a fix should be relatively easy. Be Well Anthony 1. http://pytables.github.com/usersguide/libref/declarative_classes.html?highlight=attr#the-attributeset-class

I'm not sure that PyTables is able to handle variable-length strings in compound data types at the moment.

Oops, I didn't notice that... Antonio is right, the variable-length part of this is probably your issue.

On Tue, Nov 27, 2012 at 5:17 PM, dashesy dash...@gmail.com wrote: I have a file that has attributes with a nested compound type; when reading it with PyTables 2.4.0 I get this error:

    C:\Python27\lib\site-packages\tables\attributeset.py:293: DataTypeWarning: Unsupported type for attribute 'BmiRoot' in node '/'. Offending HDF5 class: 6 value = self._g_getAttr(self._v_node, name)
    C:\Python27\lib\site-packages\tables\attributeset.py:293: DataTypeWarning: Unsupported type for attribute 'BmiChanExt' in node 'channel1'. Offending HDF5 class: 6 value = self._g_getAttr(self._v_node, name)

Yes, it is not clear. Hard to say what exactly happens; I just wanted to know if this is not already fixed in newer versions. I will be more than happy to work on it; any pointers as to where to look are appreciated.

I don't think there are changes that would impact this issue. Anyway, you can give the development branch [1] a try.

Any help is very appreciated. [1] https://github.com/PyTables/PyTables

Here is the (partial) dump of the file (for brevity I deleted non-related data parts but can provide the full file if needed):

    HDF5 pause5-10-5.ns2.h5 {
    GROUP / {
       ATTRIBUTE BmiRoot {
          DATATYPE /BmiRootAttr_t
          DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
          DATA { (0): { 1, 0, 0, 1, 2008-12-02 22:57:02.251000, 1 kS/s, } }
       }
       DATATYPE BmiRootAttr_t H5T_COMPOUND {
          H5T_STD_U32LE MajorVersion;
          H5T_STD_U32LE MinorVersion;
          H5T_STD_U32LE Flags;
          H5T_STD_U32LE GroupCount;
          H5T_STRING { STRSIZE H5T_VARIABLE; STRPAD H5T_STR_NULLTERM; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; } Date;
          H5T_STRING { STRSIZE H5T_VARIABLE; STRPAD H5T_STR_NULLTERM; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; } Application;
          H5T_STRING { STRSIZE H5T_VARIABLE; STRPAD H5T_STR_NULLTERM; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; } Comment;
       }
       GROUP channel {
          DATATYPE BmiChanAttr_t H5T_COMPOUND {
             H5T_STD_U16LE ID;
             H5T_IEEE_F32LE Clock;
             H5T_IEEE_F32LE SampleRate;
             H5T_STD_U8LE SampleBits;
          }
          DATATYPE BmiChanExt2Attr_t H5T_COMPOUND {
             H5T_STD_I32LE DigitalMin;
             H5T_STD_I32LE DigitalMax;
             H5T_STD_I32LE AnalogMin;
             H5T_STD_I32LE AnalogMax;
             H5T_STRING { STRSIZE 16; STRPAD H5T_STR_NULLTERM; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; } AnalogUnit;
          }
          DATATYPE BmiChanExtAttr_t H5T_COMPOUND {
             H5T_IEEE_F64LE NanoVoltsPerLSB;
             H5T_COMPOUND {
                H5T_STD_U32LE HighPassFreq;
                H5T_STD_U32LE HighPassOrder;
                H5T_STD_U16LE HighPassType;
                H5T_STD_U32LE LowPassFreq;
                H5T_STD_U32LE LowPassOrder;
                H5T_STD_U16LE LowPassType;
             } Filter;
             H5T_STD_U8LE PhysicalConnector;
             H5T_STD_U8LE ConnectorPin;
             H5T_STRING { STRSIZE H5T_VARIABLE; STRPAD H5T_STR_NULLTERM; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; } Label;
          }
          DATATYPE BmiChanFiltAttr_t H5T_COMPOUND {
             H5T_STD_U32LE HighPassFreq;
             H5T_STD_U32LE HighPassOrder;
             H5T_STD_U16LE HighPassType;
             H5T_STD_U32LE LowPassFreq;
             H5T_STD_U32LE LowPassOrder;
             H5T_STD_U16LE LowPassType;
          }
          GROUP channel1 {
             ATTRIBUTE BmiChan {
                DATATYPE /channel/BmiChanAttr_t
                DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
                DATA { (0): { 1, 3, 1000, 16
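As a hedged workaround while PyTables lacks support for this, h5py (assuming a reasonably recent version) can usually read compound attributes that contain variable-length strings, returning them as numpy structured values:

    import h5py

    # workaround sketch, not a PyTables fix; field names come from the dump above
    with h5py.File('pause5-10-5.ns2.h5', 'r') as f:
        root_attr = f.attrs['BmiRoot']
        print root_attr['MajorVersion'], root_attr['Date']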
Re: [Pytables-users] Histogramming 1000x too slow
On Mon, Nov 19, 2012 at 12:59 PM, Jon Wilson j...@fnal.gov wrote: Hi Anthony, On 11/17/2012 11:49 AM, Anthony Scopatz wrote: Hi Jon, Barring changes to numexpr itself, this is exactly what I am suggesting. Well, either writing one query expr per bin or (more cleverly) writing one expr which, when evaluated for a row, returns the integer bin number (1, 2, 3, ...) this row falls in. Then you can simply count() for each bin number. For example, if you wanted to histogram data which ran from [0,100] into 10 bins, then the expr r/10 into a dtype=int would do the trick. This has the advantage of only running over the data once. (Also, I am not convinced that running over the data multiple times is less efficient than doing row-based iteration. You would have to test it on your data to find out.)

It is a reduction operation, and would greatly benefit from chunking, I expect. Not unlike sum(), which is implemented as a specially supported reduction operation inside numexpr (buggily, last I checked). I suspect that a substantial improvement in histogramming requires direct support from either pytables or from numexpr. I don't suppose that there might be a chunked-reduction interface exposed somewhere that I could hook into?

This is definitely a feature to request from numexpr.

I've been fiddling around with Stephen's code a bit, and it looks like the best way to do things is to read chunks (whether exactly of table.chunksize or not is a matter for optimization) of the data in at a time, and create histograms of those chunks. Then combining the histograms is a trivial sum operation. This type of approach can be generically applied in many cases, I suspect, where row-by-row iteration is prohibitively slow but the dataset is too large to fit into memory. As I understand it, this idea is the primary win of PyTables in the first place! So I think it would be extraordinarily helpful to provide a chunked-iteration interface for this sort of use case. It can be as simple as a wrapper around Table.read():

    class Table:
        def chunkiter(self, field=None):
            n = 0
            while n * self.chunksize < self.nrows:
                yield self.read(n * self.chunksize, (n + 1) * self.chunksize, field=field)
                n += 1

Then I can write something like:

    bins = linspace(-1, 1, 101)
    hist = sum(histogram(chunk, bins=bins)[0] for chunk in mytable.chunkiter(myfield))

Preliminary tests seem to indicate that, for a table with 1 column and 10M rows, reading in chunks of 10x chunksize gives the best read-time-per-row. This is perhaps naive as regards chunksize black magic, though...

Hello Jon, Sorry about the slow reply, but I think that what is proposed in issue #27 [1] would solve the above by default, right? Maybe you could pull Josh's code and test it on the above example to make sure. And then we could go ahead and merge this in :).

And of course, if implemented by numexpr, it could benefit from the nice automatic multithreading there. This would be nice, but as you point out, not totally necessary here. Also, I might dig in a bit and see about extending the field argument to read so it can read multiple fields at once (to do N-dimensional histograms), as you suggested in a previous mail some months ago. Also super cool, but not immediate ;) Be Well Anthony 1. https://github.com/PyTables/PyTables/issues/27 Best Regards, Jon
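For reference, a self-contained version of the chunked-histogram idea sketched above might look like the following; the function name, the block size, and the column name 'x' in the usage note are made up for illustration:

    import numpy as np
    import tables as tb

    def hist_column(table, field, bins, blocksize=100000):
        # accumulate per-block histograms; summing the counts is exact
        counts = np.zeros(len(bins) - 1, dtype=np.int64)
        for start in xrange(0, table.nrows, blocksize):
            stop = min(start + blocksize, table.nrows)
            chunk = table.read(start, stop, field=field)
            counts += np.histogram(chunk, bins=bins)[0]
        return counts

    # usage, assuming f.root.mytable has a float column named 'x':
    # bins = np.linspace(-1, 1, 101)
    # counts = hist_column(f.root.mytable, 'x', bins)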
Re: [Pytables-users] What is the best way to copy a table from one file to another?
Hey Aquil, I think File.copyNode() [1] with the newparent argument set to a group in another file will do what you want. Be Well Anthony 1. http://pytables.github.com/usersguide/libref/file_class.html?highlight=copy#tables.File.copyNode

On Thu, Nov 8, 2012 at 10:02 AM, Aquil H. Abdullah aquil.abdul...@gmail.com wrote: I create the tables in an HDF5 file from three different python processes. I needed to modify one of the processes, but not the others. Is there an easy way to copy the two tables that did not change to the new file? -- Aquil H. Abdullah I never think of the future. It comes soon enough - Albert Einstein
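A minimal sketch of that (file and node names are invented):

    import tables as tb

    src = tb.openFile('old_results.h5', 'r')
    dst = tb.openFile('new_results.h5', 'a')
    # newparent may be a Group that lives in a different open file
    src.copyNode('/group/table_a', newparent=dst.root)
    src.copyNode('/group/table_b', newparent=dst.root)
    dst.flush()
    src.close()
    dst.close()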
Re: [Pytables-users] Can PyTable use 7Zip
Hello Jim, The major hurdle here is exposing 7Zip to HDF5. Luckily, it appears as if this may have been taken care of for you by the HDF Group already [1]. You should google around to see what has already been done and how hard it is to install. The next step is to expose this as a compression option for filters [2]. I am fairly certain that this is just a matter of adding a simple flag and making sure 7Zip works if available. This should not be too difficult at all, and we would happily consider/review any pull request that implemented this. Barring any major concerns, I feel that it would likely be accepted. Be Well Anthony 1. http://www.hdfgroup.org/ftp/HDF5/releases/hdf5-1.6/hdf5-1.6.7/src/unpacked/release_docs/INSTALL_Windows_From_Command_Line.txt 2. http://pytables.github.com/usersguide/libref/helper_classes.html#the-filters-class

On Thu, Nov 8, 2012 at 9:52 PM, Jim Knoll jim.kn...@spottradingllc.com wrote: I would like to squeeze out as much compression as I can get. I do not mind spending time on the front end as long as I do not kill my read performance. Seems like 7Zip is well suited to my data. Is it possible to have 7Zip used as the native internal compression for a pytable? If not, how hard would it be to add this option? -- Jim Knoll, Data Developer, Spot Trading L.L.C
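For context, this is how compression is selected today through the Filters class; zlib, lzo, bzip2, and blosc are the complib values PyTables currently ships, so a 7Zip/LZMA filter would presumably become one more complib choice:

    import tables as tb

    # maximum-effort blosc compression for everything created in this file
    filters = tb.Filters(complevel=9, complib='blosc')
    f = tb.openFile('compressed.h5', 'w', filters=filters)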
Re: [Pytables-users] pyTable index from c++
On Thu, Nov 8, 2012 at 10:19 PM, Jim Knoll jim.kn...@spottradingllc.com wrote: I love the index function and promote the internal use of PyTables at my company. The availability of an indexed method to speed the search is the main reason why. We are a mixed shop using c++ to create H5 (just for the raw speed ... need to keep up with streaming data). End users start with python pyTables to consume the data. (Often after we have created indexes from python pytables.col.col1.createIndex()) Sometimes the users come up with something we want to do thousands of times and performance is critical. But then we are falling back to c++. We can use our own index method but would like to make double use of the PyTables index. I know the python table.where( is implemented in C.

Hi Jim, This is only kind of true. Querying (i.e. all of the where*() methods) is actually mostly written in Python in the tables.py and expressions.py files. However, they make use of numexpr [1].

Is there a way to access that from c or c++? Don't mind if I need to do work to get the result. I think in my case the work may be worth it.

*PLAN 1:* One possibility: the relevant parts of PyTables are written in pure Python, and we could maybe try (without making any edits to these files) to convert them to Cython. This has the advantage that for Cython files, if you write the appropriate C++ header file and link against the shared library correctly, it is possible to access certain functions from C/C++. BUT, I am not sure how much of a speed boost you would get out of this, since you would still be calling out to the Python interpreter to get these results. You are just calling Python's virtual machine from C++ rather than calling it from Python (like normal). This has the advantage that you would basically get access to these functions acting on tables from C++.

*PLAN 2:* Alternatively, numexpr itself is mostly written in C++ already. You should be able to call core numexpr functions directly. However, you would have to feed it data that you read from the tables yourself. These could even be table indexes. On a personal note, if you get code working that does this, I would be interested in seeing your implementation. (I have another project where I have tables that I want to query from C++.) Let us know what route you ultimately end up taking, or if you have any further questions! Be Well Anthony 1. http://code.google.com/p/numexpr/source/browse/#hg%2Fnumexpr -- Jim Knoll, Data Developer, Spot Trading L.L.C
Re: [Pytables-users] Is it possible to manipulate a VariableNode in a query?
Hello Aquil, Unfortunately, you currently cannot use indexing in queries (i.e. symbol[:3] == x) and may only use the whole variable (symbol == x). This is a limitation of numexpr. Please file a ticket with them if you would like to see this changed. Sorry! Be Well Anthony

On Tue, Oct 30, 2012 at 10:44 AM, Aquil H. Abdullah aquil.abdul...@gmail.com wrote: Hello All, I am querying a table that has a field with a string value. I would like to determine if the string matches a pattern. Is there a simple way to do that through readWhere and the condition syntax? None of the following work, but I was wondering if it were possible to do something similar:

    table.readWhere("'CLZ' in field")

or

    table.readWhere("symbol[:3] == 'CLZ'")

Thanks! -- Aquil H. Abdullah I never think of the future. It comes soon enough - Albert Einstein
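Until numexpr supports string slicing, one hedged workaround is to do the substring test in Python during row iteration; the file and column names here are invented:

    import tables as tb

    with tb.openFile('quotes.h5', 'r') as f:
        table = f.root.quotes
        # numexpr cannot evaluate symbol[:3], so test each row in Python
        matches = [row.fetch_all_fields() for row in table
                   if row['symbol'][:3] == 'CLZ']

This loses the speed of an indexed query, but it works for arbitrary pattern tests.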
Re: [Pytables-users] Large (to very large) datasets...
Hi Andrea, Your problem is twofold. 1. Your timing wasn't reporting the time per data set, but rather the total time since writing all data sets. You need to put the start time in the loop to get the time per data set. 2. Your larger problem was that you were writing too many times. Generally it is faster to write fewer, bigger sets of data than to perform a lot of small write operations. Since you had data set opening and writing in a doubly nested loop, it is not surprising that you were getting terrible performance. You were basically maximizing HDF5 overhead ;). Using slicing I removed the outermost loop and saw timings like the following:

    H5 file creation time: 7.406
    Saving results for table: 0.0105440616608
    Saving results for table: 0.0158948898315
    Saving results for table: 0.0164661407471
    Saving results for table: 0.00654292106628
    Saving results for table: 0.00676298141479
    Saving results for table: 0.00664114952087
    Saving results for table: 0.0066990852356
    Saving results for table: 0.00687289237976
    Saving results for table: 0.00664210319519
    Saving results for table: 0.0157809257507
    Saving results for table: 0.0141618251801
    Saving results for table: 0.00796294212341

Please see the attached version, at around line 82. Additionally, if you need to focus on performance I would recommend reading http://pytables.github.com/usersguide/optimization.html. PyTables can be blazingly fast when implemented correctly. I would highly recommend looking into compression. I hope this helps! Be Well Anthony

On Tue, Oct 30, 2012 at 4:55 PM, Andrea Gavana andrea.gav...@gmail.com wrote: Hi All, I am pretty new to pytables and I am facing a problem of actually storing and retrieving data to/from a large dataset. My situation is the following: 1. I am running stochastic simulations of a number of objects (typically between 100-1,000 simulations); 2. For every simulation, I have around 1,200 objects, and for each of them I have 7 timeseries of 600 time-steps each. I thought of using pytables to try and get some sense out of my simulations, but I am failing to implement something intelligent (or fast, which is important as well...). The attached script (modified from the pytables tutorial) does the following: 1. Creates a table containing these objects; 2. Adds 1,200 rows, one per object: for each object, I assign a 3D array defined as: results = Float32Col(shape=(NUM_SIM, len(ALL_DATES), 7)), where NUM_SIM is the number of simulations and ALL_DATES are the timesteps. 3. For every simulation, I update the object results (using random numbers in the script). The timings on my computer are as follows (in seconds):

    H5 file creation time: 22.510
    Saving results for simulation 1 : 3.3356567
    Saving results for simulation 2 : 6.2429997921
    Saving results for simulation 3 : 9.1515041
    Saving results for simulation 4 : 12.075752
    Saving results for simulation 5 : 15.217902
    Saving results for simulation 6 : 17.9159998894
    Saving results for simulation 7 : 21.065847
    Saving results for simulation 8 : 23.645084
    Saving results for simulation 9 : 26.5359997749
    Saving results for simulation 10 : 29.5579998493

As you can see, at every simulation the processing time increases by 3 seconds, so by the time I get to 100 or 1,000 I will have more than enough time for 15 coffees in the morning :-D Also, the file creation time is somewhat on the slow side...
I am sure I am missing a lot of things here, so I would appreciate any suggestion to implement my code in a better/more intelligent way (and also suggestions on other approaches in order to do what I am trying to do). Thank you in advance for your suggestions. Andrea. Imagination Is The Only Weapon In The War Against Reality. http://www.infinity77.net
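One way to act on the "fewer, bigger writes" advice is to store each simulation as a single appended block of an EArray instead of updating 1,200 rows per simulation. The shapes match Andrea's description, but the file name and the run_one_simulation() helper are made up for illustration:

    import numpy as np
    import tables as tb

    N_OBJECTS, N_STEPS, N_SERIES = 1200, 600, 7
    f = tb.openFile('simulations.h5', 'w')
    results = f.createEArray('/', 'results', tb.Float32Atom(),
                             shape=(0, N_OBJECTS, N_STEPS, N_SERIES),
                             expectedrows=1000)
    for sim in xrange(1000):
        block = run_one_simulation()            # shape (N_OBJECTS, N_STEPS, N_SERIES)
        results.append(block[np.newaxis, ...])  # one big write per simulation
    f.close()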
Re: [Pytables-users] Large (to very large) datasets...
On Tue, Oct 30, 2012 at 6:20 PM, Andrea Gavana andrea.gav...@gmail.com wrote: Hi Anthony, thank you for your answer; indeed, I was timing it wrongly (I really need to go to sleep...). However, although I understand the need to write fewer times, I am not sure I can actually do it in my situation. Let me explain: 1. I have a GUI which starts a number of parallel processes (up to 16, depending on a user selection); 2. These processes actually do the computation/simulations - so, if I have 1,000 simulations to run and 8 parallel processes, each process gets 125 simulations (each of which holds 1,200 objects with a 600x7 timeseries matrix per object).

Well, you can at least change the order of the loops and see if that helps. That is, rather than doing:

    for i in xrange(...):
        for p in table:
            ...

do the following instead:

    for p in table:
        for i in xrange(...):
            ...

I don't believe that this will help too much, since you are still writing every element individually.

If I had to write out the results only at the end, it would mean for me to find a way to share the 1,200 object matrices in all the parallel processes (and I am not sure if pytables is going to complain when multiple concurrent processes try to access the same underlying HDF5 file). Reading in parallel works pretty well. Writing causes more headaches but can be done. Or I could create one HDF file per process, but given the nature of the simulation I am running, every object in the 1,200-object pool would need to keep a reference to a 125x600x7 matrix (assuming 1,000 simulations and 8 processes) around in memory *OR* I will need to write the results to the HDF5 file for every simulation. Although we have extremely powerful PCs at work, I am not sure it is the right way to go... As always, I am open to all suggestions on how to improve my approach. My basic suggestion is to have all of your processes produce results which are then aggregated by a single master process.
This master is the only one which has write access to the hdf5 file, and it will allow you to create larger arrays and minimize the number of writes that you do. You'll probably want to take a look at this example: https://github.com/PyTables/PyTables/blob/develop/examples/multiprocess_access_queues.py I think that there might be a page in the docs about it now too... But I think that this is the strategy that you want to pursue: multiple compute processes, one write process. Thank you again for your quick and enlightening answer. No problem! Be Well Anthony Andrea. Imagination Is The Only Weapon In The War Against Reality. http://www.infinity77.net

    # - #
    def ask_mailing_list_support(email):
        if mention_platform_and_version() and include_sample_app():
            send_message(email)
        else:
            install_malware()
            erase_hard_drives()
    # - #
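A bare-bones sketch of that strategy, in the spirit of the linked example (all names and shapes here are illustrative, not taken from the example itself):

    import multiprocessing as mp
    import numpy as np
    import tables as tb

    def worker(task_id, queue):
        # stand-in for the real simulation; only computes, never writes HDF5
        result = np.random.rand(600, 7).astype('float32')
        queue.put(result)

    def writer(queue, ntasks):
        # the only process that ever touches the HDF5 file
        f = tb.openFile('results.h5', 'w')
        arr = f.createEArray('/', 'results', tb.Float32Atom(), (0, 600, 7))
        for _ in range(ntasks):
            arr.append(queue.get()[np.newaxis, ...])
        f.close()

    if __name__ == '__main__':
        q = mp.Queue()
        w = mp.Process(target=writer, args=(q, 4))
        w.start()
        workers = [mp.Process(target=worker, args=(i, q)) for i in range(4)]
        for p in workers:
            p.start()
        for p in workers:
            p.join()
        w.join()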
Re: [Pytables-users] Numpy data and Pytable array (error: IndexError: tuple index out of range)
Hello Jack, I am not really sure what is going wrong because you did not post the full code where the exception is happening. However, this error seems to be because the pnts array is one-dimensional. (Which is why pnts.shape has a length of 1.) You could verify this by printing out pnts right before the line that fails. Also, why are you using ctypes? This seems wrong... Be Well Anthony

On Sun, Oct 28, 2012 at 9:25 PM, JACK young.2...@yahoo.com wrote: Hi all, I am new to python and pytables. Currently I am writing a project about clustering and the KNN algorithm. This is what I have got.

    ** code ***
    import ctypes
    import numpy.random as npr
    import numpy as np
    import tables

    # step 0: obtain the cluster
    dtype = np.dtype('f4')
    pnts_inds = np.arange(100)
    npr.shuffle(pnts_inds)
    pnts_inds = pnts_inds[:10]
    pnts_inds = np.sort(pnts_inds)
    for i, ind in enumerate(pnts_inds):
        clusters[i] = pnts_obj[ind]

    # step 1: save the result to a HDF5 file called clst_fn.h5
    filters = tables.Filters(complevel=1, complib='zlib')
    clst_fobj = tables.openFile('clst_fn.h5', 'w')
    clst_obj = clst_fobj.createCArray(clst_fobj.root, 'clusters',
                                      tables.Atom.from_dtype(dtype),
                                      clusters.shape, filters=filters)
    clst_obj[:] = clusters
    clst_fobj.close()

    # step 2: other function blabla

    # step 3: load the cluster from clst_fn
    pnts_fobj = tables.openFile('clst_fn.h5', 'r')
    for pnts in pnts_fobj.walkNodes('/', classname='Array'):
        break

    # step 4: evoke another function (called knn). The function input argument
    # is the data from pnts. I have checked the knn function individually. This
    # function works well if the input is pnts = npr.rand(100,128)
    def knn(pnts):
        pnts = np.ascontiguousarray(pnts)
        N = ctypes.c_uint(pnts.shape[0])
        D = ctypes.c_uint(pnts.shape[1])
        # ...

    # evoke knn using the cluster from clst_fn (see step 3)
    knn(pnts)
    ** end of code ***

My problem now is that python is giving me a hard time by showing: IndexError: tuple index out of range. This error comes from the D = ctypes.c_uint(pnts.shape[1]) line. Obviously, there must be something wrong with the input argument. Any thought about fixing the problem? Thank you in advance.
Re: [Pytables-users] Tutorial at PyData Conference New York
Great! Thanks Francesc! On Sat, Oct 27, 2012 at 6:16 AM, Francesc Alted fal...@gmail.com wrote: Hi, You may be interested in my IPython notebooks and slides for the conference: http://pytables.org/download/PyData2012-NYC.tar.gz (PyData-NYC-2012-v3.pptx) http://www.pytables.org/docs/PyData2012-NYC.pdf [BTW, this time I fell in love with the IPython notebook: it is great!] Unfortunately, I had only 45 minutes for the presentation, so I have not been able to show the PyTables sample files that some of you kindly sent to me (but I'll keep them for the future, one never knows!). -- Francesc Alted
Re: [Pytables-users] Tutorial at PyData Conference New York
On Sat, Oct 27, 2012 at 11:21 AM, Antonio Valentino antonio.valent...@tiscali.it wrote: Hi Francesc, congratulations! On 27/10/2012 13:16, Francesc Alted wrote: Hi, You may be interested in my IPython notebooks and slides for the conference: http://pytables.org/download/PyData2012-NYC.tar.gz http://www.pytables.org/docs/PyData2012-NYC.pdf [BTW, this time I fell in love with the IPython notebook: it is great!] Yes, the IPython notebook is fantastic! ... and the idea of saving tutorials into notebook files is very, very nice :)) Maybe we could provide notebook files for all tutorials in the official doc. +1 ciao -- Antonio Valentino
Re: [Pytables-users] PyTables data files for a tutorial
Hello Francesc, I look forward to hearing how your PyData tutorial goes! Here [1] is a file that stores some basic nuclear data that is freely redistributable. It stores atomic weights, bound neutron scattering lengths, and pre-compiled neutron cross sections (xs) for 5 different energy regimes. Everything in here is a table. The file is rather small (about 165 kB). There are integer, float, and complex columns. I hope that this helps! Be Well Anthony 1. https://s3.amazonaws.com/pyne/prebuilt_nuc_data.h5

On Sun, Oct 21, 2012 at 10:41 AM, Francesc Alted fal...@pytables.org wrote: Hi, I'm going to give a tutorial on PyTables next Thursday during the PyData conference in New York (http://nyc2012.pydata.org/) and I'd like to use some real-life data files. So, if you have some public repository with data generated with PyTables, please tell me. I'm looking for files that are not very large (< 1GB) and that use the Table object significantly. A small description of the data included will be more than welcome too! Thanks! -- Francesc Alted
Re: [Pytables-users] 10 years of PyTables
Congrats Francesc! It is a testament to how useful PyTables is that it is still around and going strong! Personally, I know that your fast, polite, and in-depth responses on the mailing list have made PyTables the great resource that it is. Additionally, it has served as a model to me for how open source projects *should* be run! I'd also really like to thank Antonio for driving new features into the code base! If only we were all on the same continent, we could have a PyTables birthday party or something... Be Well Anthony

On Sun, Oct 21, 2012 at 10:26 AM, Francesc Alted fal...@pytables.org wrote: Hi!, This month PyTables celebrates the 10th anniversary of its first public release: http://osdir.com/ml/python.scientific.user/2002-10/msg00043.html There one can read that then-new features of Python like generators and metaclasses were leveraged. Even a nascent Pyrex (the predecessor of Cython) was used for the extensions. Oh, what memories! The original text below:

- Hi!, PyTables is a Python package which allows dealing with HDF5 tables. Such a table is defined as a collection of records whose values are stored in fixed-length fields. PyTables is intended to be easy-to-use, and tries to be a high-performance interface to HDF5. To achieve this, the newest improvements introduced in Python 2.2 (like generators, or slots and metaclasses in new-style classes) have been used. The Pyrex extension-creation tool has been chosen to access the HDF5 library. This package should be platform independent, but until now I've tested it only with Linux. It's the first public release (v 0.1), and it is in alpha state. You can get it from: http://sourceforge.net/projects/pytables/ There is no project home page yet. Perhaps in the next days. Feedback welcome! -- Francesc Alted PGP KeyID: 0x61C8C11F Scientific applications developer Public PGP key available: http://www.openlc.org/falted_at_openlc.asc Key fingerprint = 1518 38FE 3A3D 8BE8 24A0 3E5B 1328 32CC 61C8 C11F -- Francesc Alted
Re: [Pytables-users] multiprocessing and pytables
Hello Ernesto, So you are actually asking two different questions, one on reading and the other on writing. In general, reading (or querying) with multiprocessing works very well. Writing to a single file with multiple processes is destined to fail, though. So the strategy that many people have adopted is to have multiple processes create the data and then have a master process which acts as a queue for writing out the data. Please see the example here for more inspiration [1]. Note that we have been having problems recently with multiprocess writing out to multiple files, but that is not what you want to do. Be Well Anthony 1. https://github.com/PyTables/PyTables/blob/develop/examples/multiprocess_access_queues.py

On Mon, Oct 15, 2012 at 11:45 AM, Ernesto Picardi e.pica...@unical.it wrote: Dear all, I have an hdf5 file including several tables. To speed up the creation of all tables, could I create each individual table in independent processes launched by the multiprocessing module? Could I employ independent processes to query diverse tables of the same hdf5 file? Thank you very much in advance for any answer. Regards, Ernesto
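For the read side, the safe pattern is to let every process open its own read-only handle rather than sharing one; a small sketch (file, table, and condition names are invented):

    import multiprocessing as mp
    import tables as tb

    def query(condition):
        # each worker opens and closes its own read-only handle
        with tb.openFile('mydata.h5', 'r') as f:
            return f.root.mytable.readWhere(condition)

    if __name__ == '__main__':
        pool = mp.Pool(4)
        results = pool.map(query, ['col1 > 0', 'col1 > 1', 'col1 > 2', 'col1 > 3'])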
Re: [Pytables-users] Closing Read-Only Files
On Fri, Oct 12, 2012 at 8:47 AM, Aquil H. Abdullah aquil.abdul...@gmail.com wrote: I have a process that uses PyTables and opens a bunch of HDF5 files in read-only mode. I know that if I don't close these files, the atexit hook will close the open files and display the message "Closing remaining open files:". My question is simple: is it possible for me to run into any corruption issues by not explicitly closing files that have been opened in read-only mode?

Hello Aquil, I don't think that you will have any issues with doing this. However, I would just go ahead and close all of the files anyway. The 'with' statement is great for that. Also, recall line 2 of the Zen of Python: Explicit is better than implicit. Be Well Anthony -- Aquil H. Abdullah I never think of the future. It comes soon enough - Albert Einstein
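The 'with' statement Anthony mentions looks like this (file and node names invented); the file is closed even if the body raises, so there is no "Closing remaining open files" message at exit:

    import tables as tb

    with tb.openFile('readonly_data.h5', 'r') as f:
        data = f.root.mytable[:]
    # f is closed here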
Re: [Pytables-users] PyTables hangs while opening file in worker process
Hmm, sorry to hear that, Owen. Let me know how it goes. On Thu, Oct 11, 2012 at 11:07 AM, Owen Mackwood owen.mackw...@bccn-berlin.de wrote: Hi Anthony, I tried your suggestion and it has not solved the problem. It could be that it makes the problem go away in the test code because it changes the timing of the processes. I'll see if I can modify the test code to reproduce the hang even with reloading the tables module. Regards, Owen

On 10 October 2012 22:00, Anthony Scopatz scop...@gmail.com wrote: So Owen, I am still not sure what the underlying problem is, but I altered your parallel function to forcibly reload pytables each time it is called. This seemed to work perfectly on my larger system but not at all on my smaller one. If there is a way that you can isolate pytables and not import it globally at all, it might work even better. Below is the code snippet. I hope this helps. Be Well Anthony

    def run_simulation_single((paramspace_pt, params)):
        import sys
        rmkeys = [key for key in sys.modules if key.startswith('tables')]
        for key in rmkeys:
            del sys.modules[key]
        import traceback
        import tables
        try:
            filename = params['results_file']
Re: [Pytables-users] PyTables hangs while opening file in worker process
Hi Owen, So just to confirm this behavior, having run your sample on a couple of my machines: what you see is that the code looks like it gets all the way to the end, and then it stalls right before it is about to exit, leaving some small number of processes (here named "python tables_test.py") in the OS. Is this correct? It seems to be the case that these failures do not happen when I set the processor pool size to be less than or equal to the number of processors (physical or hyperthreaded) that I have on the machine. I was testing this both on a 32 proc cluster and my dual core laptop. Is this also the behavior you have seen? Be Well Anthony

On Tue, Oct 9, 2012 at 8:08 AM, Owen Mackwood owen.mackw...@bccn-berlin.de wrote: Hi Anthony, I've created a reduced example which reproduces the error. I suppose the more processes you can run in parallel, the more likely it is you'll see the hang. On a machine with 8 cores, I see 5-6 processes hang out of 2000. All of the hung tasks had a call stack that looked like this:

    #0 0x7fc8ecfd01fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
    #1 0x7fc8ebd9d215 in H5TS_mutex_lock () from /usr/lib/libhdf5.so.6
    #2 0x7fc8ebaacff0 in H5open () from /usr/lib/libhdf5.so.6
    #3 0x7fc8e224c6a4 in __pyx_pf_6tables_13hdf5Extension_4File__g_new (__pyx_v_self=0x28b35a0, __pyx_args=<value optimized out>, __pyx_kwds=<value optimized out>) at tables/hdf5Extension.c:2820
    #4 0x004abf62 in ext_do_call (f=0x271f4c0, throwflag=<value optimized out>) at Python/ceval.c:4331
    #5 PyEval_EvalFrameEx (f=0x271f4c0, throwflag=<value optimized out>) at Python/ceval.c:2705
    #6 0x004ada51 in PyEval_EvalCodeEx (co=0x247aeb0, globals=<value optimized out>, locals=<value optimized out>, args=0x288cea0, argcount=0, kws=<value optimized out>, kwcount=0, defs=0x25ffd78, defcount=4, closure=0x0) at Python/ceval.c:3253

I've attached the code to reproduce this. It probably isn't quite minimal, but it is reasonably simple (and stereotypical of the kind of operations I use). Let me know if you need anything else, or have questions about my code. Regards, Owen

On 8 October 2012 17:37, Anthony Scopatz scop...@gmail.com wrote: Hello Owen, So __getitem__() calls read() on the items it needs. Both should return a copy in-memory of the data that is on disk. Frankly, I am not really sure what is going on, given what you have said. A minimal example which reproduces the error would be really helpful. From the error that you have provided, though, the only thing that I can think of is that it is related to file opening in the worker processes. Be Well Anthony
Re: [Pytables-users] PyTables hangs while opening file in worker process
So Owen, I am still not sure what the underlying problem is, but I altered your parallel function to forcibly reload pytables each time it is called. This seemed to work perfectly on my larger system but not at all on my smaller one. If there is a way that you can isolate pytables and not import it globally at all, it might work even better. Below is the code snippet. I hope this helps. Be Well Anthony

    def run_simulation_single((paramspace_pt, params)):
        import sys
        # forcibly drop any cached tables modules so the import below starts fresh
        rmkeys = [key for key in sys.modules if key.startswith('tables')]
        for key in rmkeys:
            del sys.modules[key]
        import traceback
        import tables
        try:
            filename = params['results_file']

On Wed, Oct 10, 2012 at 2:06 PM, Owen Mackwood owen.mackw...@bccn-berlin.de wrote: On 10 October 2012 20:08, Anthony Scopatz scop...@gmail.com wrote: So just to confirm this behavior: what you see is that the code looks like it gets all the way to the end, and then it stalls right before it is about to exit, leaving some small number of processes (here named "python tables_test.py") in the OS. Is this correct?

More or less. What's really happening is that if your processor pool has N processes, then each time one of the workers hangs, the pool will have N-1 processes running thereafter. Eventually, when all the tasks have completed (or all workers are hung, something that has happened to me when processing many tasks), the main process will just block waiting for the hung processes. If you're running Linux, when the test is finished and the main process is still waiting on the hung processes, you can just kill the main process. The orphaned processes that are still there afterward are the ones of interest.

It seems to be the case that these failures do not happen when I set the processor pool size to be less than or equal to the number of processors (physical or hyperthreaded) that I have on the machine. I was testing this both on a 32 proc cluster and my dual core laptop. Is this also the behavior you have seen?

No, I've never noticed that to be the case. It appears that the greater the true parallelism (i.e. physical cores on which there are workers executing in parallel), the greater the odds of there being a hang. I don't have any real proof of this though; as with most concurrency bugs, it's tough to be certain of anything. Regards, Owen
Re: [Pytables-users] PyTables hangs while opening file in worker process
On Mon, Oct 8, 2012 at 5:13 AM, Owen Mackwood owen.mackw...@bccn-berlin.de wrote: Hi Anthony, There is a single multiprocessing.Pool which usually has 6-8 processes, each of which is used to run a single task, after which a new process is created for the next task (maxtasksperchild=1 for the Pool constructor). There is a master process that regularly opens an HDF5 file to read out information for the worker processes (data that gets copied into a dictionary and passed as args to the worker's target function). There are no problems with the master process; it never hangs.

Hello Owen, Hmmm, are you actually copying the data (f.root.data[:]) or are you simply passing a reference as arguments (f.root.data)?

The failure appears to be random, affecting less than 2% of my tasks (all tasks are highly similar and should call the same tables functions in the same order). This is running on Debian Squeeze, Python 2.7.3, PyTables 2.4.0. As far as the particular function that hangs... tough to say, since I haven't yet been able to properly debug the issue. The interpreter hangs, which limits my ability to diagnose the source of the problem. I call a number of functions in the tables module from the worker process, including openFile, createVLArray, createCArray, createGroup, flush, and of course close.

So if you are opening a file in the master process and then writing/creating/flushing from the workers, this may cause a problem. Multiprocessing creates a fork of the original process, so you are relying on the file handle from the master process to not accidentally change somehow. Can you try to open the files in the workers rather than the master? I hope that this clears up the issue. Basically, I am advocating a more conservative approach where all data that is read or written in a worker must come from that worker, rather than being generated by the master. If you are *still* experiencing these problems, then we know we have a real problem. Also, if this doesn't fix it, a small sample module which reproduces this issue would be great too! Be Well Anthony

I'll continue to try and find out more about when and how the hang occurs. I have to rebuild Python to allow the gdb pystack macro to work. If you have any suggestions for me, I'd love to hear them. Regards, Owen

On 7 October 2012 00:28, Anthony Scopatz scop...@gmail.com wrote: Hi Owen, How many pools do you have? Is this a random runtime failure? What kind of system is this one? Is there some particular function in Python that you are running? (It seems to be openFile(), but I can't be sure...) The error is definitely happening down in the H5open() routine. Now whether this is HDF5's fault or ours, I am not yet sure. Be Well Anthony
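To make the "open in the worker, not the master" advice concrete, here is a minimal sketch (not from the original thread; the file names, worker function, and toy computation are all invented): each worker opens and closes its own file, so no HDF5 handle ever crosses the fork from the master.

import multiprocessing
import tables

def worker(task):
    params, outpath = task
    # Each worker owns its file handle; nothing HDF5-related is inherited
    # from the master process across the fork.
    h5 = tables.openFile(outpath, mode='w')
    try:
        h5.createArray('/', 'result', [params['x'] * 2.0])
        h5.flush()
    finally:
        h5.close()

if __name__ == '__main__':
    tasks = [({'x': i}, 'result_%d.h5' % i) for i in range(4)]
    pool = multiprocessing.Pool(processes=2, maxtasksperchild=1)
    pool.map(worker, tasks)
    pool.close()
    pool.join()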
Re: [Pytables-users] Installation test failed: ImportError
Hello John, You probably installed globally and are trying to test locally. Either leave off the PYTHONPATH or try testing from a location other than the root PyTables directory. Be Well Anthony

On Mon, Oct 8, 2012 at 4:23 PM, Dickson, John Robert john_dick...@hms.harvard.edu wrote: Hello, I am trying to install PyTables, but when testing it with the command:

env PYTHONPATH=. python -c "import tables; tables.test()"

It returned the following:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "tables/__init__.py", line 30, in <module>
    from tables.utilsExtension import getPyTablesVersion, getHDF5Version
ImportError: No module named utilsExtension

I am using Mac OS X 10.8.2. Please let me know if you need any additional information. I would appreciate any suggestions on what the problem may be and how to correct it. Thanks, John
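Spelled out, that advice amounts to running the same test from anywhere other than the source checkout (the target directory below is an arbitrary choice), so Python picks up the installed package instead of the local tables/ source directory:

cd ~
python -c "import tables; tables.test()"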
Re: [Pytables-users] EArray
Hi Andre, You can use tuple addition to accomplish what you want:

(0,) + data.shape == (0, 256, 1, 2)

Be Well Anthony

On Sat, Oct 6, 2012 at 12:42 PM, Andre' Walker-Loud walksl...@gmail.com wrote: Hi All, I have a bunch of hdf5 files I am using to create one hdf5 file. Each individual file has many different pieces of data, and they are all the same shape in each file. I am using createEArray to make the large array in the final file. If the datasets in the individual h5 files are of shape (256,1,2), then I have to use

createEArray('/path/', 'name', tables.Float64Atom(), (0,256,1,2), expectedrows=len(data_files))

If the np array I have grabbed from an individual file to append to my EArray is defined as data, is there a way to use data.shape to create the shape of my EArray? In spirit, I want to do something like (0, data.shape), but this does not work. I have been scouring the numpy manual to see how to convert data.shape, (256,1,2), into (0,256,1,2), but failed to figure this out (if I don't know ahead of time the shape of data - in which case I can manually reshape). Thanks, Andre
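Put together, the tuple-addition trick lets the EArray shape track whatever shape the first file's array happens to have. A minimal sketch (the file and node names here are invented):

import numpy as np
import tables

data = np.zeros((256, 1, 2))      # stand-in for an array read from one source file
f = tables.openFile('combined.h5', 'w')
earr = f.createEArray('/', 'data', tables.Float64Atom(),
                      (0,) + data.shape, expectedrows=400)
earr.append(data[np.newaxis])     # append one (256, 1, 2) slice along axis 0
f.close()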
Re: [Pytables-users] PyTables hangs while opening file in worker process
Hello Owen, While you can use process pools to read from a file in parallel just fine, writing is another story completely. While HDF5 itself supports parallel writing through MPI, this comes at the high cost of compression no longer being available and a much more complicated code base. So for the time being, PyTables only supports the serial HDF5 library. Therefore, if you want to write to a file in parallel, you should adopt a strategy where one process is responsible for all of the writing, and all other processes send their data to this process instead of writing to the file directly. This is a very effective way of accomplishing basically what you need. In fact, we have an example that does just that [1]. (As a side note: HDF5 may soon be adding an API for exactly this pattern because it comes up so often.) So if I were you, I would look at [1] and adapt it to my use case. Be Well Anthony

1. https://github.com/PyTables/PyTables/blob/develop/examples/multiprocess_access_queues.py

On Fri, Oct 5, 2012 at 9:55 AM, Owen Mackwood owen.mackw...@bccn-berlin.de wrote: Hello, I'm using a multiprocessing.Pool to parallelize a set of tasks which record their results into separate hdf5 files. Occasionally (less than 2% of the time) the worker process will hang. According to gdb, the problem occurs while opening the hdf5 file, when it attempts to obtain the associated mutex. Here's part of the backtrace:

#0  0x7fb2ceaa716c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x7fb2be61c215 in H5TS_mutex_lock () from /usr/lib/libhdf5.so.6
#2  0x7fb2be32bff0 in H5open () from /usr/lib/libhdf5.so.6
#3  0x7fb2b96226a4 in __pyx_pf_6tables_13hdf5Extension_4File__g_new (__pyx_v_self=0x7fb2b04867d0, __pyx_args=<value optimized out>, __pyx_kwds=<value optimized out>) at tables/hdf5Extension.c:2820
#4  0x004abf62 in ext_do_call (f=0x4cb2430, throwflag=<value optimized out>) at Python/ceval.c:4331

Nothing else is trying to open this file, so can someone suggest why this is occurring? This is a very annoying problem as there is no way to recover from this error, and consequently the worker process is permanently occupied, which effectively removes one of my processors from the pool. Regards, Owen Mackwood
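In outline, the single-writer pattern looks something like the following. This is a condensed sketch, not the linked example itself; the names writer, worker, and SENTINEL are invented. Workers put results on a queue, and one dedicated process drains the queue into the file:

import multiprocessing
import tables

SENTINEL = None

def writer(q, path):
    # The only process that ever touches the HDF5 file.
    h5 = tables.openFile(path, mode='w')
    arr = h5.createEArray('/', 'results', tables.Float64Atom(), (0,))
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        arr.append(item)
    h5.close()

def worker(args):
    q, x = args
    q.put([x * 2.0])   # send data to the writer instead of writing directly

if __name__ == '__main__':
    # A Manager queue is used because plain multiprocessing.Queue objects
    # cannot be passed through Pool.map arguments.
    q = multiprocessing.Manager().Queue()
    w = multiprocessing.Process(target=writer, args=(q, 'out.h5'))
    w.start()
    pool = multiprocessing.Pool(4)
    pool.map(worker, [(q, i) for i in range(10)])
    pool.close()
    pool.join()
    q.put(SENTINEL)
    w.join()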
Re: [Pytables-users] Optimizing pytables for reading entire columns at a time
On Fri, Sep 28, 2012 at 2:46 AM, Francesc Alted fal...@pytables.org wrote: On 9/27/12 8:10 PM, Anthony Scopatz wrote: I think I remember seeing there was a performance limit with tables > 255 columns. I can't find a reference to that so it's possible I made it up. However, I was wondering if carrays had some limitation like that. Tables are a different data set. The issue with tables is that column metadata (names, etc.) needs to fit in the attribute space. The size of this space is statically limited to 64 kb. In my experience, this number is in the thousands of columns (not hundreds).

For the record, the PerformanceWarning issued by PyTables has nothing to do with the attribute space, but rather with the fact that putting too many columns in the same table means that you have to retrieve much more data even if you are retrieving only one single column. Also, internal I/O buffers have to be much larger, and compressors tend to work much less efficiently too.

On the other hand, CArrays don't have much of any column metadata. CArrays should scale to an infinite number of columns without any issue. Yeah, they should scale better, although saying they can reach infinite scalability is a bit audacious :) All the CArrays are datasets that have to be saved internally by HDF5, and that requires quite a few resources to keep track of them.

True, but I would argue that this is effectively infinite if you set your chunksize appropriately large. I have never run into an issue with HDF5 where the number of rows or columns on its own becomes too large for arrays. However, it is relatively easy to reach this limit with tables (both in PyTables and the HL interface). So maybe I should have said "effectively infinite" ;) -- Francesc Alted
Re: [Pytables-users] Optimizing pytables for reading entire columns at a time
On Thu, Sep 27, 2012 at 11:02 AM, Luke Lee durdenm...@gmail.com wrote: Are there any performance issues with relatively large carrays? For example, say I have a carray with 300,000 float64s in it. Is there some threshold where I could expect performance to degrade or anything?

Hello Luke, The breakdowns happen when you have too many chunks. However, you are well away from this threshold (which is ~20,000). I believe that PyTables will issue a warning or error when you reach this point anyways.

I think I remember seeing there was a performance limit with tables > 255 columns. I can't find a reference to that so it's possible I made it up. However, I was wondering if carrays had some limitation like that.

Tables are a different data set. The issue with tables is that column metadata (names, etc.) needs to fit in the attribute space. The size of this space is statically limited to 64 kb. In my experience, this number is in the thousands of columns (not hundreds). On the other hand, CArrays don't have much of any column metadata. CArrays should scale to an infinite number of columns without any issue. Be Well Anthony
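As a rough illustration of why CArrays sidestep the column-metadata limit (a sketch with an invented file name and an arbitrary shape): the "columns" of a CArray are just a dimension of one chunked dataset, so there are no per-column names to squeeze into the 64 kb attribute space.

import numpy as np
import tables

f = tables.openFile('wide.h5', 'w')
# One chunked array with 10,000 "columns"; no column names are stored at all.
ca = f.createCArray(f.root, 'data', tables.Float64Atom(), shape=(1000, 10000))
ca[0, :] = np.random.rand(10000)
f.close()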
Re: [Pytables-users] where() with start/stop args returning incorrect result set
Hi Derek, OK. That is very strange. I cannot reproduce this on any of my data. A quick couple of extra questions: 1) Does this still happen when you set start=0? 2) What is the chunksize of this data set (are you at a boundary)? 3) Could you send us the full table information, i.e. repr(table)? Be Well Anthony

On Tue, Sep 25, 2012 at 12:42 AM, Derek Shockey derek.shoc...@gmail.com wrote: I ran the tests. All 4988 passed. The information it output is:

PyTables version: 2.4.0
HDF5 version: 1.8.9
NumPy version: 1.6.2
Numexpr version: 2.0.1 (not using Intel's VML/MKL)
Zlib version: 1.2.5 (in Python interpreter)
LZO version: 2.06 (Aug 12 2011)
BZIP2 version: 1.0.6 (6-Sept-2010)
Blosc version: 1.1.3 (2010-11-16)
Cython version: 0.16
Python version: 2.7.3 (default, Jul 6 2012, 00:17:51) [GCC 4.2.1 Compatible Apple Clang 3.1 (tags/Apple/clang-318.0.58)]
Platform: darwin-x86_64
Byte-ordering: little
Detected cores: 4

-Derek

On Mon, Sep 24, 2012 at 9:09 PM, Anthony Scopatz scop...@gmail.com wrote: Hi Derek, Can you please run the following command and report back what you see?

python -c "import tables; tables.test()"

Be Well Anthony

On Mon, Sep 24, 2012 at 10:56 PM, Derek Shockey derek.shoc...@gmail.com wrote: Hello, I'm hoping someone can help me. When I specify start and stop values for calls to where() and readWhere(), it is returning blatantly incorrect results:

table.readWhere("id == 'ceec536a-394e-4dd7-a182-eea557f3bb93'", start=3257, stop=table.nrows)[0]['id']
'7f589d3e-a0e1-4882-b69b-0223a7de3801'
table.where("id == 'ceec536a-394e-4dd7-a182-eea557f3bb93'", start=3257, stop=table.nrows).next()['id']
'7f589d3e-a0e1-4882-b69b-0223a7de3801'

This happens with a sequential block of about 150 rows of data, and each time it seems to be 8 rows off (i.e. the row it returns is 8 rows ahead of the row it should be returning). If I remove the start and stop args, it behaves correctly. This seems to be a bug, unless I am misunderstanding something. I'm using Python 2.7.3, PyTables 2.4.0, and hdf5 1.8.9 on OS X 10.8.2. Any ideas? Thanks, Derek
Re: [Pytables-users] where() with start/stop args returning incorrect result set
Hello Derek, and devs, After playing around with your data, I am able to reproduce this error on my system. I am not sure exactly where the problem is, but I do know how to fix it! It turns out that this is an issue with the indexes not being properly in sync with the original table, OR the start and stop values are not being propagated properly down to the indexes. When I tried to reindex by calling table.reIndex(), this did not fix the issue. This makes me think that the problem is in propagating start, stop, and step all the way through correctly. I'll go ahead and make a ticket reflecting this. That said, the way to fix this in the short term is to do one of the following:

1) Only use start=0 and step=1 (I bet that other stop values work).
2) Don't use indexes. When I removed the indexes from the file using "ptrepack analysis.h5 analysis2.h5", everything worked fine.

Thanks a ton for reporting this! Be Well Anthony

On Tue, Sep 25, 2012 at 12:30 PM, Derek Shockey derek.shoc...@gmail.com wrote: Hi Anthony, It doesn't happen if I set start=0 or seemingly any number below 3257 (though I didn't try them *all*). I am new to PyTables and hdf5, so I'm not sure about the chunksize or if I'm at a boundary. I did however notice that the table's chunkshape is 203, and this happens for exactly 203 sequential records, so I doubt that's a coincidence. The table description is below. Thanks, Derek

/events (Table(5988,)) ''
  description := {
    "client_id": StringCol(itemsize=24, shape=(), dflt='', pos=0),
    "data_01": StringCol(itemsize=36, shape=(), dflt='', pos=1),
    "data_02": StringCol(itemsize=36, shape=(), dflt='', pos=2),
    "data_03": StringCol(itemsize=36, shape=(), dflt='', pos=3),
    "data_04": StringCol(itemsize=36, shape=(), dflt='', pos=4),
    "data_05": StringCol(itemsize=36, shape=(), dflt='', pos=5),
    "device_id": StringCol(itemsize=36, shape=(), dflt='', pos=6),
    "id": StringCol(itemsize=36, shape=(), dflt='', pos=7),
    "timestamp": Time64Col(shape=(), dflt=0.0, pos=8),
    "type": UInt16Col(shape=(), dflt=0, pos=9),
    "user_id": StringCol(itemsize=36, shape=(), dflt='', pos=10)}
  byteorder := 'little'
  chunkshape := (203,)
  autoIndex := True
  colindexes := {
    "timestamp": Index(9, full, shuffle, zlib(1)).is_CSI=True,
    "type": Index(9, full, shuffle, zlib(1)).is_CSI=True,
    "id": Index(9, full, shuffle, zlib(1)).is_CSI=True,
    "user_id": Index(9, full, shuffle, zlib(1)).is_CSI=True}

On Tue, Sep 25, 2012 at 9:32 AM, Anthony Scopatz scop...@gmail.com wrote: Hi Derek, OK. That is very strange. I cannot reproduce this on any of my data. A quick couple of extra questions: 1) Does this still happen when you set start=0? 2) What is the chunksize of this data set (are you at a boundary)? 3) Could you send us the full table information, i.e. repr(table)?

On Tue, Sep 25, 2012 at 12:42 AM, Derek Shockey derek.shoc...@gmail.com wrote: I ran the tests. All 4988 passed. The information it output is:

PyTables version: 2.4.0
HDF5 version: 1.8.9
NumPy version: 1.6.2
Numexpr version: 2.0.1 (not using Intel's VML/MKL)
Zlib version: 1.2.5 (in Python interpreter)
LZO version: 2.06 (Aug 12 2011)
BZIP2 version: 1.0.6 (6-Sept-2010)
Blosc version: 1.1.3 (2010-11-16)
Cython version: 0.16
Python version: 2.7.3 (default, Jul 6 2012, 00:17:51) [GCC 4.2.1 Compatible Apple Clang 3.1 (tags/Apple/clang-318.0.58)]
Platform: darwin-x86_64
Byte-ordering: little
Detected cores: 4

-Derek

On Mon, Sep 24, 2012 at 9:09 PM, Anthony Scopatz scop...@gmail.com wrote: Hi Derek, Can you please run the following command and report back what you see?
python -c "import tables; tables.test()"

Be Well Anthony

On Mon, Sep 24, 2012 at 10:56 PM, Derek Shockey derek.shoc...@gmail.com wrote: Hello, I'm hoping someone can help me. When I specify start and stop values for calls to where() and readWhere(), it is returning blatantly incorrect results:

table.readWhere("id == 'ceec536a-394e-4dd7-a182-eea557f3bb93'", start=3257, stop=table.nrows)[0]['id']
'7f589d3e-a0e1-4882-b69b-0223a7de3801'
table.where("id == 'ceec536a-394e-4dd7-a182-eea557f3bb93'", start=3257, stop=table.nrows).next()['id']
'7f589d3e-a0e1-4882-b69b-0223a7de3801'

This happens with a sequential block of about 150 rows of data, and each time it seems to be 8 rows off (i.e. the row it returns is 8 rows ahead of the row it should be returning). If I remove the start and stop args, it behaves correctly. This seems to be a bug, unless I am misunderstanding something. I'm using Python 2.7.3, PyTables 2.4.0, and hdf5 1.8.9 on OS X 10.8.2. Any ideas? Thanks, Derek
Re: [Pytables-users] where() with start/stop args returning incorrect result set
Hi Derek, Can you please run the following command and report back what you see?

python -c "import tables; tables.test()"

Be Well Anthony

On Mon, Sep 24, 2012 at 10:56 PM, Derek Shockey derek.shoc...@gmail.com wrote: Hello, I'm hoping someone can help me. When I specify start and stop values for calls to where() and readWhere(), it is returning blatantly incorrect results:

table.readWhere("id == 'ceec536a-394e-4dd7-a182-eea557f3bb93'", start=3257, stop=table.nrows)[0]['id']
'7f589d3e-a0e1-4882-b69b-0223a7de3801'
table.where("id == 'ceec536a-394e-4dd7-a182-eea557f3bb93'", start=3257, stop=table.nrows).next()['id']
'7f589d3e-a0e1-4882-b69b-0223a7de3801'

This happens with a sequential block of about 150 rows of data, and each time it seems to be 8 rows off (i.e. the row it returns is 8 rows ahead of the row it should be returning). If I remove the start and stop args, it behaves correctly. This seems to be a bug, unless I am misunderstanding something. I'm using Python 2.7.3, PyTables 2.4.0, and hdf5 1.8.9 on OS X 10.8.2. Any ideas? Thanks, Derek
Re: [Pytables-users] where() with start/stop args returning incorrect result set
PS: When I do this on Linux, all 5077 tests pass for me.

On Mon, Sep 24, 2012 at 11:09 PM, Anthony Scopatz scop...@gmail.com wrote: Hi Derek, Can you please run the following command and report back what you see?

python -c "import tables; tables.test()"

Be Well Anthony

On Mon, Sep 24, 2012 at 10:56 PM, Derek Shockey derek.shoc...@gmail.com wrote: Hello, I'm hoping someone can help me. When I specify start and stop values for calls to where() and readWhere(), it is returning blatantly incorrect results:

table.readWhere("id == 'ceec536a-394e-4dd7-a182-eea557f3bb93'", start=3257, stop=table.nrows)[0]['id']
'7f589d3e-a0e1-4882-b69b-0223a7de3801'
table.where("id == 'ceec536a-394e-4dd7-a182-eea557f3bb93'", start=3257, stop=table.nrows).next()['id']
'7f589d3e-a0e1-4882-b69b-0223a7de3801'

This happens with a sequential block of about 150 rows of data, and each time it seems to be 8 rows off (i.e. the row it returns is 8 rows ahead of the row it should be returning). If I remove the start and stop args, it behaves correctly. This seems to be a bug, unless I am misunderstanding something. I'm using Python 2.7.3, PyTables 2.4.0, and hdf5 1.8.9 on OS X 10.8.2. Any ideas? Thanks, Derek
Re: [Pytables-users] Optimizing pytables for reading entire columns at a time
On Fri, Sep 21, 2012 at 10:49 AM, Luke Lee durdenm...@gmail.com wrote: Hi again, I haven't been getting the updates via email, so I'm attempting to post again to respond. Thanks everyone for the suggestions. I have a few questions:

1. What is the benefit of using the stand-alone carray project (https://github.com/FrancescAlted/carray) vs. PyTables CArray?

Hello Luke, carrays are in-memory, not on disk.

2. I realized my code base never uses the query functionality of a Table. So, I changed all my columns to be just PyTables CArray objects instead. They are all sitting at the top of the hierarchy, just below root. Is this a good idea? I see a big speed increase from this, obviously, because now everything is stored contiguously. However, are there any downsides to doing this? I suppose I could also use EArray, but we are never actually changing the data once it is stored in HDF5.

If it works for you, then great!

3. Is compression automatically happening with the CArray? I know the documentation says that compression is supported, but what do I need to do to enable it? Maybe it's already happening and this is contributing to my big speed improvement.

For compression to be enabled, you need to define the appropriate filter [1] on either the node or the file.

4. I would certainly love to take a look at contributing something like this in my free time. I don't have a whole lot at this time, so the changes could take a while. I'm sure I need to learn a lot more about the codebase before really giving it a try. I'm going to take a look at this though, thanks for the suggestion!

No problem ;)

5. How do I subscribe to the dev mailing list? I only see announcements and users.

Here is the dev list site: https://groups.google.com/forum/?fromgroups#!forum/pytables-dev

6. Any idea why I'm not getting the emails from the list? I signed up 2 days ago and didn't get any of your replies via email.

We have been having problems with this list. I think it might be time to transition... Be Well Anthony

1. http://pytables.github.com/usersguide/libref/helper_classes.html?highlight=filter#tables.Filters
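For reference, a minimal sketch of enabling compression via the Filters class [1] when creating a node (the file name, node name, and filter settings below are illustrative choices, not from the thread):

import tables

f = tables.openFile('compressed.h5', 'w')
# Compression is opt-in: pass a Filters instance when creating the node.
filters = tables.Filters(complevel=5, complib='zlib', shuffle=True)
ca = f.createCArray(f.root, 'data', tables.Float64Atom(),
                    shape=(1000, 1000), filters=filters)
f.close()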
Re: [Pytables-users] Optimizing pytables for reading entire columns at a time
On Fri, Sep 21, 2012 at 4:55 PM, Francesc Alted fal...@gmail.com wrote: On 9/21/12 10:07 PM, Anthony Scopatz wrote: On Fri, Sep 21, 2012 at 10:49 AM, Luke Lee durdenm...@gmail.com wrote: Hi again, I haven't been getting the updates via email, so I'm attempting to post again to respond. Thanks everyone for the suggestions. I have a few questions: 1. What is the benefit of using the stand-alone carray project (https://github.com/FrancescAlted/carray) vs. PyTables CArray? Hello Luke, carrays are in-memory, not on disk.

Well, that was true until version 0.5, when disk persistency was introduced. Now carray supports both in-memory and on-disk objects, and they work in exactly the same way.

Sorry for not being exactly up to date ;) -- Francesc Alted
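If memory serves (worth checking against the carray documentation for the version you have), the on-disk mode is selected through a rootdir argument to the constructor; the directory name below is arbitrary:

import numpy as np
import carray as ca

a = ca.carray(np.arange(1e7))                          # in-memory compressed array
b = ca.carray(np.arange(1e7), rootdir='mydata.carray') # on-disk (carray >= 0.5)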
Re: [Pytables-users] [ANN] Blosc 1.1.4 released
Great! Thanks to you both. On Sun, Sep 16, 2012 at 11:42 AM, Antonio Valentino antonio.valent...@tiscali.it wrote: Hi Francesc, thank you. Just pushed updates into pytables. Ciao. On 16/09/2012 12:07, Francesc Alted wrote:

=== Announcing Blosc 1.1.4: A blocking, shuffling and lossless compression library ===

What is new?

- Redefinition of the BLOSC_MAX_BUFFERSIZE constant as (INT_MAX - BLOSC_MAX_OVERHEAD) instead of just INT_MAX. This prevents producing outputs larger than INT_MAX, which is not supported.
- The `exit()` call has been replaced by a ``return -1`` in blosc_compress() when checking for buffer sizes. Now programs will not just exit when the buffer is too large, but return a negative code.
- Improvements in explicit casts. Blosc compiles without warnings (with GCC) now.
- Lots of improvements in docs, in particular a nice ascii-art diagram of the Blosc format (Valentin Haenel).
- [HDF5 filter] Adapted the HDF5 filter to use HDF5 1.8 by default (Antonio Valentino).

For more info, please see the release notes in: https://github.com/FrancescAlted/blosc/wiki/Release-notes -- Antonio Valentino
Re: [Pytables-users] a question about data corruption.
Hello Gelin, Unless you were using the undo/redo mechanism, I don't think that there is. You'll probably have to fix the file manually, using PyTables as usual along with the provided tools like ptrepack. Be Well Anthony

On Sun, Sep 16, 2012 at 12:22 PM, gelin yan dynami...@gmail.com wrote: Hi All, I have a question about data corruption. Is it possible to repair a data file after a situation like a power outage or a process crash? I have poked around the manual; however, I failed to find anything about how to repair corrupted data if that happens. Thanks. Regards, gelin yan
[Pytables-users] Fwd: A sad day for our community. John Hunter: 1968-2012.
Passing the bad news along, in case you hadn't heard. -- Forwarded message -- From: Fernando Perez fperez@gmail.com Date: Wed, Aug 29, 2012 at 9:32 PM Subject: A sad day for our community. John Hunter: 1968-2012. To: matplotlib development list matplotlib-de...@lists.sourceforge.net, Matplotlib Users matplotlib-us...@lists.sourceforge.net, IPython Development list ipython-...@scipy.org, IPython User list ipython-u...@scipy.org, Discussion of Numerical Python numpy-discuss...@scipy.org, SciPy Developers List scipy-...@scipy.org, SciPy Users List scipy-u...@scipy.org, numfo...@googlegroups.com, pyd...@googlegroups.com, scikit-learn-general scikit-learn-gene...@lists.sourceforge.net, networkx-discuss networkx-disc...@googlegroups.com, sage-devel sage-de...@googlegroups.com, pystatsmod...@googlegroups.com, enthought-dev enthought-...@mail.enthought.com, yt-...@lists.spacepope.org Dear friends and colleagues, I am terribly saddened to report that yesterday, August 28 2012 at 10am, John D. Hunter died from complications arising from cancer treatment at the University of Chicago hospital, after a brief but intense battle with this terrible illness. John is survived by his wife Miriam, his three daughters Rahel, Ava and Clara, his sisters Layne and Mary, and his mother Sarah. Note: If you decide not to read any further (I know this is a long message), please go to this page for some important information about how you can thank John for everything he gave in a decade of generous contributions to the Python and scientific communities: http://numfocus.org/johnhunter. Just a few weeks ago, John delivered his keynote address at the SciPy 2012 conference in Austin centered around the evolution of matplotlib: http://www.youtube.com/watch?v=e3lTby5RI54 but tragically, shortly after his return home he was diagnosed with advanced colon cancer. This diagnosis was a terrible discovery to us all, but John took it with his usual combination of calm and resolve, and initiated treatment procedures. Unfortunately, the first round of chemotherapy treatments led to severe complications that sent him to the intensive care unit, and despite the best efforts of the University of Chicago medical center staff, he never fully recovered from these. Yesterday morning, he died peacefully at the hospital with his loved ones at his bedside. John fought with grace and courage, enduring every necessary procedure with a smile on his face and a kind word for all of his caretakers and becoming a loved patient of the many teams that ended up involved with his case. This was no surprise for those of us who knew him, but he clearly left a deep and lasting mark even amongst staff hardened by the rigors of oncology floors and intensive care units. I don't need to explain to this community the impact of John's work, but allow me to briefly recap, in case this is read by some who don't know the whole story. In 2002, John was a postdoc at the University of Chicago hospital working on the analysis of epilepsy seizure data in children. Frustrated with the state of the existing proprietary solutions for this class of problems, he started using Python for his work, back when the scientific Python ecosystem was much, much smaller than it is today and this could have been seen as a crazy risk. Furthermore, he found that there were many half-baked solutions for data visualization in Python at the time, but none that truly met his needs. 
Undeterred, he went on to create matplotlib (http://matplotlib.org) and thus overcome one of the key obstacles for Python to become the best solution for open source scientific and technical computing. Matplotlib is both an amazing technical achievement and a shining example of open source community building, as John not only created its backbone but also fostered the development of a very strong development team, ensuring that the talent of many others could also contribute to this project. The value and importance of this are now painfully clear: despite having lost John, matplotlib continues to thrive thanks to the leadership of Michael Droettboom, the support of Perry Greenfield at the Hubble Telescope Science Institute, and the daily work of the rest of the team. I want to thank Perry and Michael for putting their resources and talent once more behind matplotlib, securing the future of the project. It is difficult to overstate the value and importance of matplotlib, and therefore of John's contributions (which do not end in matplotlib, by the way; but a biography will have to wait for another day...). Python has become a major force in the technical and scientific computing world, leading the open source offers and challenging expensive proprietary platforms with large teams and millions of dollars of resources behind them. But this would be impossible without a solid data visualization tool that would allow both ad-hoc data exploration and the production of complex, fine-tuned
Re: [Pytables-users] import error
Hi John, This is probably a path issue. You likely have both PyTables installed and a 'tables' source sub-directory wherever you are running this from. For whatever reason, it is picking up the source version rather than the installed version. It is either that, or you simply don't have it installed correctly. The file it is missing is one which gets compiled when you run:

python setup.py install

Be Well Anthony

On Sat, Aug 18, 2012 at 2:28 AM, John cloverev...@yahoo.com wrote: I have the following import error when I try to import PyTables:

ImportError                               Traceback (most recent call last)
<ipython-input-2-389ecae14f10> in <module>()
----> 1 import tables

/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/tables/__init__.py in <module>()
     28
     29 # Necessary imports to get versions stored on the Pyrex extension
---> 30 from tables.utilsExtension import getPyTablesVersion, getHDF5Version
     31
     32

ImportError: dlopen(/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/tables/utilsExtension.so, 2): Symbol not found: _H5E_CALLBACK_g
  Referenced from: /Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/tables/utilsExtension.so
  Expected in: flat namespace
  in /Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/tables/utilsExtension.so

Anyone know what's wrong with it?
Re: [Pytables-users] Searching for nan values in a table...
So this is probably a numexpr issue. There doesn't seem to be an isnan() implementation [1]. I would bring it up with them. Sorry we can't do more. Be Well Anthony

1. http://code.google.com/p/numexpr/wiki/UsersGuide

On Thu, Aug 16, 2012 at 12:57 PM, Aquil H. Abdullah aquil.abdul...@gmail.com wrote: I get the same error if I use:

bad_vols = tbl.getWhereList('volume == nan')
bad_vols = tbl.getWhereList('volume == NaN')

-- Aquil H. Abdullah "I never think of the future. It comes soon enough" - Albert Einstein

On Thursday, August 16, 2012 at 1:52 PM, Anthony Scopatz wrote: Have you tried simply doing 'volume == nan' or 'volume == NaN'?

On Thu, Aug 16, 2012 at 12:49 PM, Aquil H. Abdullah aquil.abdul...@gmail.com wrote: Hello All, I am trying to determine if there are any NaN values in one of my tables, but when I queried for numpy.nan I received a NameError. Can anyone tell me the best way to search for a NaN value? Thanks!

In [7]: type(np.nan)
Out[7]: float

In [8]: bad_vols = tbl.getWhereList('volume == %f' % np.nan)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
/Users/aquilabdullah/<ipython-input-8-2c1b183b0581> in <module>()
----> 1 bad_vols = tbl.getWhereList('volume == %f' % np.nan)

/Library/Python/2.7/site-packages/tables/table.pyc in getWhereList(self, condition, condvars, sort, start, stop, step)
   1540
   1541         coords = [ p.nrow for p in
-> 1542                    self._where(condition, condvars, start, stop, step) ]
   1543         coords = numpy.array(coords, dtype=SizeType)
   1544         # Reset the conditions

/Library/Python/2.7/site-packages/tables/table.pyc in _where(self, condition, condvars, start, stop, step)
   1434
   1435         # Compile the condition and extract usable index conditions.
-> 1436         condvars = self._requiredExprVars(condition, condvars, depth=3)
   1437         compiled = self._compileCondition(condition, condvars)
   1438

/Library/Python/2.7/site-packages/tables/table.pyc in _requiredExprVars(self, expression, uservars, depth)
   1207                 val = user_globals[var]
   1208             else:
-> 1209                 raise NameError("name ``%s`` is not defined" % var)
   1210
   1211         # Check the value.

NameError: name ``nan`` is not defined

-- Aquil H. Abdullah "I never think of the future. It comes soon enough" - Albert Einstein
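One workaround worth noting (an editor's suggestion, not from the thread): NaN is the only floating-point value that compares unequal to itself, and numexpr evaluates plain comparisons fine, so a self-inequality test can stand in for isnan():

# Rows where 'volume' is NaN: NaN != NaN is the only case where this is True.
bad_vols = tbl.getWhereList('volume != volume')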
Re: [Pytables-users] Numpy views stored as attributes
Hello Ask, I bet this is because you are storing these as attrs, which will default back to some pickled Python representation. Can you check if this works as expected when saving as actual arrays? Something like:

import numpy as np
import tables

with tables.openFile("test.h5", "w") as f:
    A = np.array([[0, 1], [2, 3]])
    a = f.createArray("/", "a", A)
    b = f.createArray("/", "b", A.T.copy())
    c = f.createArray("/", "c", A.T)
    assert np.all(a == A)
    assert np.all(b == A.T)
    assert np.all(c == A)    # AssertionError!
    assert np.all(c == A.T)

Be Well Anthony

On Wed, Aug 15, 2012 at 4:13 AM, Ask F. Jakobsen a...@linet.dk wrote: Hey all, When I store a view of a numpy array as an attribute, it appears to be stored as the array that owns the data. Is this a bug? I find it confusing that the user has to check whether the numpy array owns the data, or must always remember to do a copy() before storing a numpy array as an attribute. Below is some sample code that highlights the problem. Best regards, Ask

import numpy as np
import tables

with tables.openFile("test.h5", "w") as f:
    x = f.createArray("/", "test", [0])
    A = np.array([[0, 1], [2, 3]])
    x.attrs['a'] = A
    x.attrs['b'] = A.T.copy()
    x.attrs['c'] = A.T
    assert np.all(x.attrs['a'] == A)
    assert np.all(x.attrs['b'] == A.T)
    assert np.all(x.attrs['c'] == A)
    assert np.all(x.attrs['c'] == A.T)    # AssertionError!
Re: [Pytables-users] In-kernal for subset?
On Wed, Aug 15, 2012 at 12:33 PM, Adam Dershowitz adershow...@exponent.com wrote: I am trying to find all cases where a value transitions above a threshold. So, my code first does a getWhereList to find values that are above the threshold, then it uses that list to find immediately prior values that are below. The code is working, but the second part, searching through just a smaller subset, is much slower (the first search is on the order of 1 second, while the second is a minute). Is there any way to get this second part of the search in-kernel? Or any more general way to do a search for values above a threshold where the prior value is below? Essentially, what I am looking for is a way to speed up that second search for all rows in a previously defined list, where a condition is applied to the table. My table is just seconds and values, in chronological order. Here is the code that I am using now:

h5data = tb.openFile("AllData.h5", "r")
table1 = h5data.root.table1
# Find all values above threshold:
thelist = table1.getWhereList('Value > 150')
# From the above list, find all values where the immediately prior value is below:
transition = []
for i in thelist:
    if (table1[i-1]['Value'] < 150) and (i != 0):
        transition.append(i)

Thanks,

Hey Adam, Sorry for taking a while to respond. Assuming you don't mind one of these being <= or >=, you don't really need the second loop, with a little index arithmetic:

import numpy as np

inds = np.array(thelist)
dinds = inds[1:] - inds[:-1]
# keep the indices (not the differences) wherever the gap to the previous hit is more than one row
transition = inds[1:][1 < dinds]

This should get you an array of all of the transition indices, since wherever the difference in indices is greater than 1, the Value must have dropped below the threshold and then returned back up. Be Well Anthony
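A quick check of the index-arithmetic trick on made-up numbers (nothing below is from the thread). Note one caveat: the very first above-threshold row is never captured by the difference test, so it needs a separate check if it matters:

import numpy as np

values = np.array([100, 160, 170, 120, 155, 130, 180])
inds = np.where(values > 150)[0]   # plays the role of getWhereList('Value > 150')
dinds = inds[1:] - inds[:-1]
transition = inds[1:][1 < dinds]   # -> array([4, 6]); row 1 would need a separate check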
Re: [Pytables-users] openFile strategy question
Hi Andre, I am a little confused. Let me verify: you have 400 hdf5 files (re and im) buried in a unix directory tree, and you want to make a single file which concatenates this data. Is this right? Be Well Anthony

On Wed, Aug 15, 2012 at 6:52 PM, Andre' Walker-Loud walksl...@gmail.com wrote: Hi All, Just a strategy question. I have many hdf5 files containing data for different measurements of the same quantities. My directory tree looks like:

top description [group]
  sub description [group]
    avg [group]
      re [numpy array, shape = (96,1,2)]
      im [numpy array, shape = (96,1,2)] - only exists for a known subset of data files

I have ~400 of these files. What I want to do is create a single file which collects all of these files with exactly the same directory structure, except at the very bottom:

re [numpy array, shape = (400,96,1,2)]

The simplest thing I came up with to do this is loop over the two levels of descriptive group structures, and build the numpy array for the final set this way. Basic loop structure:

final_file = tables.openFile('all_data.h5', 'a')
for d1 in top_description:
    final_file.createGroup(final_file.root, d1)
    for d2 in sub_description:
        final_file.createGroup('/' + d1, d2)
        data_re = np.zeros([400, 96, 1, 2])
        for i, fname in enumerate(hdf5_files):
            tmp = tables.openFile(fname)
            data_re[i] = np.array(tmp.getNode('/%s/%s/avg/re' % (d1, d2)))
            tmp.close()
        final_file.createArray('/' + d1 + '/' + d2, 're', data_re)

But this involves opening and closing the individual 400 hdf5 files many times. There must be a smarter algorithmic way to do this - or perhaps built-in pytables tools. Any advice is appreciated. Andre
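One way to avoid reopening the 400 files repeatedly (an editor's sketch, not a reply from the thread; it reuses top_description, sub_description, and hdf5_files from the message above) is to invert the loops: preallocate every target array, then open each source file exactly once and scatter its contents. The trade-off is that all target arrays live in memory at once.

import numpy as np
import tables

final_file = tables.openFile('all_data.h5', 'a')
# Preallocate one target array per (d1, d2) pair.
data_re = {}
for d1 in top_description:
    final_file.createGroup(final_file.root, d1)
    for d2 in sub_description:
        final_file.createGroup('/' + d1, d2)
        data_re[d1, d2] = np.zeros([len(hdf5_files), 96, 1, 2])

# Each source file is opened exactly once.
for i, fname in enumerate(hdf5_files):
    tmp = tables.openFile(fname)
    for (d1, d2), arr in data_re.items():
        arr[i] = tmp.getNode('/%s/%s/avg/re' % (d1, d2))[:]
    tmp.close()

for (d1, d2), arr in data_re.items():
    final_file.createArray('/' + d1 + '/' + d2, 're', arr)
final_file.close()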
Re: [Pytables-users] openFile seems to hang
On Tue, Aug 7, 2012 at 11:50 AM, Daniel Wheeler daniel.wheel...@gmail.com wrote: On Tue, Aug 7, 2012 at 12:46 PM, Anthony Scopatz scop...@gmail.com wrote: On Tue, Aug 7, 2012 at 11:43 AM, Daniel Wheeler daniel.wheel...@gmail.com wrote: They should know what to do and how to fix it. Maybe MPI init issues with either pytrilinos or mpi4py, as a wild guess. Both are imported by fipy. Your guess is as good as, or much better than, mine. Thanks for your questions and answers. BTW, if it turns out that you need us to change something in PyTables to play nicely with fipy, please let us know! -- Daniel Wheeler
Re: [Pytables-users] openFile seems to hang
Hi Daniel, Does this always happen when opening files? Or just occasionally? Be Well Anthony

On Mon, Aug 6, 2012 at 11:08 AM, Daniel Wheeler daniel.wheel...@gmail.com wrote: The following just seems to hang indefinitely.

In [1]: import tables
In [2]: f = tables.openFile('tmp.h5', mode='a')

The tests hang as well.

In [3]: tables.test()
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
PyTables version: 2.4.0
HDF5 version: 1.8.4-patch1
NumPy version: 1.6.1
Numexpr version: 2.0.1 (not using Intel's VML/MKL)
Zlib version: 1.2.3.4 (in Python interpreter)
BZIP2 version: 1.0.5 (10-Dec-2007)
Blosc version: 1.1.3 (2010-11-16)
Cython version: 0.15.1
Python version: 2.6.6 (r266:84292, Dec 26 2010, 22:31:48) [GCC 4.4.5]
Platform: linux2-x86_64
Byte-ordering: little
Detected cores: 4
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Performing only a light (yet comprehensive) subset of the test suite. If you want a more complete test, try passing the --heavy flag to this script (or set the 'heavy' parameter in case you are using the tables.test() call). The whole suite will take more than 4 hours to complete on a relatively modern CPU and around 512 MB of main memory.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
/users/wd15/.virtualenvs/trunk/lib/python2.6/site-packages/tables/filters.py:253: FiltersWarning: compression library ``lzo`` is not available; using ``zlib`` instead % (complib, default_complib), FiltersWarning )

Any ideas are greatly appreciated. Thanks. -- Daniel Wheeler
Re: [Pytables-users] advice on data representation
I030_070 = I030_070_DESC()
I030_170 = I030_170_DESC()
I030_100 = I030_100_DESC()
I030_180 = I030_180_DESC()
I030_181 = I030_181_DESC()
I030_060 = I030_060_DESC()
I030_150 = I030_150_DESC()
I030_140 = I030_140_DESC()
I030_340 = I030_340_DESC()
I030_400 = I030_400_DESC()
...
I030_210 = I030_210_DESC()
I030_120 = I030_120_DESC()
I030_050 = I030_050_DESC()
I030_270 = I030_270_DESC()
I030_370 = I030_370_DESC()

From: Anthony Scopatz [mailto:scop...@gmail.com]
Sent: 12 July 2012 00:02
To: Discussion list for PyTables
Subject: Re: [Pytables-users] advice on using PyTables

Hello Benjamin, Not knowing too much about the ASTERIX format, other than what you said and what is in the links, I would say that this is a good fit for HDF5 and PyTables. PyTables will certainly help you read in the data and manipulate it. However, before you abandon hachoir completely, I will say it is a lot easier to write hdf5 files in PyTables than to use the HDF5 C API. If hachoir is too slow, have you tried profiling the code to see what is taking up the most time? Maybe you could just rewrite those parts in C? Have you looked into Cythonizing it? Also, you don't seem to be using numpy to read in the data... (there are some tricks given ASTERIX here, but not insurmountable). I ask the above just so you don't have to completely rewrite everything. You are correct, though, that pure Python is probably not sufficient. Feel free to ask more questions here. Be Well Anthony

On Wed, Jul 11, 2012 at 6:52 AM, benjamin.bertr...@lfv.se wrote: Hi, I'm working with Air Traffic Management and would like to perform checks / compute statistics on ASTERIX data. ASTERIX is an ATM Surveillance Data Binary Messaging Format (http://www.eurocontrol.int/asterix/public/standard_page/overview.html). The data consist of a concatenation of consecutive data blocks. Each data block consists of data category + length + records. Each record is of variable length and consists of several data items (that are well defined for each category). Some data items might be present or not, depending on a field specification (bitfield). I started to write a parser using hachoir (https://bitbucket.org/haypo/hachoir/overview), a pure Python library. But the parsing was really too slow and took a lot of memory. That's not really usable. From what I read, PyTables could really help to manipulate and analyze the data. So I've been thinking about writing a tool (probably in C) to convert my ASTERIX format to HDF5. Before I start, I'd like confirmation that this seems like a suitable application for PyTables. Is there another approach than writing a conversion tool to HDF5? Thanks in advance, Benjamin