Re: [Pytables-users] adding column index hides data(?)
On Nov 17, 2011, at 10:35 PM, Alan Marchiori wrote:

Hello,

I am attempting to use PyTables (v2.3.1) to store timestamped data, and things were going well until I added a column index. While the column is indexed, no data is returned from a table.where call! This behavior is demonstrated by the following test code:

---begin test.py---
import random

import tables


class Descr(tables.IsDescription):
    when = tables.Time64Col(pos=1)
    value = tables.Float32Col(pos=2)

h5f = tables.openFile('/tmp/tmp.h5', 'w')
tbl = h5f.createTable('/', 'test', Descr)
tbl.cols.when.createIndex(_verbose=True)

t = 1321031471.0  # 11/11/11 11:11:11
tbl.append([(t + i, random.random()) for i in range(1000)])
tbl.flush()

def query(s):
    print 'is_index =', tbl.cols.when.is_indexed
    print [(row['when'], row['value']) for row in tbl.where(s)]
    print tbl.readWhere(s)

wherestr = '(when >= %d) & (when < %d)' % (t, t + 5)
query(wherestr)
tbl.cols.when.removeIndex()
query(wherestr)
h5f.close()
---end test.py---

This creates the table for storing time/value pairs, inserts some synthetic data, and then checks to see if there is data in the table. When the table is created, an index is added to the 'when' column. The first query returns no data, which is incorrect. Then the column index is removed (via removeIndex) and the query is repeated; this time 5 results are returned, as expected. The data is clearly there; the index is somehow breaking the where logic.
Here is the output I get:

---begin output---
is_index = True
[]
[]
is_index = False
[(1321031471.0, 0.6449417471885681), (1321031472.0, 0.7889317274093628), (1321031473.0, 0.609708845615387), (1321031474.0, 0.9120397567749023), (1321031475.0, 0.2386845201253891)]
[(1321031471.0, 0.6449417471885681) (1321031472.0, 0.7889317274093628)
 (1321031473.0, 0.609708845615387) (1321031474.0, 0.9120397567749023)
 (1321031475.0, 0.2386845201253891)]
---end output---

Creating the index after the data has been inserted produces the same behavior (no data is returned while the index exists). Any suggestions would be greatly appreciated.

I have reproduced this with a number of different index configurations. If I change the column type to Float64, then the index works as expected.

BEFORE (Time64Col):
Initial index (verbose):     has_index= True   use_index= frozenset(['Awhen'])  where= 0  readWhere= 0
remove index:                has_index= False  use_index= frozenset([])         where= 5  readWhere= 5
re-add index (non-verbose):  has_index= True   use_index= frozenset(['Awhen'])  where= 0  readWhere= 0
remove again:                has_index= False  use_index= frozenset([])         where= 5  readWhere= 5
re-add index (with flush):   has_index= True   use_index= frozenset(['Awhen'])  where= 0  readWhere= 0
re-add index (full):         has_index= True   use_index= frozenset(['Awhen'])  where= 0  readWhere= 0
re-add index (ultralight):   has_index= True   use_index= frozenset(['Awhen'])  where= 0  readWhere= 0
re-add index (o=0):          has_index= True   use_index= frozenset(['Awhen'])  where= 0  readWhere= 0
re-add index (o=9):          has_index= True   use_index= frozenset(['Awhen'])  where= 0  readWhere= 0
re-index:                    has_index= True   use_index= frozenset(['Awhen'])  where= 0  readWhere= 0
also index value:            has_index= True   use_index= frozenset(['Awhen'])  where= 0  readWhere= 0

AFTER (Float64Col):
Initial index (verbose):     has_index= True   use_index= frozenset(['Awhen'])  where= 5  readWhere= 5
remove index:                has_index= False  use_index= frozenset([])         where= 5  readWhere= 5
re-add index (non-verbose):  has_index= True   use_index= frozenset(['Awhen'])  where= 5  readWhere= 5
remove again:                has_index= False  use_index= frozenset([])         where= 5  readWhere= 5
re-add index (with flush):   has_index= True   use_index= frozenset(['Awhen'])  where= 5  readWhere= 5
re-add index (full):         has_index= True   use_index= frozenset(['Awhen'])  where= 5  readWhere= 5
re-add index (ultralight):   has_index= True   use_index= frozenset(['Awhen'])  where= 5  readWhere= 5
re-add index (o=0):          has_index= True   use_index= frozenset(['Awhen'])  where= 5  readWhere= 5
re-add index (o=9):          has_index= True   use_index=
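Independently of PyTables, the selection the test intends (the five timestamps in [t, t+5)) can be sanity-checked with a plain NumPy boolean mask. This is a minimal sketch of the query semantics only; it does not exercise the PyTables index code path where the bug lives:

```python
import numpy as np

# Reproduce the synthetic 'when' column: t, t+1, ..., t+999
t = 1321031471.0  # 11/11/11 11:11:11
when = t + np.arange(1000, dtype=np.float64)

# Same condition as the wherestr: (when >= t) & (when < t + 5)
mask = (when >= t) & (when < t + 5)
matches = when[mask]

print(len(matches))  # 5 rows: t, t+1, t+2, t+3, t+4
```

This confirms that five rows is the correct answer, which is what the unindexed query returns.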
Re: [Pytables-users] new user: advice on how to structure files
On Thu, Nov 17, 2011 at 6:20 PM, Andre' Walker-Loud <walksl...@gmail.com> wrote:

> Hi All,
>
> I just stumbled upon PyTables, and have been playing around with converting my data files into hdf5 using pytables. I am wondering about strategies to create data files. I have created a file with the following group structure:
>
>     root / corr_name / src_type / snk_type / config / data
>
> where data = a 1 x 48 array of floats, and config = a set which is to be averaged over; in this particular case 1000, 1010, ..., 20100 (1911 in all). The other three groups just collect metadata describing the data below, and provide a natural way to build matrices of data files, allowing the user (my collaborators) to pick and choose various combinations of srcs and snks (instead of taking them all).

This seems pretty reasonable. You could also try to rearrange your data to have a shallower hierarchy and have everything stored in Tables with src_type, corr_name, etc. columns that you then search through. The reason for doing this is to avoid the overhead of the hierarchy (not only the space on disk but also the speed of traversal). But what you have definitely works.

> This structure arises naturally (to me) from the type of data files I am storing/analyzing, but I imagine there are better ways to build the file. (Also, when I make my file this way, it is only 105 MB, but it causes HDFViewer to fail to open with an OutOfMemory error.) I would appreciate any advice on how to do this better.

I use ViTables to view much larger files than that. I would recommend checking it out.

Be Well
Anthony

> Below is the relevant python script which creates my file.
>
> Thanks,
> Andre
>
> import os
>
> import tables as pyt
> import personal_calls_to_numpy as pc
>
> corrs = ['name1', 'name2', ...]
> dirs = []
> for no in range(1000, 20101, 10):
>     dirs.append('c' + str(no))
>     #dirs.append(str(no))  # this gives a NaturalNaming error
>
> f = pyt.openFile('nplqcd_iso_old.h5', 'w')
> root = f.root
> for corr in corrs:
>     cg = f.createGroup(root, corr.split('_')[-1])
>     src = f.createGroup(cg, 'Src_GaussSmeared')
>     for s in ['S', 'P']:
>         if os.path.exists('concatonated/' + corr + '_' + tag + '_' + s + '.dat'):
>             print('adding ' + corr + '_' + tag + '_' + s + '.dat')
>             h, c = pc.read_corr('concatonated/' + corr + '_' + tag + '_' + s + '.dat')
>             Ncfg = int(h[0]); NT = int(h[1])
>             snk = f.createGroup(src, 'Snk_' + s)
>             #data = f.createArray(snk, 'real', c)
>             for cfg in range(Ncfg):
>                 gc = f.createGroup(snk, dirs[cfg])
>                 data = f.createArray(gc, 'real', c[cfg])
>         else:
>             print('concatonated/' + corr + '_' + tag + '_' + s + '.dat DOES NOT EXIST')
> f.close()

--
All the data continuously generated in your IT infrastructure contains a definitive record of customers, application performance, security threats, fraudulent activity, and more. Splunk takes this data and makes sense of it. IT sense. And common sense. http://p.sf.net/sfu/splunk-novd2d
___
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
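Anthony's shallower-hierarchy suggestion (one Table with src_type, corr_name, etc. columns instead of nested groups) might look like the following sketch. The field names, string widths, and sample values are illustrative assumptions, not Andre's actual schema; the structured dtype is shown in plain NumPy so it runs without an HDF5 file:

```python
import numpy as np

# Hypothetical flat layout: one row per (corr_name, src_type, snk_type, config),
# with the 1 x 48 float array stored inline as a multidimensional column.
row_dtype = np.dtype([
    ('corr_name', 'S32'),
    ('src_type',  'S32'),     # e.g. 'GaussSmeared'
    ('snk_type',  'S8'),      # e.g. 'S' or 'P'
    ('config',    np.int32),  # 1000, 1010, ..., 20100
    ('data',      np.float64, (48,)),
])

rows = np.zeros(3, dtype=row_dtype)
rows['snk_type'] = [b'S', b'P', b'S']
rows['config'] = [1000, 1010, 1020]

# Searching replaces hierarchy traversal: all 'S'-sink rows below config 1015.
sel = rows[(rows['snk_type'] == b'S') & (rows['config'] < 1015)]
print(len(sel))  # 1 matching row
```

PyTables can build a Table from a description like this, and the same selections can then be expressed as where() conditions over the metadata columns, avoiding one group per config.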
Re: [Pytables-users] Slow data retrieve
Hello Matthew,

I think the code can definitely be faster. I am going to ask a series of possibly silly questions, so bear with me; this will help pin down what the problem points are.

1) Have you profiled the code? (I use line_profiler, http://packages.python.org/line_profiler/.) Which lines are taking up the most execution time?

2) Type conversion is expensive. Your table description is 7 string columns, but it seems that the last 6 are numerical. You would save *a lot* of time and space if you actually stored these as numbers. (It is staggering, really.)

3) Your start and end variables are being converted with int(); once again, this conversion is likely not required.

4) Have you considered using NumPy? It looks like you are storing things in lists and then operating on those. NumPy arrays will be much, much faster. (PyTables supports NumPy very well.)

Hope this helps,
Be Well
Anthony

On Fri, Nov 18, 2011 at 7:15 AM, PyTables Org <pytab...@googlemail.com> wrote:

Forwarding to the list. ~Josh.

Begin forwarded message:
From: Matthew Care <matthew.c...@gmail.com>
Date: November 18, 2011 2:02:04 PM GMT+01:00
To: PyTables UserList <pytables-users@lists.sourceforge.net>
Subject: Slow data retrieve

Hi All,

I have a very simple data structure for storing genome data; for each table (chromosome) I have the following description:

class BaseInfo(IsDescription):
    base = StringCol(1)
    phastMammal = StringCol(6)
    phastPrimate = StringCol(6)
    phastVertebrate = StringCol(6)
    phyloMammal = StringCol(6)
    phyloPrimate = StringCol(6)
    phyloVertebrate = StringCol(6)

Each table's chunk size is set to the length of the chromosome.
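Anthony's second point (store the six score columns as numbers, not strings) can be illustrated outside PyTables. The values below are made up to match the example row; in the table description the StringCol(6) fields would become Float32Col instead:

```python
import numpy as np

# Example row values as currently stored: 6-byte strings.
scores_as_strings = np.array([b'0.034', b'0.002', b'0.002',
                              b'0.836', b'1.072', b'1.072'], dtype='S6')

# Stored as numbers instead: 4 bytes per value rather than 6, and no
# string-to-float conversion needed every time a row is read.
scores_as_floats = scores_as_strings.astype(np.float32)

print(scores_as_floats.itemsize)  # 4 bytes per score vs 6 per string
```

Over ~3 billion rows of 6 such columns, that is both a substantial space saving and the removal of billions of per-value parses at query time.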
Thus for each base in the genome I have 7 different bits of information; an example of this is:

A 0.034 0.002 0.002 0.836 1.072 1.072

The total structure of my h5 file looks like this:

I:\#Databases\h5Databases\genomeAnnotations.h5 (File) 'Genome Annotations Database'
Last modif.: 'Fri Jan 28 17:29:31 2011'
Object Tree:
/ (RootGroup) 'Genome Annotations Database'
/Human36release (Group) 'Human 36 Release (hg18)'
/Human36release/baseAndConservation (Group) 'Folder for base and conservation info'
/Human36release/baseAndConservation/chr1 (Table(247249720,), shuffle, lzo(1)) 'Table for chr1'
/Human36release/baseAndConservation/chr10 (Table(135374738,), shuffle, lzo(1)) 'Table for chr10'
/Human36release/baseAndConservation/chr11 (Table(134452385,), shuffle, lzo(1)) 'Table for chr11'
/Human36release/baseAndConservation/chr12 (Table(132349535,), shuffle, lzo(1)) 'Table for chr12'
/Human36release/baseAndConservation/chr13 (Table(114142981,), shuffle, lzo(1)) 'Table for chr13'
/Human36release/baseAndConservation/chr14 (Table(106368586,), shuffle, lzo(1)) 'Table for chr14'
/Human36release/baseAndConservation/chr15 (Table(100338916,), shuffle, lzo(1)) 'Table for chr15'
/Human36release/baseAndConservation/chr16 (Table(88827255,), shuffle, lzo(1)) 'Table for chr16'
/Human36release/baseAndConservation/chr17 (Table(78774743,), shuffle, lzo(1)) 'Table for chr17'
/Human36release/baseAndConservation/chr18 (Table(76117154,), shuffle, lzo(1)) 'Table for chr18'
/Human36release/baseAndConservation/chr19 (Table(63811652,), shuffle, lzo(1)) 'Table for chr19'
/Human36release/baseAndConservation/chr2 (Table(242951150,), shuffle, lzo(1)) 'Table for chr2'
/Human36release/baseAndConservation/chr20 (Table(62435965,), shuffle, lzo(1)) 'Table for chr20'
/Human36release/baseAndConservation/chr21 (Table(46944324,), shuffle, lzo(1)) 'Table for chr21'
/Human36release/baseAndConservation/chr22 (Table(49691433,), shuffle, lzo(1)) 'Table for chr22'
/Human36release/baseAndConservation/chr3 (Table(199501828,), shuffle, lzo(1)) 'Table for chr3'
/Human36release/baseAndConservation/chr4 (Table(191273064,), shuffle, lzo(1)) 'Table for chr4'
/Human36release/baseAndConservation/chr5 (Table(180857867,), shuffle, lzo(1)) 'Table for chr5'
/Human36release/baseAndConservation/chr6 (Table(17083,), shuffle, lzo(1)) 'Table for chr6'
/Human36release/baseAndConservation/chr7 (Table(158821425,), shuffle, lzo(1)) 'Table for chr7'
/Human36release/baseAndConservation/chr8 (Table(146274827,), shuffle, lzo(1)) 'Table for chr8'
/Human36release/baseAndConservation/chr9 (Table(140273253,), shuffle, lzo(1)) 'Table for chr9'
/Human36release/baseAndConservation/chrX (Table(154913755,), shuffle, lzo(1)) 'Table for chrX'
/Human36release/baseAndConservation/chrY (Table(57772955,), shuffle, lzo(1)) 'Table for chrY'

As you can see, this is obviously quite a large h5 file (roughly 35 GB). The problem is that I don't think I'm retrieving data from this as