Re: [Pytables-users] adding column index hides data(?)

2011-11-18 Thread Josh Moore

On Nov 17, 2011, at 10:35 PM, Alan Marchiori wrote:

 Hello,

Hi Alan,

 I am attempting to use PyTables (v2.3.1) to store timestamped data and
 things were going well until I added a column index.  While the column
 is indexed, no data is returned from a table.where call!
 
 This behavior is demonstrated with the following test code:
 ---begin test.py---
 import tables
 import random

 class Descr(tables.IsDescription):
     when = tables.Time64Col(pos = 1)
     value = tables.Float32Col(pos = 2)

 h5f = tables.openFile('/tmp/tmp.h5', 'w')
 tbl = h5f.createTable('/', 'test', Descr)

 tbl.cols.when.createIndex(_verbose = True)

 t = 1321031471.0  # 11/11/11 11:11:11
 tbl.append([(t + i, random.random()) for i in range(1000)])
 tbl.flush()

 def query(s):
     print 'is_index =', tbl.cols.when.is_indexed
     print [(row['when'], row['value']) for row in tbl.where(s)]
     print tbl.readWhere(s)

 wherestr = '(when >= %d) & (when < %d)' % (t, t + 5)
 query(wherestr)
 tbl.cols.when.removeIndex()
 query(wherestr)

 h5f.close()
 ---end test.py---
 
 This creates the table for storing time/value pairs, inserts some
 synthetic data, and then checks to see if there is data in the table.
 When the table is created, an index is added to the 'when'
 column.  The first query returns no data (which is incorrect).  Then
 the column index is removed (via removeIndex) and the query is
 repeated.  This time 5 results are returned, as expected.  The data is
 clearly there; however, the index is somehow breaking the where logic.
 Here is the output I get:
 
 ---begin output---
 is_index = True
 []
 []
 is_index = False
 [(1321031471.0, 0.6449417471885681), (1321031472.0,
 0.7889317274093628), (1321031473.0, 0.609708845615387), (1321031474.0,
 0.9120397567749023), (1321031475.0, 0.2386845201253891)]
 [(1321031471.0, 0.6449417471885681) (1321031472.0, 0.7889317274093628)
 (1321031473.0, 0.609708845615387) (1321031474.0, 0.9120397567749023)
 (1321031475.0, 0.2386845201253891)]
 ---end output---
 
 Creating the index after the data has been inserted produces the same
 behavior (no data is returned while the index exists).  Any
 suggestions would be greatly appreciated.

I've reproduced this with a number of different index configurations. If I
change the column type to Float64, the index works as expected.
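
As a stopgap, here is a minimal sketch of that workaround (the file name is
just a placeholder, and this only reflects my quick check, not a proper fix):

import tables
import random

class Descr(tables.IsDescription):
    when = tables.Float64Col(pos=1)   # plain float seconds instead of Time64Col
    value = tables.Float32Col(pos=2)

h5f = tables.openFile('/tmp/tmp_f64.h5', 'w')
tbl = h5f.createTable('/', 'test', Descr)

t = 1321031471.0
tbl.append([(t + i, random.random()) for i in range(1000)])
tbl.flush()
tbl.cols.when.createIndex()

# with Float64, the indexed query returns the expected 5 rows
print tbl.readWhere('(when >= %d) & (when < %d)' % (t, t + 5))
h5f.close()

The full before/after numbers from my test harness are below.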

BEFORE (Time64Col):
Initial index: verbose      has_index= True   use_index= frozenset(['Awhen'])  where= 0  readWhere= 0
remove index                has_index= False  use_index= frozenset([])         where= 5  readWhere= 5
re-add index (non-verbose)  has_index= True   use_index= frozenset(['Awhen'])  where= 0  readWhere= 0
remove again                has_index= False  use_index= frozenset([])         where= 5  readWhere= 5
re-add index (with flush)   has_index= True   use_index= frozenset(['Awhen'])  where= 0  readWhere= 0
re-add index (full)         has_index= True   use_index= frozenset(['Awhen'])  where= 0  readWhere= 0
re-add index (ultralight)   has_index= True   use_index= frozenset(['Awhen'])  where= 0  readWhere= 0
re-add index (o=0)          has_index= True   use_index= frozenset(['Awhen'])  where= 0  readWhere= 0
re-add index (o=9)          has_index= True   use_index= frozenset(['Awhen'])  where= 0  readWhere= 0
re-index                    has_index= True   use_index= frozenset(['Awhen'])  where= 0  readWhere= 0
also index value            has_index= True   use_index= frozenset(['Awhen'])  where= 0  readWhere= 0

AFTER (Float64Col):
Initial index: verbose      has_index= True   use_index= frozenset(['Awhen'])  where= 5  readWhere= 5
remove index                has_index= False  use_index= frozenset([])         where= 5  readWhere= 5
re-add index (non-verbose)  has_index= True   use_index= frozenset(['Awhen'])  where= 5  readWhere= 5
remove again                has_index= False  use_index= frozenset([])         where= 5  readWhere= 5
re-add index (with flush)   has_index= True   use_index= frozenset(['Awhen'])  where= 5  readWhere= 5
re-add index (full)         has_index= True   use_index= frozenset(['Awhen'])  where= 5  readWhere= 5
re-add index (ultralight)   has_index= True   use_index= frozenset(['Awhen'])  where= 5  readWhere= 5
re-add index (o=0)          has_index= True   use_index= frozenset(['Awhen'])  where= 5  readWhere= 5
re-add index (o=9)          has_index= True   use_index=

Re: [Pytables-users] new user: advice on how to structure files

2011-11-18 Thread Anthony Scopatz
On Thu, Nov 17, 2011 at 6:20 PM, Andre' Walker-Loud walksl...@gmail.com wrote:

 Hi All,

 I just stumbled upon pytables, and have been playing around with
 converting my data files into hdf5 using pytables.  I am wondering about
 strategies to create data files.

 I have created a file with the following group structure

 root
  corr_name
src_type
  snk_type
config
  data

 the data = 1 x 48 array of floats
 config = a set which is to be averaged over, in this particular case,
 1000, 1010, ..., 20100 (1911 in all)
 the other three groups just collect metadata describing the data
 below, and provide a natural way to build matrices of data files, allowing
 the user (my collaborators) to pick and choose various combinations of srcs
 and snks (instead of taking them all).


This seems pretty reasonable.

You could also try to rearrange your data to have a shallower hierarchy and
have everything stored in Tables with src_type, corr_name, etc. columns that
you then search through (see the sketch below).  The reason for doing this
is to avoid the overhead of the hierarchy (not only the space on disk but
also the speed of traversal).  But what you have definitely works.
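
For example, something along these lines (just a sketch; the column widths,
file name, and Float64 data type are guesses on my part, with the 48 taken
from your 1 x 48 arrays):

import tables

class Corr(tables.IsDescription):
    corr_name = tables.StringCol(32)            # e.g. 'name1'
    src_type  = tables.StringCol(32)            # e.g. 'Src_GaussSmeared'
    snk_type  = tables.StringCol(8)             # e.g. 'Snk_S', 'Snk_P'
    config    = tables.Int32Col()               # 1000, 1010, ..., 20100
    data      = tables.Float64Col(shape=(48,))  # the 1 x 48 array

h5f = tables.openFile('nplqcd_iso_flat.h5', 'w')
tbl = h5f.createTable('/', 'corrs', Corr, 'flat correlator table')

# one row per (corr, src, snk, config) combination
row = tbl.row
row['corr_name'] = 'name1'
row['src_type']  = 'Src_GaussSmeared'
row['snk_type']  = 'Snk_S'
row['config']    = 1000
row['data']      = [0.0] * 48
row.append()
tbl.flush()

# later, pick and choose combinations with an in-kernel query
rows = tbl.readWhere('(corr_name == "name1") & (snk_type == "Snk_S")')
h5f.close()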



 This structure arises naturally (to me) from the type of data files I am
 storing/analyzing, but I imagine there are better ways to build the file
 (also, when I make my file this way, it is only 105 MB, but it causes
 HDFView to fail to open it with an OutOfMemory error).  I would appreciate
 any advice on how to do this better.


I use ViTables to view much larger files than that.  I would recommend
checking it out.

Be Well
Anthony



 Below is the relevant python script which creates my file.

 Thanks,

 Andre

 import tables as pyt
 import personal_calls_to_numpy as pc
 import os

 corrs = ['name1','name2',...]
 dirs = []
 for no in range(1000,20101,10):
     dirs.append('c'+str(no))
     #dirs.append(str(no))  #this gives NaturalNaming error

 f = pyt.openFile('nplqcd_iso_old.h5','w')
 root = f.root
 for corr in corrs:
     cg = f.createGroup(root,corr.split('_')[-1])
     src = f.createGroup(cg,'Src_GaussSmeared')
     for s in ['S','P']:
         if os.path.exists('concatonated/'+corr+'_'+tag+'_'+s+'.dat'):
             print('adding '+corr+'_'+tag+'_'+s+'.dat')
             h,c = pc.read_corr('concatonated/'+corr+'_'+tag+'_'+s+'.dat')
             Ncfg = int(h[0]); NT = int(h[1])
             snk = f.createGroup(src,'Snk_'+s)
             #data = f.createArray(snk,'real',c)
             for cfg in range(Ncfg):
                 gc = f.createGroup(snk,dirs[cfg])
                 data = f.createArray(gc,'real',c[cfg])
         else:
             print('concatonated/'+corr+'_'+tag+'_'+s+'.dat DOES NOT EXIST')
 f.close()





Re: [Pytables-users] Slow data retrieve

2011-11-18 Thread Anthony Scopatz
Hello Matthew,

I think this code can definitely be faster.  I am going to ask a series of
possibly silly questions, so bear with me.  This will help pin down where
the problem points are.

1) Have you profiled the code?  (I use line_profiler:
http://packages.python.org/line_profiler/.)  Which lines are taking up the
most execution time?

2) Type conversion is expensive.  Your table description has 7 string
columns, but it seems that the last 6 are numerical.  You would save *a
lot* of time and space if you actually stored these as numbers.  (It is
staggering, really; see the sketch below.)

3) Your start and end variables are being converted with int(); once again,
this conversion is likely not required.

4) Have you considered using NumPy?  It looks like you are storing things
in lists and then operating on those.  NumPy arrays will be much, much
faster.  (PyTables supports NumPy very well.)
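
To make points 2 and 4 concrete, here is a minimal sketch (the file name is
a placeholder and Float32 is just a guess at the precision you need for the
scores):

import tables

# point 2: store the conservation scores as numbers, not 6-char strings
class BaseInfo(tables.IsDescription):
    base            = tables.StringCol(1)
    phastMammal     = tables.Float32Col()
    phastPrimate    = tables.Float32Col()
    phastVertebrate = tables.Float32Col()
    phyloMammal     = tables.Float32Col()
    phyloPrimate    = tables.Float32Col()
    phyloVertebrate = tables.Float32Col()

h5f = tables.openFile('/tmp/genome_sketch.h5', 'w')
tbl = h5f.createTable('/', 'chr1', BaseInfo, 'Table for chr1')
tbl.append([('A', 0.034, 0.002, 0.002, 0.836, 1.072, 1.072)] * 3000)
tbl.flush()

# point 4: read a slice as a NumPy record array and use vectorized
# operations instead of looping over Python lists
start, stop = 1000, 2000
chunk = tbl.read(start, stop)            # NumPy record array
mean_mammal = chunk['phastMammal'].mean()
print mean_mammal
h5f.close()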

Hope this helps,
Be Well
Anthony

On Fri, Nov 18, 2011 at 7:15 AM, PyTables Org pytab...@googlemail.com wrote:

 Forwarding to the list. ~Josh.

 Begin forwarded message:

 From: pytables-users-boun...@lists.sourceforge.net
 Date: November 18, 2011 2:02:15 PM GMT+01:00
 To: pytables-users-ow...@lists.sourceforge.net
 Subject: Auto-discard notification

 The attached message has been automatically discarded.

 From: Matthew Care matthew.c...@gmail.com
 Date: November 18, 2011 2:02:04 PM GMT+01:00
 To: PyTables UserList pytables-users@lists.sourceforge.net
 Subject: Slow data retrieve


 Hi All,

 I have a very simple data structure for storing genome data; basically, for
 each table (chromosome) I have the following data structure:

 class BaseInfo(IsDescription):
     base = StringCol(1)
     phastMammal = StringCol(6)
     phastPrimate = StringCol(6)
     phastVertebrate = StringCol(6)
     phyloMammal = StringCol(6)
     phyloPrimate = StringCol(6)
     phyloVertebrate = StringCol(6)

 Each table's chunk size is set to the length of the chromosome.
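
 (For reference, each chromosome table is created roughly along these lines;
 this is a sketch rather than the exact call, and here expectedrows is what
 drives the chunk size:)

 from tables import openFile, Filters

 chrom_len = 247249720   # chr1 row count, as in the tree below
 h5f = openFile('genomeAnnotations.h5', 'a')
 tbl = h5f.createTable('/Human36release/baseAndConservation', 'chr1',
                       BaseInfo, 'Table for chr1',
                       filters=Filters(complevel=1, complib='lzo', shuffle=True),
                       expectedrows=chrom_len, createparents=True)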

 Thus, for each base in the genome I have 7 different bits of information;
 an example of this is:
 A 0.034 0.002 0.002 0.836 1.072 1.072

 The total structure of my h5 file looks like this:

 I:\#Databases\h5Databases\genomeAnnotations.h5 (File) 'Genome Annotations
 Database'
 Last modif.: 'Fri Jan 28 17:29:31 2011'
 Object Tree:
 / (RootGroup) 'Genome Annotations Database'
 /Human36release (Group) 'Human 36 Release (hg18)'
 /Human36release/baseAndConservation (Group) 'Folder for base and
 conservation info'
 /Human36release/baseAndConservation/chr1 (Table(247249720,), shuffle,
 lzo(1)) 'Table for chr1'
 /Human36release/baseAndConservation/chr10 (Table(135374738,), shuffle,
 lzo(1)) 'Table for chr10'
 /Human36release/baseAndConservation/chr11 (Table(134452385,), shuffle,
 lzo(1)) 'Table for chr11'
 /Human36release/baseAndConservation/chr12 (Table(132349535,), shuffle,
 lzo(1)) 'Table for chr12'
 /Human36release/baseAndConservation/chr13 (Table(114142981,), shuffle,
 lzo(1)) 'Table for chr13'
 /Human36release/baseAndConservation/chr14 (Table(106368586,), shuffle,
 lzo(1)) 'Table for chr14'
 /Human36release/baseAndConservation/chr15 (Table(100338916,), shuffle,
 lzo(1)) 'Table for chr15'
 /Human36release/baseAndConservation/chr16 (Table(88827255,), shuffle,
 lzo(1)) 'Table for chr16'
 /Human36release/baseAndConservation/chr17 (Table(78774743,), shuffle,
 lzo(1)) 'Table for chr17'
 /Human36release/baseAndConservation/chr18 (Table(76117154,), shuffle,
 lzo(1)) 'Table for chr18'
 /Human36release/baseAndConservation/chr19 (Table(63811652,), shuffle,
 lzo(1)) 'Table for chr19'
 /Human36release/baseAndConservation/chr2 (Table(242951150,), shuffle,
 lzo(1)) 'Table for chr2'
 /Human36release/baseAndConservation/chr20 (Table(62435965,), shuffle,
 lzo(1)) 'Table for chr20'
 /Human36release/baseAndConservation/chr21 (Table(46944324,), shuffle,
 lzo(1)) 'Table for chr21'
 /Human36release/baseAndConservation/chr22 (Table(49691433,), shuffle,
 lzo(1)) 'Table for chr22'
 /Human36release/baseAndConservation/chr3 (Table(199501828,), shuffle,
 lzo(1)) 'Table for chr3'
 /Human36release/baseAndConservation/chr4 (Table(191273064,), shuffle,
 lzo(1)) 'Table for chr4'
 /Human36release/baseAndConservation/chr5 (Table(180857867,), shuffle,
 lzo(1)) 'Table for chr5'
 /Human36release/baseAndConservation/chr6 (Table(17083,), shuffle,
 lzo(1)) 'Table for chr6'
 /Human36release/baseAndConservation/chr7 (Table(158821425,), shuffle,
 lzo(1)) 'Table for chr7'
 /Human36release/baseAndConservation/chr8 (Table(146274827,), shuffle,
 lzo(1)) 'Table for chr8'
 /Human36release/baseAndConservation/chr9 (Table(140273253,), shuffle,
 lzo(1)) 'Table for chr9'
 /Human36release/baseAndConservation/chrX (Table(154913755,), shuffle,
 lzo(1)) 'Table for chrX'
 /Human36release/baseAndConservation/chrY (Table(57772955,), shuffle,
 lzo(1)) 'Table for chrY'

 As you can see, this is obviously quite a large h5 file (roughly 35 GB).

 The problem is that I don't think I'm retrieving data from this as