Re: [Pytables-users] Row.append() performance

2013-04-15 Thread Shyam Parimal Katti
Hello Anthony,


Thank you for your suggestions. When I mentioned that I am reading the
data from a database, I meant a DB2 database, not an HDF5 database/file.


I followed your suggestions, so the code looks as follows:


def createHDF5File():
    h5File = tables.openFile("file name", mode="a")
    table = h5File.createTable(h5File.root, "Contact", Contact,
                               "Contact", expectedrows=700)
    ...


def getDataAndWriteHDF5(table):
    databaseConn = pyodbc.connect("connection string", password)
    cursor = databaseConn.cursor()
    cursor.execute("SQL Query")
    resultSet = np.fromiter((tuple(row) for row in cursor),
                            dtype=[('name', np.str_, 180),
                                   ('address', np.str_, 4200),
                                   ('email', np.str_, 180),
                                   ('phone', np.str_, 256)])
    table.append(resultSet)



Error message: MemoryError: cannot allocate array memory.


I am setting the `expectedrows` parameter when creating the table in the
HDF5 file, and yet I encounter the error above. Looking forward to
suggestions.
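
For reference, the in-memory cost of one row of that dtype is easy to check;
a small sketch (the 'S' byte-string fields mirror the (numpy.str_, N) fields
above, and the 10 million row count is purely illustrative, not a figure from
this thread):

    import numpy as np

    dt = np.dtype([('name', 'S180'), ('address', 'S4200'),
                   ('email', 'S180'), ('phone', 'S256')])
    print(dt.itemsize)                           # 4816 bytes per row
    print(10 * 10**6 * dt.itemsize / 2.0**30)    # ~45 GiB for 10 million rows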




 Hello Anthony,

 Thank you for replying back with suggestions.

 In response to your suggestions, I am *not reading the data from a file
 in the first step, but instead a database*.


Hello Shyam,

To put too fine a point on it, hdf5 databases are files.  And reading from
any kind of file incurs the same disk read overhead.


  I did try out your 1st suggestion of doing a table.append(list of
 tuples), which took a little longer than the execution time I got with the
 original code. Can you please guide me on how to chunk the data (that I got
 from the database and stored as a list of tuples in Python)?


Ahh, so you should not be using a list of tuples.  These are Pythonic types
and conversion between HDF5 types and Python types is what is slowing you
down.  You should be passing a numpy structured array into append().  Numpy
types are very similar to (and often exactly the same as) HDF5 types.  For
large, contiguous, structured data you want to avoid the Python interpreter
as much as possible.  Use Python here as the glue code to compose a series
of fast operations using the APIs exposed by numpy, pytables, etc.
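
As a minimal, hypothetical sketch of that structured-array path (the file,
table and column names below are invented for illustration, using the
PyTables 2.x spellings from this thread):

    import numpy as np
    import tables

    class Contact(tables.IsDescription):
        email = tables.StringCol(180)
        name = tables.StringCol(180)

    h5f = tables.openFile('contacts.h5', mode='w')
    tbl = h5f.createTable(h5f.root, 'contacts', Contact, "Contacts")

    # A structured array with matching field names and byte-string columns.
    dt = np.dtype([('email', 'S180'), ('name', 'S180')])
    rows = np.array([('a@example.com', 'Alice'),
                     ('b@example.com', 'Bob')], dtype=dt)
    tbl.append(rows)    # one bulk append; no per-row Python conversion
    h5f.close()

The whole array goes into append() in a single call, so the per-row
Python-object overhead disappears.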

Be Well
Anthony




Re: [Pytables-users] Row.append() performance

2013-04-15 Thread Anthony Scopatz
Hello Shyam,

Can you please post the full traceback?  In any event, I am fairly certain
that this error is coming from the np.fromiter step.  The problem here is
that you are trying to read your entire SQL query result into a single numpy array
in memory.  This is impossible because you don't have enough RAM.
Therefore, you are going to need to read and write in chunks.  Something
like the following:

def getDataAndWriteHDF5(table):
    databaseConn = pyodbc.connect("connection string", password)
    cursor = databaseConn.cursor()
    cursor.execute("SQL Query")
    dt = np.dtype([('name', np.str_, 180), ('address', np.str_, 4200),
                   ('email', np.str_, 180), ('phone', np.str_, 256)])
    citer = iter(cursor)
    chunksize = 4096  # just a guess; other values might work better
    crange = range(chunksize)
    while True:
        # zip() stops after at most `chunksize` rows, so each pass
        # materializes only one chunk in memory.
        resultSet = np.fromiter((tuple(row) for i, row in zip(crange, citer)),
                                dtype=dt)
        table.append(resultSet)
        if len(resultSet) < chunksize:   # cursor exhausted
            break

You may want to tweak some things, but that is the basic strategy.
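
A variant of the same strategy, if it helps: a hypothetical helper
(writeQueryInChunks is not an existing function, and this is untested against
DB2) that lets the standard DB-API fetchmany() do the chunking instead of the
zip/range trick:

    def writeQueryInChunks(cursor, table, chunksize=4096):
        dt = np.dtype([('name', np.str_, 180), ('address', np.str_, 4200),
                       ('email', np.str_, 180), ('phone', np.str_, 256)])
        while True:
            rows = cursor.fetchmany(chunksize)   # at most `chunksize` rows
            if not rows:                         # cursor exhausted
                break
            table.append(np.fromiter((tuple(r) for r in rows), dtype=dt))
        table.flush()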

Be Well
Anthony



Re: [Pytables-users] Row.append() performance

2013-04-11 Thread Shyam Parimal Katti
Hello Anthony,

Thank you for replying back with suggestions.

In response to your suggestions, I am *not reading the data from a file in
the first step, but instead a database*. I did try out your 1st suggestion
of doing a table.append(list of tuples), which took a little longer than the
execution time I got with the original code. Can you please guide me on how
to chunk the data (that I got from the database and stored as a list of
tuples in Python)?


Thanks,
Shyam

Hi Shyam,

The pattern that you are using to write to a table is basically one for
writing Python data to HDF5.  However, your data is already in a machine /
HDF5 native format.  Thus what you are doing here is an excessive amount of
 work:  read data from file -> convert to Python data structures -> convert
 back to HDF5 data structures -> write to file.

When reading from a table you get back a numpy structured array (look them
up on the numpy website).

Then instead of using rows to write back the data, just use Table.append()
 [1] which lets you pass in a bunch of rows simultaneously.  (Note: your
 data in this case is too large to fit into memory, so you may have to split
 it up into chunks or use the new iterators which are in the development
branch.)
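
 A rough sketch of that chunked, table-to-table append (srcTable, dstTable
 and the chunk size are placeholders, not names from this thread):

     chunk = 100000                  # rows per block; tune to available memory
     for start in range(0, srcTable.nrows, chunk):
         block = srcTable.read(start=start, stop=start + chunk)  # structured array
         dstTable.append(block)
     dstTable.flush()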

 Additionally, if all you are doing is copying a table wholesale, you should
 use Table.copy() [2].  Or if you only want to copy some subset based on
 a condition you provide, use whereAppend() [3].

Finally, if you want to do math or evaluate expressions on one table to
create a new table, use the Expr class [4].

All of these will be way faster than what you are doing right now.
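
 To make those three options concrete, a hypothetical sketch (t is the source
 table; dstGroup, adultsTable, the 'age' column and the arrays are invented
 for illustration, using the PyTables 2.x spellings):

     # 1. Wholesale copy of a table under some destination group.
     newTable = t.copy(newparent=dstGroup, newname='ContactCopy')

     # 2. Append only rows matching an in-kernel condition to a second table
     #    with the same description ('age' is a made-up numeric column).
     t.whereAppend(adultsTable, 'age > 21')

     # 3. Evaluate an expression over columns/arrays chunk by chunk instead
     #    of materializing the operands in memory.
     result = tables.Expr('2 * a + b', uservars={'a': arrA, 'b': arrB}).eval()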

Be Well
Anthony

 1. http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.append
 2. http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.copy
 3. http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.whereAppend
 4. http://pytables.github.io/usersguide/libref/expr_class.html




On Thu, Apr 11, 2013 at 12:23 PM, Shyam Parimal Katti spk...@nyu.edu wrote:

 Hello,

 I am writing a lot of data (close to 122GB) to an hdf5 file using PyTables.
 The execution time for writing the query result to the file is close to 10
 hours, which includes querying the database and then writing to the file.
 When I timed the entire execution, I found that it takes as much time to
 get the data from the database as it takes to write to the hdf5 file. Here
 is a small snippet (P.S.: the execution time noted below is not for the
 122GB of data, but for a small subset close to 10GB):

 class ContactClass(tb.IsDescription):
     name = tb.StringCol(4200)
     address = tb.StringCol(4200)
     emailAddr = tb.StringCol(180)
     phone = tb.StringCol(256)

 h5File = tb.openFile("file name", mode="a", title="Contacts")
 t = h5File.createTable(h5File.root, 'ContactClass', ContactClass,
                        filters=tb.Filters(5, 'blosc'), expectedrows=77806938)

 resultSet = ...  # get data from the database
 currRow = t.row
 print("Before appending data: %s" % str(datetime.now()))
 for attributes in resultSet:
     currRow['name'] = attributes[0]
     currRow['address'] = attributes[1]
     currRow['emailAddr'] = attributes[2]
     currRow['phone'] = attributes[3]
     currRow.append()
 print("After done appending: %s" % str(datetime.now()))
 t.flush()
 print("After done flushing: %s" % str(datetime.now()))

 .. gives me:

 Before appending data: 2013-04-11 10:42:39.903713
 After done appending: 2013-04-11 11:04:10.002712
 After done flushing: 2013-04-11 11:05:50.059893

 It seems like append() takes a lot of time. Any suggestions on how to
 improve this?

 Thanks,
 Shyam

