Re: [Pytables-users] Pytables file reading

2012-08-05 Thread Antonio Valentino
Hi Juan Manuel,

On 04/08/2012 01:55, Juan Manuel Vázquez Tovar wrote:
 Hello all,
 
 I'm managing a file close to 26 GB in size. Its main structure is a table
 with a bit more than 8 million rows. The table is made of four columns: the
 first two columns store names, the 3rd one holds a 53-item array in each
 cell, and the last column holds a 133x6 matrix in each cell.
 I usually work on a Linux workstation with 24 GB of RAM. My usual way of
 working with the file is to retrieve, from each cell in the 4th column of
 the table, the same row of the 133x6 matrix.
 I store the information in a numpy array with shape 8e6x6. In this process
 I use almost the whole workstation memory.
 Is there any way to optimize the memory usage?

I'm not sure I understand.
My impression is that you do not actually need to have the entire 8e6x6
matrix in memory at once, is that correct?

In that case you could simply try to load less data using something like

data = table.read(0, 5e7, field='name of the 4th field')
process(data)
data = table.read(5e7, 1e8, field='name of the 4th field')
process(data)
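A fuller sketch of this chunked approach, filling a preallocated result array instead of keeping whole-table reads in memory. Everything here is illustrative rather than taken from the original setup: the file name, the node path /cases, the column names, the row counts, and the chunk size are all assumptions.

```python
import numpy as np
import tables

# Hypothetical layout mirroring the table described above:
# two name columns, a 53-item array and a 133x6 matrix per row.
class Case(tables.IsDescription):
    name1 = tables.StringCol(16)
    name2 = tables.StringCol(16)
    vec = tables.Float64Col(shape=(53,))
    loads = tables.Float64Col(shape=(133, 6))

NROWS, CHUNK, I = 1000, 256, 4  # small stand-ins; I = matrix row to keep

with tables.open_file("cases.h5", mode="w") as h5:
    tbl = h5.create_table("/", "cases", Case)
    row = tbl.row
    for k in range(NROWS):
        row["loads"] = np.full((133, 6), k, dtype="float64")
        row.append()
    tbl.flush()

    # Chunked read: only the 'loads' field, CHUNK rows at a time,
    # keeping just row I of each 133x6 matrix.
    out = np.empty((tbl.nrows, 6))
    for start in range(0, tbl.nrows, CHUNK):
        stop = min(start + CHUNK, tbl.nrows)
        block = tbl.read(start, stop, field="loads")  # (stop-start, 133, 6)
        out[start:stop] = block[:, I, :]

print(out.shape)  # (1000, 6)
```

Only one CHUNK-sized block of (133, 6) matrices is resident at a time; the peak footprint is roughly CHUNK * 133 * 6 * 8 bytes plus the final (nrows, 6) result.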

See also [1] and [2].

Does this make sense to you?


[1] http://pytables.github.com/usersguide/libref.html#table-methods-reading
[2] http://pytables.github.com/usersguide/libref.html#tables.Table.read

 If not, I have been thinking about splitting the file.
 
 Thank you,
 
 Juanma


cheers

-- 
Antonio Valentino

--
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
___
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users


Re: [Pytables-users] Pytables file reading

2012-08-05 Thread Juan Manuel Vázquez Tovar
Hi Antonio,

You are right, I don't need to load the entire table into memory.
The fourth column has multidimensional cells, and when I read a single row
from every cell in the column, I almost fill the workstation memory.
I didn't expect that process to use so much memory, but the fact is that it
does.
Maybe I didn't explain it very well last time.

Thank you,

Juanma



Re: [Pytables-users] Pytables file reading

2012-08-05 Thread Antonio Valentino
Hi Juan Manuel,

Sorry, I still don't understand.
Can you please post a short code snippet that shows exactly how you
read data into your program?

My impression is that somewhere you use an instruction that triggers
loading of unnecessary data into memory.




-- 
Antonio Valentino

--
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
___
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users


Re: [Pytables-users] Pytables file reading

2012-08-05 Thread Antonio Valentino
Hi Juan Manuel,

On 05/08/2012 22:52, Juan Manuel Vázquez Tovar wrote:
 Hi Antonio,
 
 This is the piece of code I use to read the part of the table I need:
 
 data = [case['loads'][i] for case in table]
 
 where i is the index of the row that I need to read from the 133x6 matrix
 stored in each cell of the loads column.
 
 Juanma
 

That looks perfectly fine to me.
No idea what the issue could be :/

You can perform partial reads using Table.iterrows:

data = [case['loads'][i] for case in table.iterrows(start, stop)]

Please also consider that using a single np.array with 1e8 rows instead
of a list of arrays will allow you to save the memory overhead of 1e8
array objects.
Considering that 6 doubles are 48 bytes while even an empty np.array takes
80 bytes:

In [64]: sys.getsizeof(np.zeros((0,)))
Out[64]: 80

you should be able to reduce the memory footprint by far more than half.
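This overhead is easy to measure with sys.getsizeof. The exact byte counts vary with the NumPy version (the 80 bytes quoted above is from 2012-era NumPy), but the per-object cost dominates the 48-byte payload either way. A small stand-in row count is used here instead of the real one:

```python
import sys
import numpy as np

N = 10_000  # small stand-in for the table's row count

# The same payload stored two ways: a list of tiny arrays vs one block.
as_list = [np.zeros(6) for _ in range(N)]
as_block = np.zeros((N, 6))

per_object = sys.getsizeof(as_list[0])  # ndarray object + its 48-byte buffer
print(per_object > 48)     # True: each element pays object overhead on top
print(as_block.nbytes)     # N * 48: the block stores the payload only, once

# The list also pays ~8 bytes per element for its own pointer storage.
overhead_ratio = (N * per_object + sys.getsizeof(as_list)) / as_block.nbytes
print(overhead_ratio > 2)  # True: well over half the list's memory is overhead
```

The block form also keeps the 48-byte rows contiguous, which helps any later vectorized processing.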


cheers



-- 
Antonio Valentino

--
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
___
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users


Re: [Pytables-users] Pytables file reading

2012-08-05 Thread Juan Manuel Vázquez Tovar
Thank you Antonio, I will try it.

Cheers

Juanma


--
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
___
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users