speed up pandas calculation

2014-07-30 Thread Vincent Davis
I know this is a general Python list and I am asking about pandas, but this
question is probably not a great fit for Stack Overflow.
I have a list of files (~80 files, ~30,000 rows each) to process, and with
my current code it takes minutes per file. Any suggestions for a faster way?
I am trying to stick with pandas for educational purposes. If you are
curious, you can find the data file I am using here:
http://www.nber.org/nhamcs/data/nhamcsopd2010.csv

drugs_current = {'CITALOPRAM': 4332,
                 'ESCITALOPRAM': 4812,
                 'FLUOXETINE': 236,
                 'FLUVOXAMINE': 3804,
                 'PAROXETINE': 3157,
                 'SERTRALINE': 880,
                 'METHYLPHENIDATE': 900,
                 'DEXMETHYLPHENIDATE': 4777,
                 'AMPHETAMINE-DEXTROAMPHETAMINE': 4035,
                 'DEXTROAMPHETAMINE': 804,
                 'LISDEXAMFETAMINE': 6663,
                 'METHAMPHETAMINE': 805,
                 'ATOMOXETINE': 4827,
                 'CLONIDINE': 44,
                 'GUANFACINE': 717}

drugs_98_05 = {'SERTRALINE': 56635,
               'CITALOPRAM': 59829,
               'FLUOXETINE': 80006,
               'PAROXETINE_HCL': 57150,
               'FLUVOXAMINE': 57064,
               'ESCITALOPRAM': 70466,
               'DEXMETHYLPHENIDATE': 70427,
               'METHYLPHENIDATE': 70374,
               'METHAMPHETAMINE': 53485,
               'AMPHETAMINE1': 70257,
               'AMPHETAMINE2': 70258,
               'AMPHETAMINE3': 50265,
               'DEXTROAMPHETAMINE1': 70259,
               'DEXTROAMPHETAMINE2': 70260,
               'DEXTROAMPHETAMINE3': 51665,
               'COMBINATION_PRODUCT': 51380,
               'FIXED_COMBINATION': 51381,
               'ATOMOXETINE': 70687,
               'CLONIDINE1': 51275,
               'CLONIDINE2': 70357,
               'GUANFACINE': 52498}

import pandas as pd

df = pd.read_csv('nhamcsopd2010.csv', index_col='PATCODE',
                 low_memory=False)
col_init = list(df.columns.values)
keep_col = ['PATCODE', 'PATWT', 'VDAY', 'VMONTH', 'VYEAR', 'MED1', 'MED2',
            'MED3', 'MED4', 'MED5']
for col in col_init:
    if col not in keep_col:
        del df[col]
if f[-3:] == 'csv' and f[-6:-4] in ('93', '94', '95', '96', '97', '98',
                                    '99', '00', '91', '02', '03', '04', '05'):
    drugs = drugs_98_05
elif f[-3:] == 'csv' and f[-6:-4] in ('06', '08', '09', '10'):
    drugs = drugs_current
for n in drugs:
    df[n] = df[['MED1','MED2','MED3','MED4','MED5']].isin([drugs[n]]).any(1)


Vincent Davis
720-301-3003
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: speed up pandas calculation

2014-07-30 Thread Mark Lawrence

On 31/07/2014 00:04, Vincent Davis wrote:

[original message and code snipped]

I suggest you ask at https://mail.python.org/mailman/listinfo/pandas-dev,
which I believe is also available as gmane.comp.python.pydata.



--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

--
https://mail.python.org/mailman/listinfo/python-list


Re: speed up pandas calculation

2014-07-30 Thread Skip Montanaro
 df = pd.read_csv('nhamcsopd2010.csv', index_col='PATCODE',
                  low_memory=False)
 col_init = list(df.columns.values)
 keep_col = ['PATCODE', 'PATWT', 'VDAY', 'VMONTH', 'VYEAR', 'MED1',
             'MED2', 'MED3', 'MED4', 'MED5']
 for col in col_init:
     if col not in keep_col:
         del df[col]

I'm no pandas expert, but a couple things come to mind. First, where is
your code slow (profile it, even with a few well-placed prints)? If it's in
read_csv there might be little you can do unless you load those data
repeatedly, and can save a pickled data frame as a caching measure. Second,
you loop over columns deciding one by one whether to keep or toss a column.
Instead try

df = df[keep_col]

Third, if deleting those other columns is costly, can you perhaps just
ignore them?
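
A minimal sketch of that caching idea, in case it helps (untested, and
load_frame and the .pkl cache path are just my own invention):

import os
import pandas as pd

def load_frame(csv_path, keep_col):
    # Reuse a pickled copy on later runs; parse the CSV only the first time.
    cache_path = csv_path + '.pkl'  # hypothetical cache file location
    if os.path.exists(cache_path):
        return pd.read_pickle(cache_path)
    df = pd.read_csv(csv_path, index_col='PATCODE', low_memory=False)
    # Keep only the wanted columns in one step, skipping any that don't
    # exist (e.g. PATCODE, which became the index above).
    df = df[[c for c in keep_col if c in df.columns]]
    df.to_pickle(cache_path)  # pickle reloads much faster than re-parsing
    return df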

Can't be more investigative right now. I don't have pandas on Android. :-)

Skip
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: speed up pandas calculation

2014-07-30 Thread Vincent Davis
On Wed, Jul 30, 2014 at 6:28 PM, Vincent Davis vinc...@vincentdavis.net
wrote:

 The real slow part seems to be
 for n in drugs:
     df[n] = df[['MED1','MED2','MED3','MED4','MED5']].isin([drugs[n]]).any(1)


I was wrong; that part is fast. It was selecting the columns that was slow.
Using

keep_col = ['PATCODE', 'PATWT', 'VDAYR', 'VMONTH', 'MED1', 'MED2', 'MED3',
            'MED4', 'MED5']
df = df[keep_col]

took the time down from 19 sec to 2 sec.


Vincent Davis
720-301-3003
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: speed up pandas calculation

2014-07-30 Thread Steven D'Aprano
On Wed, 30 Jul 2014 17:04:04 -0600, Vincent Davis wrote:

 I know this is a general Python list and I am asking about pandas, but
 this question is probably not a great fit for Stack Overflow. I have a
 list of files (~80 files, ~30,000 rows each) to process, and with my
 current code it takes minutes per file.

Hmmm, is that 30,000 rows per file, or 30,000 files in total?

Not that it really matters, I shouldn't expect that it makes that much 
difference either way.


 Any suggestions for a faster way? I am trying to stick with pandas for
 educational purposes. If you are curious, you can find the data file I
 am using here:
 http://www.nber.org/nhamcs/data/nhamcsopd2010.csv


For brevity, I've trimmed back the dictionaries to something smaller. 
That's just for ease of reading.

 drugs_current = {'CITALOPRAM': 4332,
  'ESCITALOPRAM': 4812,
   [...]
  'CLONIDINE': 44,
  'GUANFACINE': 717}
 
 drugs_98_05 = { 'SERTRALINE': 56635,
 'CITALOPRAM': 59829,
  [...]
 'CLONIDINE2': 70357,
 'GUANFACINE': 52498
}
 
 df = pd.read_csv('nhamcsopd2010.csv', index_col='PATCODE',
                  low_memory=False)
 col_init = list(df.columns.values)
 keep_col = ['PATCODE', 'PATWT', 'VDAY', 'VMONTH', 'VYEAR', 'MED1',
             'MED2', 'MED3', 'MED4', 'MED5']
 for col in col_init:
     if col not in keep_col:
         del df[col]

I expect that this could be your problem here. Deleting from the start or 
middle of lists is slow, and df may be a list or at least have list-like 
performance. Suppose you have a list like this:

['a', 'b', 'c', 'd', 'e', ..., 'zzz']

that is, a total of 26 + 26**2 + 26**3 = 18278 items. Now suppose you 
delete item 0, 'a':

= ['b', 'c', 'd', 'e', ..., 'zzz']

Python has to move the remaining 18278 items across one space. Then you 
delete 'b':

= ['c', 'd', 'e', ..., 'zzz']

Python has to move the remaining 18276 items across one space, making a 
total of 36559 moves. And that's just to delete two items. Roughly 
speaking, if you end up deleting N items from a list starting from the 
front, Python may have to move as many as N**2 items into their final 
positions. If you have 5 or 10 columns, that's not too bad, but if you 
have (say) 80 columns, and delete 70 of them, that could be *horribly* 
expensive.

If you must *repeatedly* use del on lists, it's best to ensure you're 
deleting from the end, not the start. But even better, and this applies 
to anything not just lists, is not to delete at all, but to create a new 
list, copying the columns you want, rather than deleting the columns you 
don't want.
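
If you want to see the effect for yourself, here's a quick timing sketch
(exact numbers will vary by machine):

from timeit import timeit

setup = "data = list(range(20000))"
# Deleting from the front shifts every remaining item each time: O(N**2).
print(timeit("while data: del data[0]", setup=setup, number=1))
# Deleting from the end shifts nothing: O(N) overall.
print(timeit("while data: del data[-1]", setup=setup, number=1))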

I'm not familiar with pandas and am not sure about the exact syntax 
needed, but something like:

new_df = []  # Assuming df is a list.
for col in df:
    if col.value in keep_col:
        new_df.append(col)


 if f[-3:] == 'csv' and f[-6:-4] in ('93', '94', '95', '96', '97', '98',
 '99', '00', '91', '02', '03', '04', '05'):

Where does f come from? You haven't shown the definition of that.



-- 
Steven
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: speed up pandas calculation

2014-07-30 Thread Chris Kaynor
On Wed, Jul 30, 2014 at 5:57 PM, Vincent Davis vinc...@vincentdavis.net
wrote:

 On Wed, Jul 30, 2014 at 6:28 PM, Vincent Davis vinc...@vincentdavis.net
 wrote:

 The real slow part seems to be
 for n in drugs:
     df[n] = df[['MED1','MED2','MED3','MED4','MED5']].isin([drugs[n]]).any(1)


 I was wrong; that part is fast. It was selecting the columns that was slow.
 Using


And that shows why profiling is important before attempting to optimize. :)


  keep_col = ['PATCODE', 'PATWT', 'VDAYR', 'VMONTH', 'MED1', 'MED2',
              'MED3', 'MED4', 'MED5']
 df = df[keep_col]

 took the time down from 19 sec to 2 sec.


 On Wed, Jul 30, 2014 at 5:57 PM, Steven D'Aprano 
steve+comp.lang.pyt...@pearwood.info wrote:

 ['a', 'b', 'c', 'd', 'e', ..., 'zzz']

 that is, a total of 26 + 26**2 + 26**3 = 18278 items. Now suppose you
 delete item 0, 'a':

 = ['b', 'c', 'd', 'e', ..., 'zzz']

 Python has to move the remaining 18278 items across one space. Then you
 delete 'b':


Really minor issue: I believe this should read 18277 items :).


 = ['c', 'd', 'e', ..., 'zzz']

 I'm not familiar with pandas and am not sure about the exact syntax
 needed, but something like:

 new_df = []  # Assuming df is a list.
 for col in df:
     if col.value in keep_col:
         new_df.append(col)


Another way to write this, using a list comprehension (untested):
new_df = [col for col in df if col.value in keep_col]

Also note that, while the code shows keep_col is fairly short, you may also
see performance gains if keep_col is a set (O(1) lookup performance) rather
than a list (O(n) lookup performance). You would do this by using:

keep_col = set(('PATCODE', 'PATWT', 'VDAY', 'VMONTH', 'VYEAR', 'MED1', 'MED2',
'MED3', 'MED4', 'MED5'))

rather than your existing:

keep_col = ['PATCODE', 'PATWT', 'VDAY', 'VMONTH', 'VYEAR', 'MED1', 'MED2',
'MED3', 'MED4', 'MED5']


This can apply anywhere you use the in operator. Note, however, that
generating the set is a bit slower, so you'd want to make sure the set is
made outside of a large loop.
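
A quick way to see the difference (untested sketch; the sizes are arbitrary):

from timeit import timeit

setup = ("cols = ['col%d' % i for i in range(100)]; "
         "as_list = list(cols); as_set = set(cols)")
# Worst case for the list: the item is at the end, so the whole list is
# scanned before the match is found.
print(timeit("'col99' in as_list", setup=setup))  # O(n) linear scan
print(timeit("'col99' in as_set", setup=setup))   # O(1) hash lookup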

Chris
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: speed up pandas calculation

2014-07-30 Thread Steven D'Aprano
On Wed, 30 Jul 2014 18:57:15 -0600, Vincent Davis wrote:

 On Wed, Jul 30, 2014 at 6:28 PM, Vincent Davis
 vinc...@vincentdavis.net wrote:
 
  The real slow part seems to be
  for n in drugs:
      df[n] = df[['MED1','MED2','MED3','MED4','MED5']].isin([drugs[n]]).any(1)


  I was wrong; that part is fast. It was selecting the columns that was slow.
  Using
  keep_col = ['PATCODE', 'PATWT', 'VDAYR', 'VMONTH', 'MED1', 'MED2',
              'MED3', 'MED4', 'MED5']
  df = df[keep_col]

  took the time down from 19 sec to 2 sec.


19 seconds? I thought you said it was taking multiple minutes?




-- 
Steven 
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: speed up pandas calculation

2014-07-30 Thread Skip Montanaro
(Now that I'm on a real keyboard, more complete responses are a bit easier.)

Regarding the issue of missing columns from keep_col, you could create
sets of what you have and what you want, and toss the rest:

toss_these = list(set(df.columns) - set(keep_col))
df = df.drop(toss_these, axis=1)

Or something to that effect.

Skip
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: speed up pandas calculation

2014-07-30 Thread Skip Montanaro
On Wed, Jul 30, 2014 at 8:11 PM, Chris Kaynor ckay...@zindagigames.com wrote:
 Another way to write this, using a list comprehension (untested):
 new_df = [col for col in df if col.value in keep_col]

As I am learning (often painfully) with pandas and JavaScript+(d3 or
jQuery), loops are the enemy. You want to operate on large chunks of
data simultaneously. In pandas, those chunks are thinly disguised
numpy arrays. In JS+(d3 or jQuery), those chunks are selections from
the DOM.
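
A toy contrast of the two styles (untested; the size is arbitrary):

import numpy as np

a = np.arange(1000000)

# Loop version: one interpreted Python operation per element.
total = 0
for x in a:
    total += x * 2

# Vectorized version: a single numpy call over the whole chunk at once.
total = (a * 2).sum()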

I should have paid closer attention to the APL unit of my programming
languages survey class in college.

Skip
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: speed up pandas calculation

2014-07-30 Thread Rustom Mody
On Thursday, July 31, 2014 7:58:59 AM UTC+5:30, Skip Montanaro wrote:
 As I am learning (often painfully) with pandas and JavaScript+(d3 or
 jQuery), loops are the enemy. You want to operate on large chunks of
 data simultaneously. In pandas, those chunks are thinly disguised
 numpy arrays. In JS+(d3 or jQuery), those chunks are selections from
 the DOM.

 I should have paid closer attention to the APL unit of my programming
 languages survey class in college.

Much more within reach than you may imagine:
http://baruchel.hd.free.fr/apps/apl/

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: speed up pandas calculation

2014-07-30 Thread Vincent Davis
On Wed, Jul 30, 2014 at 5:57 PM, Skip Montanaro skip.montan...@gmail.com
wrote:

 [earlier quoted code snipped]

 Second, you loop over columns deciding one by one whether to keep or toss
 a column. Instead try

 df = df[keep_col]

 [rest snipped]


So df = df[keep_col] is not fast, but it is not that slow either. You made
me think of a solution to that part: just slice and copy. The only gotcha is
that the columns in keep_col have to actually exist:

keep_col = ['PATCODE', 'PATWT', 'VDAYR', 'VMONTH', 'MED1', 'MED2', 'MED3',
            'MED4', 'MED5']
df = df[keep_col]

The real slow part seems to be

for n in drugs:
    df[n] = df[['MED1','MED2','MED3','MED4','MED5']].isin([drugs[n]]).any(1)
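
One thing I may try (an untested sketch, nothing I've measured yet): slicing
the five MED columns once, outside the loop, instead of re-selecting them
for every drug:

meds = df[['MED1','MED2','MED3','MED4','MED5']]  # slice once
for n in drugs:
    # Same isin() test as before, minus rebuilding the slice on every pass.
    df[n] = meds.isin([drugs[n]]).any(1)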



Vincent Davis
720-301-3003
-- 
https://mail.python.org/mailman/listinfo/python-list