speed up pandas calculation
I know this is a general Python list and I am asking about pandas, but this question is probably not a great fit for Stack Overflow. I have a list of files (~80 files, ~30,000 rows) I need to process, and with my current code it takes minutes for each file. Any suggestions for a faster way? I am trying to stick with pandas for educational purposes. Any suggestions would be great. If you are curious, you can find the data file I am using here:
http://www.nber.org/nhamcs/data/nhamcsopd2010.csv

    drugs_current = {'CITALOPRAM': 4332, 'ESCITALOPRAM': 4812, 'FLUOXETINE': 236,
                     'FLUVOXAMINE': 3804, 'PAROXETINE': 3157, 'SERTRALINE': 880,
                     'METHYLPHENIDATE': 900, 'DEXMETHYLPHENIDATE': 4777,
                     'AMPHETAMINE-DEXTROAMPHETAMINE': 4035, 'DEXTROAMPHETAMINE': 804,
                     'LISDEXAMFETAMINE': 6663, 'METHAMPHETAMINE': 805,
                     'ATOMOXETINE': 4827, 'CLONIDINE': 44, 'GUANFACINE': 717}

    drugs_98_05 = {'SERTRALINE': 56635, 'CITALOPRAM': 59829, 'FLUOXETINE': 80006,
                   'PAROXETINE_HCL': 57150, 'FLUVOXAMINE': 57064, 'ESCITALOPRAM': 70466,
                   'DEXMETHYLPHENIDATE': 70427, 'METHYLPHENIDATE': 70374,
                   'METHAMPHETAMINE': 53485, 'AMPHETAMINE1': 70257, 'AMPHETAMINE2': 70258,
                   'AMPHETAMINE3': 50265, 'DEXTROAMPHETAMINE1': 70259,
                   'DEXTROAMPHETAMINE2': 70260, 'DEXTROAMPHETAMINE3': 51665,
                   'COMBINATION_PRODUCT': 51380, 'FIXED_COMBINATION': 51381,
                   'ATOMOXETINE': 70687, 'CLONIDINE1': 51275, 'CLONIDINE2': 70357,
                   'GUANFACINE': 52498}

    df = pd.read_csv('nhamcsopd2010.csv', index_col='PATCODE', low_memory=False)
    col_init = list(df.columns.values)
    keep_col = ['PATCODE', 'PATWT', 'VDAY', 'VMONTH', 'VYEAR',
                'MED1', 'MED2', 'MED3', 'MED4', 'MED5']
    for col in col_init:
        if col not in keep_col:
            del df[col]

    if f[-3:] == 'csv' and f[-6:-4] in ('93', '94', '95', '96', '97', '98', '99',
                                        '00', '91', '02', '03', '04', '05'):
        drugs = drugs_98_05
    elif f[-3:] == 'csv' and f[-6:-4] in ('06', '08', '09', '10'):
        drugs = drugs_current

    for n in drugs:
        df[n] = df[['MED1', 'MED2', 'MED3', 'MED4', 'MED5']].isin([drugs[n]]).any(1)

Vincent Davis
720-301-3003
--
https://mail.python.org/mailman/listinfo/python-list
Re: speed up pandas calculation
On 31/07/2014 00:04, Vincent Davis wrote:
> I know this is a general Python list and I am asking about pandas, but
> this question is probably not a great fit for Stack Overflow. I have a
> list of files (~80 files, ~30,000 rows) I need to process, and with my
> current code it takes minutes for each file. Any suggestions for a
> faster way? I am trying to stick with pandas for educational purposes.
> [code snipped; see the original post]

I suggest you ask here https://mail.python.org/mailman/listinfo/pandas-dev which I believe is also gmane.comp.python.pydata.

--
My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language.

Mark Lawrence
Re: speed up pandas calculation
> df = pd.read_csv('nhamcsopd2010.csv', index_col='PATCODE', low_memory=False)
> col_init = list(df.columns.values)
> keep_col = ['PATCODE', 'PATWT', 'VDAY', 'VMONTH', 'VYEAR',
>             'MED1', 'MED2', 'MED3', 'MED4', 'MED5']
> for col in col_init:
>     if col not in keep_col:
>         del df[col]

I'm no pandas expert, but a couple of things come to mind.

First, where is your code slow? Profile it, even with a few well-placed prints. If it's in read_csv there might be little you can do, unless you load those data repeatedly and can save a pickled data frame as a caching measure.

Second, you loop over columns, deciding one by one whether to keep or toss each column. Instead, try:

    df = df[keep_col]

Third, if deleting those other columns is costly, can you perhaps just ignore them?

Can't be more investigative right now. I don't have pandas on Android. :-)

Skip
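[Editor's note] Skip's "profile it" advice needs nothing beyond the standard library. Below is a minimal, hypothetical sketch of stage-by-stage timing: the stand-in lambdas replace the real read_csv / column-selection / flagging steps, and `time.perf_counter` brackets each stage. The stage names and data are invented for illustration.

```python
import time

def profile_stages(stages):
    """Run each (name, callable) pair and record its wall-clock duration."""
    timings = {}
    for name, func in stages:
        start = time.perf_counter()
        func()
        timings[name] = time.perf_counter() - start
    return timings

# Hypothetical stand-ins for the real pipeline stages.
rows = [{'MED1': 4332, 'MED2': 0} for _ in range(10_000)]

timings = profile_stages([
    ('load',   lambda: list(rows)),                                    # stands in for pd.read_csv
    ('select', lambda: [{k: r[k] for k in ('MED1',)} for r in rows]),  # column selection
    ('flag',   lambda: [4332 in r.values() for r in rows]),            # the isin/any step
])

slowest = max(timings, key=timings.get)
print(sorted(timings))  # ['flag', 'load', 'select']
```

Once the slowest stage is known, effort goes there instead of being spread over code that was already fast.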
Re: speed up pandas calculation
On Wed, Jul 30, 2014 at 6:28 PM, Vincent Davis vinc...@vincentdavis.net wrote:
> The real slow part seems to be
>
>     for n in drugs:
>         df[n] = df[['MED1','MED2','MED3','MED4','MED5']].isin([drugs[n]]).any(1)

I was wrong: this is fast; it was selecting the columns that was slow. Using

    keep_col = ['PATCODE', 'PATWT', 'VDAYR', 'VMONTH',
                'MED1', 'MED2', 'MED3', 'MED4', 'MED5']
    df = df[keep_col]

took the time down from 19 sec to 2 sec.

Vincent Davis
720-301-3003
Re: speed up pandas calculation
On Wed, 30 Jul 2014 17:04:04 -0600, Vincent Davis wrote:

> I know this is a general Python list and I am asking about pandas, but
> this question is probably not a great fit for Stack Overflow. I have a
> list of files (~80 files, ~30,000 rows) I need to process, and with my
> current code it takes minutes for each file.

Hmmm, is that 30,000 rows per file, or 30,000 files in total? Not that it really matters; I shouldn't expect it makes much difference either way.

> Any suggestions for a faster way? I am trying to stick with pandas for
> educational purposes. Any suggestions would be great. If you are
> curious, you can find the data file I am using here:
> http://www.nber.org/nhamcs/data/nhamcsopd2010.csv

For brevity, I've trimmed back the dictionaries to something smaller. That's just for ease of reading.

> drugs_current = {'CITALOPRAM': 4332, 'ESCITALOPRAM': 4812,
>     [...]
>     'CLONIDINE': 44, 'GUANFACINE': 717}
>
> drugs_98_05 = {'SERTRALINE': 56635, 'CITALOPRAM': 59829,
>     [...]
>     'CLONIDINE2': 70357, 'GUANFACINE': 52498}
>
> df = pd.read_csv('nhamcsopd2010.csv', index_col='PATCODE', low_memory=False)
> col_init = list(df.columns.values)
> keep_col = ['PATCODE', 'PATWT', 'VDAY', 'VMONTH', 'VYEAR',
>             'MED1', 'MED2', 'MED3', 'MED4', 'MED5']
> for col in col_init:
>     if col not in keep_col:
>         del df[col]

I expect that this could be your problem here. Deleting from the start or middle of lists is slow, and df may be a list, or at least have list-like performance. Suppose you have a list like this:

    ['a', 'b', 'c', 'd', 'e', ..., 'zzz']

that is, a total of 26 + 26**2 + 26**3 = 18278 items. Now suppose you delete item 0, 'a':

    => ['b', 'c', 'd', 'e', ..., 'zzz']

Python has to move the remaining 18278 items across one space. Then you delete 'b':

    => ['c', 'd', 'e', ..., 'zzz']

Python has to move the remaining 18276 items across one space, making a total of 36559 moves. And that's just to delete two items.
Roughly speaking, if you end up deleting N items from a list, starting from the front, Python may have to move as many as N**2 items into their final positions. If you have 5 or 10 columns, that's not too bad, but if you have (say) 80 columns and delete 70 of them, that could be *horribly* expensive.

If you must *repeatedly* use del on lists, it's best to ensure you're deleting from the end, not the start. But even better, and this applies to anything, not just lists, is not to delete at all: create a new list, copying the columns you want, rather than deleting the columns you don't want. I'm not familiar with pandas and am not sure about the exact syntax needed, but something like:

    new_df = []  # Assuming df is a list.
    for col in df:
        if col.value in keep_col:
            new_df.append(col)

> if f[-3:] == 'csv' and f[-6:-4] in ('93', '94', '95', '96', '97', '98',
>                                     '99', '00', '91', '02', '03', '04', '05'):

Where does f come from? You haven't shown the definition of that.

--
Steven
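[Editor's note] Steven's point can be demonstrated with plain lists, independently of pandas. The two hypothetical helpers below produce identical results, but the first deletes in place (every `del` shifts the whole tail of the list left by one slot), while the second copies only the wanted elements once:

```python
def filter_by_delete(items, keep):
    """O(n**2) worst case: each del from the front/middle shifts the tail."""
    items = list(items)
    i = 0
    while i < len(items):
        if items[i] not in keep:
            del items[i]  # every element after index i moves left one slot
        else:
            i += 1
    return items

def filter_by_rebuild(items, keep):
    """O(n): copy only the wanted elements into a new list."""
    return [x for x in items if x in keep]

cols = ['PATCODE', 'JUNK1', 'PATWT', 'JUNK2', 'MED1']
keep = {'PATCODE', 'PATWT', 'MED1'}
print(filter_by_delete(cols, keep))   # ['PATCODE', 'PATWT', 'MED1']
print(filter_by_rebuild(cols, keep))  # ['PATCODE', 'PATWT', 'MED1']
```

With a handful of columns the difference is invisible; with tens of thousands of elements, the rebuild version wins decisively.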
Re: speed up pandas calculation
On Wed, Jul 30, 2014 at 5:57 PM, Vincent Davis vinc...@vincentdavis.net wrote:
> On Wed, Jul 30, 2014 at 6:28 PM, Vincent Davis vinc...@vincentdavis.net wrote:
>> The real slow part seems to be
>>
>>     for n in drugs:
>>         df[n] = df[['MED1','MED2','MED3','MED4','MED5']].isin([drugs[n]]).any(1)
>
> I was wrong: this is fast; it was selecting the columns that was slow.

And that shows why profiling is important - before attempting to optimize :).

> Using
>
>     keep_col = ['PATCODE', 'PATWT', 'VDAYR', 'VMONTH',
>                 'MED1', 'MED2', 'MED3', 'MED4', 'MED5']
>     df = df[keep_col]
>
> took the time down from 19 sec to 2 sec.

On Wed, Jul 30, 2014 at 5:57 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote:
>     ['a', 'b', 'c', 'd', 'e', ..., 'zzz']
>
> that is, a total of 26 + 26**2 + 26**3 = 18278 items. Now suppose you
> delete item 0, 'a':
>
>     => ['b', 'c', 'd', 'e', ..., 'zzz']
>
> Python has to move the remaining 18278 items across one space.

Really minor issue: I believe this should read 18277 items :).

> I'm not familiar with pandas and am not sure about the exact syntax
> needed, but something like:
>
>     new_df = []  # Assuming df is a list.
>     for col in df:
>         if col.value in keep_col:
>             new_df.append(col)

Another way to write this, using a list comprehension (untested):

    new_df = [col for col in df if col.value in keep_col]

Also note that, while the code shows keep_col is fairly short, you may also see performance gains if keep_col is a set (O(1) lookup performance) rather than a list (O(n) lookup performance). You would do this by using:

    keep_col = set(('PATCODE', 'PATWT', 'VDAY', 'VMONTH', 'VYEAR',
                    'MED1', 'MED2', 'MED3', 'MED4', 'MED5'))

rather than your existing:

    keep_col = ['PATCODE', 'PATWT', 'VDAY', 'VMONTH', 'VYEAR',
                'MED1', 'MED2', 'MED3', 'MED4', 'MED5']

This can apply anywhere you use the in operator. Note, however, that generating the set is a bit slower, so you'd want to make sure the set is made outside of any large loop.

Chris
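[Editor's note] Chris's list-vs-set suggestion is easy to measure with the stdlib `timeit` module. This micro-benchmark is a sketch, not a rigorous benchmark: it simply contrasts `in` on the list against `in` on a set built from the same items, confirming along the way that both give the same answers.

```python
import timeit

keep_list = ['PATCODE', 'PATWT', 'VDAY', 'VMONTH', 'VYEAR',
             'MED1', 'MED2', 'MED3', 'MED4', 'MED5']
keep_set = set(keep_list)  # O(1) average-case membership tests

# Membership answers are identical; only the lookup cost differs.
assert ('MED3' in keep_list) == ('MED3' in keep_set)
assert ('BOGUS' in keep_list) == ('BOGUS' in keep_set)

# 'MED5' is near the end of the list, so the list scan does the most work.
list_time = timeit.timeit("'MED5' in keep_list", globals=globals(), number=100_000)
set_time = timeit.timeit("'MED5' in keep_set", globals=globals(), number=100_000)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")  # set is typically faster
```

For a ten-element collection the absolute savings are tiny; the set pays off when the membership test sits inside a loop over thousands of rows.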
Re: speed up pandas calculation
On Wed, 30 Jul 2014 18:57:15 -0600, Vincent Davis wrote:

> I was wrong: this is fast; it was selecting the columns that was slow.
> Using
>
>     keep_col = ['PATCODE', 'PATWT', 'VDAYR', 'VMONTH',
>                 'MED1', 'MED2', 'MED3', 'MED4', 'MED5']
>     df = df[keep_col]
>
> took the time down from 19 sec to 2 sec.

19 seconds? I thought you said it was taking multiple minutes?

--
Steven
Re: speed up pandas calculation
(Now that I'm on a real keyboard, more complete responses are a bit easier.)

Regarding the issue of missing columns from keep_col, you could create sets of what you have and what you want, and toss the rest:

    toss_these = list(set(df.columns) - set(keep_col))
    del df[toss_these]

Or something to that effect.

Skip
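[Editor's note] One caveat on the sketch above: in pandas, `del df[...]` expects a single column label, so the list form would need something like `df = df.drop(toss_these, axis=1)` instead (`DataFrame.drop` does accept a list of labels). The set arithmetic itself works on any collection of column names; here is a stdlib sketch with a plain dict of hypothetical columns standing in for the DataFrame:

```python
# A dict of columns standing in for the DataFrame (hypothetical data).
df = {'PATCODE': [1, 2], 'PATWT': [10, 20], 'JUNK': [0, 0]}
keep_col = ['PATCODE', 'PATWT']

# Columns that are present but not wanted, regardless of what else exists.
toss_these = set(df) - set(keep_col)
for col in toss_these:
    del df[col]  # deleting dict keys one at a time is cheap

print(sorted(df))  # ['PATCODE', 'PATWT']
```

The advantage over indexing with `keep_col` directly is that this never asks for a column that does not exist, so it cannot raise a KeyError when a file is missing one of the expected columns.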
Re: speed up pandas calculation
On Wed, Jul 30, 2014 at 8:11 PM, Chris Kaynor ckay...@zindagigames.com wrote:
> Another way to write this, using a list comprehension (untested):
>
>     new_df = [col for col in df if col.value in keep_col]

As I am learning (often painfully) with pandas and JavaScript+(d3 or jQuery), loops are the enemy. You want to operate on large chunks of data simultaneously. In pandas, those chunks are thinly disguised numpy arrays. In JS+(d3 or jQuery), those chunks are selections from the DOM. I should have paid closer attention to the APL unit of my programming-languages survey class in college.

Skip
Re: speed up pandas calculation
On Thursday, July 31, 2014 7:58:59 AM UTC+5:30, Skip Montanaro wrote:
> I should have paid closer attention to the APL unit of my
> programming-languages survey class in college.

Much more within reach than you may imagine:
http://baruchel.hd.free.fr/apps/apl/
Re: speed up pandas calculation
On Wed, Jul 30, 2014 at 5:57 PM, Skip Montanaro skip.montan...@gmail.com wrote:
> I'm no pandas expert, but a couple of things come to mind. First, where
> is your code slow? [...] Second, you loop over columns, deciding one by
> one whether to keep or toss each column. Instead, try:
>
>     df = df[keep_col]
>
> Third, if deleting those other columns is costly, can you perhaps just
> ignore them?

So df = df[keep_col] is not fast, but it is not that slow. You made me think of a solution to that part: just slice and copy. The only gotcha is that the columns in keep_col have to actually exist:

    keep_col = ['PATCODE', 'PATWT', 'VDAYR', 'VMONTH',
                'MED1', 'MED2', 'MED3', 'MED4', 'MED5']
    df = df[keep_col]

The real slow part seems to be:

    for n in drugs:
        df[n] = df[['MED1','MED2','MED3','MED4','MED5']].isin([drugs[n]]).any(1)

Vincent Davis
720-301-3003
--
https://mail.python.org/mailman/listinfo/python-list
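[Editor's note] For readers without pandas at hand, the flagging step above can be mimicked with plain dicts: for each drug, test whether its code appears in any of a row's MED fields. This is a hypothetical stand-in with invented rows, not the vectorized pandas call, but it computes the same boolean columns.

```python
# Hypothetical rows standing in for the MED1..MED5 columns of the frame.
rows = [
    {'MED1': 4332, 'MED2': 0, 'MED3': 0,   'MED4': 0, 'MED5': 0},
    {'MED1': 0,    'MED2': 0, 'MED3': 880, 'MED4': 0, 'MED5': 0},
    {'MED1': 0,    'MED2': 0, 'MED3': 0,   'MED4': 0, 'MED5': 0},
]
drugs = {'CITALOPRAM': 4332, 'SERTRALINE': 880}
med_cols = ['MED1', 'MED2', 'MED3', 'MED4', 'MED5']

# Equivalent of: df[n] = df[med_cols].isin([drugs[n]]).any(1)
flags = {
    name: [any(row[c] == code for c in med_cols) for row in rows]
    for name, code in drugs.items()
}
print(flags['CITALOPRAM'])  # [True, False, False]
print(flags['SERTRALINE'])  # [False, True, False]
```

In pandas the same work is pushed down into numpy, which is why the vectorized version turned out to be the fast part of the script.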