Re: what is happening in panda "where" clause

2017-09-23 Thread Pavol Lisy
On 9/22/17, Peter Otten <__pete...@web.de> wrote:
> Exposito, Pedro (RIS-MDW) wrote:
>
>> This code does a "where" clause on a panda data frame...
>>
>> Code:
>> import pandas as pd;
>> col_names = ['Name', 'Age', 'Weight', "Education"];
>> # create panda dataframe
>> x = pd.read_csv('test.dat', sep='|', header=None, names = col_names);
>> # apply "where" condition
>> z = x[ (x['Age'] == 55) ]
>> # prints row WHERE age == 55
>> print (z);
>>
>> What is happening in this statement:
>> z = x[ (x['Age'] == 55) ]
>>
>> Thanks,
>
> Let's take it apart into individual steps:
>
> Make up example data:
>
 import pandas as pd
 x = pd.DataFrame([["Jim", 44], ["Sue", 55], ["Alice", 66]],
> columns=["Name", "Age"])
 x
> Name  Age
> 0Jim   44
> 1Sue   55
> 2  Alice   66
>
> Have a look at the inner expression:
>
 x["Age"] == 55
> 0False
> 1 True
> 2False
>
> So this is a basically vector of boolean values. If you want more details:
> in numpy operations involving a a scalar and an array work via
> "broadcasting". In pure Python you would write something similar as
>
 [v == 55 for v in x["Age"]]
> [False, True, False]
>
> Use the result as an index:
>
 x[[False, True, True]]
> Name  Age
> 1Sue   55
> 2  Alice   66
>
> [2 rows x 2 columns]
>
> This is again in line with numpy arrays -- if you pass an array of boolean
> values as an index the values in the True positions are selected. In pure
> Python you could achieve that with
>
 index = [v == 55 for v in x["Age"]]
 index
> [False, True, False]
 [v for b, v in zip(index, x["Age"]) if b]
> [55]

You could also write:

>>> x[index]
x[index]
  Name  Age
1  Sue   55

As Peter told, understanding numpy or "numpy like" behavior behind
curtain could be key to understanding what is happening.

Look at this:

>>> def half_age(a):
>>> return a/2

we could use this function in where() ->

>>> x[half_age(x.Age)<30]
  Name  Age
0  Jim   44
1  Sue   55

You could think that any function could be used, but it is not true.
a/2 (where a is numpy array) is defined. (I am not sure that it is
really numpy under the hood or if it will be in future pandas versions
but it seems so).

But what if we have more universal function?

>>> def first_char(a):
>>> return a[0].lower()

>>> x[first_char(x.Name)=='a']
... ERROR ... (first_char could not work with pandas serie as a
argument. It means pandas Series doesn't have lower() function)

But you could create index like Peter described. It could be simpler
to remember if you are using pandas not often:

>>> x[[first_char(i)=='a' for i in x.Name]]
Name  Age
2  Alice   66

It is not (IMHO) best solution because you are creating list in memory
(which could be big) . Unfortunately my version of pandas (0.20.3)
doesn't support generator

>>> x[(first_char(i)=='a' for i in x.Name)]
... ERROR...

But you could apply more complex function to Series!

>>> x[x.Name.apply(first_char)=='a']
Name  Age
2  Alice   66

x.Name.apply(first_char) is Series where first_char is applied on each
value. And serie (and numpy.array) understand == operator (and return
value is Series. Which is acceptable as where function parameter)

Although I am not sure that actual version of pandas doesn't create
whole Series in memory :) I expect it is more optimized (and could be
even more in future).

Be aware that I am not expert and I am using pandas only very seldom
just for my little statistics!
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: what is happening in panda "where" clause

2017-09-22 Thread Peter Otten
Exposito, Pedro (RIS-MDW) wrote:

> This code does a "where" clause on a panda data frame...
> 
> Code:
> import pandas as pd;
> col_names = ['Name', 'Age', 'Weight', "Education"];
> # create panda dataframe
> x = pd.read_csv('test.dat', sep='|', header=None, names = col_names);
> # apply "where" condition
> z = x[ (x['Age'] == 55) ]
> # prints row WHERE age == 55
> print (z);
> 
> What is happening in this statement:
> z = x[ (x['Age'] == 55) ]
> 
> Thanks,

Let's take it apart into individual steps:

Make up example data:

>>> import pandas as pd
>>> x = pd.DataFrame([["Jim", 44], ["Sue", 55], ["Alice", 66]], 
columns=["Name", "Age"])
>>> x
Name  Age
0Jim   44
1Sue   55
2  Alice   66

Have a look at the inner expression:

>>> x["Age"] == 55
0False
1 True
2False

So this is a basically vector of boolean values. If you want more details: 
in numpy operations involving a a scalar and an array work via 
"broadcasting". In pure Python you would write something similar as

>>> [v == 55 for v in x["Age"]]
[False, True, False]

Use the result as an index:

>>> x[[False, True, True]]
Name  Age
1Sue   55
2  Alice   66

[2 rows x 2 columns]

This is again in line with numpy arrays -- if you pass an array of boolean 
values as an index the values in the True positions are selected. In pure 
Python you could achieve that with

>>> index = [v == 55 for v in x["Age"]]
>>> index
[False, True, False]
>>> [v for b, v in zip(index, x["Age"]) if b]
[55]


-- 
https://mail.python.org/mailman/listinfo/python-list


what is happening in panda "where" clause

2017-09-22 Thread Exposito, Pedro (RIS-MDW)
This code does a "where" clause on a panda data frame...

Code:
import pandas as pd;
col_names = ['Name', 'Age', 'Weight', "Education"];
# create panda dataframe
x = pd.read_csv('test.dat', sep='|', header=None, names = col_names);
# apply "where" condition
z = x[ (x['Age'] == 55) ]
# prints row WHERE age == 55
print (z);

What is happening in this statement:
z = x[ (x['Age'] == 55) ]

Thanks,
Pedro Exposito




 The information contained in this 
e-mail message is intended only for the personal and confidential use of the 
recipient(s) named above. This message may be an attorney-client communication 
and/or work product and as such is privileged and confidential. If the reader 
of this message is not the intended recipient or an agent responsible for 
delivering it to the intended recipient, you are hereby notified that you have 
received this document in error and that any review, dissemination, 
distribution, or copying of this message is strictly prohibited. If you have 
received this communication in error, please notify us immediately by e-mail, 
and delete the original message.  
-- 
https://mail.python.org/mailman/listinfo/python-list