Re: Pandas cat.categories.isin list, is this a bug?

2018-05-17 Thread zljubisic
Hi Matt,

> (Including python-list again, for lack of a reason not to. This
> conversation is still relevant and appropriate for the general Python
> mailing list -- I just meant that the pydata list likely has many more
> Pandas users/experts, so you're more likely to get a better answer,
> faster, from a more specialized group.)

OK, for now we will stay here, but in the future I will use pydata as you
have suggested.

> Selecting all rows that have categories is a bit simpler than what you
> are doing -- your issue is that you are working with the *set of
> distinct categories*, and not the actual vector of categories
> corresponding to your data.

Yes, now I get it, thanks to your explanation.
df.CRM_assetID.cat.categories holds the unique categories of the CRM_assetID
column, not the per-row values.
Now I am using
df_cat[df_cat.CRM_assetID.isin({'V1254748', 'V805722', 'V1105400'})].shape
to select all rows that have the relevant categories.
Thanks for the tip about using a set instead of a list as well.
Everything works as it should now.
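
For anyone following along, the selection described above can be sketched on a toy frame. Only the column name `CRM_assetID` and the three IDs come from this thread; the other rows, the `value` column, and the ID `V999` are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the real frame (the real one has 1435952 rows).
df_cat = pd.DataFrame({
    "CRM_assetID": ["V1254748", "V805722", "V1254748",
                    "V999", "V1105400", "V805722"],
    "value": range(6),
})
df_cat["CRM_assetID"] = df_cat["CRM_assetID"].astype("category")

wanted = {"V1254748", "V805722", "V1105400"}

# Row-wise mask: one boolean per row of the frame.
mask = df_cat["CRM_assetID"].isin(wanted)
selected = df_cat[mask]

# Cross-check against the one-ID-at-a-time counts from the original question.
per_id_total = sum(int((df_cat["CRM_assetID"] == asset_id).sum())
                   for asset_id in wanted)

print(selected.shape)
print(per_id_total)
```

The number of selected rows should equal the sum of the per-ID counts, which is the 35 + 45 + 34 = 114 check from the original question, just on a smaller scale.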

Matt, you were more than helpful.
Thank you very very much.

Best regards.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Pandas cat.categories.isin list, is this a bug?

2018-05-15 Thread Matt Ruffalo
On 2018-05-15 06:23, Zoran Ljubišić wrote:
> Matt,
>
> thanks for the info about the pydata mailing group. I didn't know it existed.
> Because comp.lang.python is not the appropriate group for this question, I
> will continue our conversation on Gmail.
>
> I posted the result,
> len(df.CRM_assetID.cat.categories.isin(['V1254748', 'V805722', 'V1105400'])) = 55418,
> in my next message, after I noticed that this information was missing.
>
> If I want to select all rows whose category is in the list, how do I
> do that?
>
> Regards,
>
> Zoran
>

Hi Zoran-

(Including python-list again, for lack of a reason not to. This
conversation is still relevant and appropriate for the general Python
mailing list -- I just meant that the pydata list likely has many more
Pandas users/experts, so you're more likely to get a better answer,
faster, from a more specialized group.)

Selecting all rows that have categories is a bit simpler than what you
are doing -- your issue is that you are working with the *set of
distinct categories*, and not the actual vector of categories
corresponding to your data.

You can select items you're interested in with something like the following:

"""
In [1]: import pandas as pd

In [2]: s = pd.Series(['apple', 'banana', 'apple', 'pear', 'banana',
'cherry', 'pear', 'cherry']).astype('category')

In [3]: s
Out[3]:
0     apple
1    banana
2     apple
3      pear
4    banana
5    cherry
6      pear
7    cherry
dtype: category
Categories (4, object): [apple, banana, cherry, pear]

In [4]: s.isin({'apple', 'pear'})
Out[4]:
0     True
1    False
2     True
3     True
4    False
5    False
6     True
7    False
dtype: bool

In [5]: s.loc[s.isin({'apple', 'pear'})]
Out[5]:
0    apple
2    apple
3     pear
6     pear
dtype: category
Categories (4, object): [apple, banana, cherry, pear]
"""

(Note that I'm also passing a set to `isin` instead of a list -- this
doesn't matter when looking for two or three values, but if you're
passing 1000 values to `isin`, or 10_000, or 1_000_000, then linear-time
membership testing can start to become an issue.)
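
A rough way to check this on your own data is sketched below. The Series and the `fruit…` lookup values are made up for illustration. Both container types give the same mask; whether the set is actually faster is worth measuring rather than assuming, since `isin` may build its own lookup structure from whatever it is given.

```python
import timeit

import pandas as pd

# Hypothetical data: 4000 rows, three distinct categories.
s = pd.Series(["apple", "banana", "apple", "pear"] * 1000).astype("category")

# Hypothetical lookup values; most of them do not occur in the data.
wanted_list = ["apple", "pear"] + [f"fruit{i}" for i in range(10_000)]
wanted_set = set(wanted_list)

mask_from_list = s.isin(wanted_list)
mask_from_set = s.isin(wanted_set)

# The container type does not change the result.
same = mask_from_list.equals(mask_from_set)
print(same)

# Timings depend on the machine and the pandas version, so no expected
# numbers are given here.
t_list = timeit.timeit(lambda: s.isin(wanted_list), number=20)
t_set = timeit.timeit(lambda: s.isin(wanted_set), number=20)
print(f"list: {t_list:.4f}s  set: {t_set:.4f}s")
```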

What you were doing instead accesses the vector of the *unique categories* in that column, like so:

"""
In [6]: s.cat.categories
Out[6]: Index(['apple', 'banana', 'cherry', 'pear'], dtype='object')

In [7]: s.cat.categories.isin({'apple', 'pear'})
Out[7]: array([ True, False, False,  True])
"""

The vector `s.cat.categories` has one element for each distinct category
in your column, so the result of `isin` on it has that same length no
matter which values you pass. Your column apparently contains 55418
different categories.
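
As a small sketch of the difference, using the same fruit Series as above (the variable names are only for illustration): `len` of a boolean mask counts every element, while `sum` counts the `True` entries.

```python
import pandas as pd

s = pd.Series(["apple", "banana", "apple", "pear", "banana",
               "cherry", "pear", "cherry"]).astype("category")
wanted = {"apple", "pear"}

category_mask = s.cat.categories.isin(wanted)  # one entry per distinct category
row_mask = s.isin(wanted)                      # one entry per row

n_categories = len(category_mask)                 # number of distinct categories
n_rows = len(row_mask)                            # number of rows in the Series
n_matching_categories = int(category_mask.sum())  # categories found in `wanted`
n_matching_rows = int(row_mask.sum())             # rows whose value is in `wanted`

print(n_categories, n_rows, n_matching_categories, n_matching_rows)
```

Applied to the original question, the 114 rows would come from `sum` on the row-wise mask, not from `len` on the category mask.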

MMR...


Re: Pandas cat.categories.isin list, is this a bug?

2018-05-14 Thread Matt Ruffalo
On 2018-05-14 07:05, zljubi...@gmail.com wrote:
> Hi,
>
> I have a dataframe with a CRM_assetID column of category dtype:
>
> df.info()
>
> 
> RangeIndex: 1435952 entries, 0 to 1435951
> Data columns (total 75 columns):
> startTime    1435952 non-null object
> CRM_assetID  1435952 non-null category
>
> searching a dataframe for each of three categories:
>
> df[df.CRM_assetID == 'V1254748'].shape
> (35, 75)
> df[df.CRM_assetID == 'V805722'].shape
> (45, 75)
> df[df.CRM_assetID == 'V1105400'].shape
> (34, 75)
>
>
> len(df.CRM_assetID.cat.categories.isin(['V1254748', 'V805722', 'V1105400']))
>
> Why is this len not equal to 114 (35 + 45 + 34)?
>
> Regards.

Hello-

First, this is a general Python group; not everyone here is necessarily
an expert in or user of Pandas. In the future you might have more
success with the pydata mailing list/group.

When you say that `len(df.CRM_assetID.cat.categories.isin(['V1254748',
'V805722', 'V1105400']))` is not equal to 114, it would be helpful to
say what this length actually is.

Your usage of `df.CRM_assetID.cat.categories` refers to the *unique
categories in that column*, not the actual values in that column.
Presumably you have more categories in that column than the three you
are checking with `isin`, since you are checking the length of a boolean
vector that signifies whether each distinct category is in that list.

MMR...


Re: Pandas cat.categories.isin list, is this a bug?

2018-05-14 Thread zljubisic
On Monday, 14 May 2018 13:05:24 UTC+2, zlju...@gmail.com wrote:
> Hi,
> 
> I have a dataframe with a CRM_assetID column of category dtype:
> 
> df.info()
> 
> 
> RangeIndex: 1435952 entries, 0 to 1435951
> Data columns (total 75 columns):
> startTime    1435952 non-null object
> CRM_assetID  1435952 non-null category
> 
> searching a dataframe for each of three categories:
> 
> df[df.CRM_assetID == 'V1254748'].shape
> (35, 75)
> df[df.CRM_assetID == 'V805722'].shape
> (45, 75)
> df[df.CRM_assetID == 'V1105400'].shape
> (34, 75)
> 
> 
> len(df.CRM_assetID.cat.categories.isin(['V1254748', 'V805722', 'V1105400']))
> 
> Why is this len not equal to 114 (35 + 45 + 34)?
> 
> Regards.

I forgot to copy the result of:

len(df.CRM_assetID.cat.categories.isin(['V1254748', 'V805722', 'V1105400']))

which is 55418.


Pandas cat.categories.isin list, is this a bug?

2018-05-14 Thread zljubisic
Hi,

I have a dataframe with a CRM_assetID column of category dtype:

df.info()


RangeIndex: 1435952 entries, 0 to 1435951
Data columns (total 75 columns):
startTime    1435952 non-null object
CRM_assetID  1435952 non-null category

searching a dataframe for each of three categories:

df[df.CRM_assetID == 'V1254748'].shape
(35, 75)
df[df.CRM_assetID == 'V805722'].shape
(45, 75)
df[df.CRM_assetID == 'V1105400'].shape
(34, 75)


len(df.CRM_assetID.cat.categories.isin(['V1254748', 'V805722', 'V1105400']))

Why is this len not equal to 114 (35 + 45 + 34)?

Regards.