Re: StringIndexer with high cardinality huge data

2018-04-10 Thread Shahab Yunus
Thanks guys.

@Filipp Zhinkin
Yes, we might have a couple of string columns with 15 million+ unique
values that need to be mapped to indices.

@Nick Pentreath
We are on 2.0.2, though I will check it out. Is it better from a
hash-collision perspective, or can it also handle large volumes of data?

Regards,
Shahab



Re: StringIndexer with high cardinality huge data

2018-04-10 Thread Nick Pentreath
Also check out FeatureHasher in Spark 2.3.0, which is designed to handle
this use case in a more natural way than HashingTF (and handles multiple
columns at once).





Re: StringIndexer with high cardinality huge data

2018-04-10 Thread Filipp Zhinkin
Hi Shahab,

do you actually need a few columns with such a huge number of categories,
where the index depends on the original value's frequency?

If not, then you may use the value's hash code as the category, or combine
all columns into a single vector using HashingTF.
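For example, something along these lines (the column names and bucket
count are illustrative only):

import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, hash, lit, pmod}

val spark = SparkSession.builder().appName("HashingSketch").getOrCreate()
import spark.implicits._

val df = Seq(("user_0001", "item_9876"), ("user_0002", "item_1234"))
  .toDF("userId", "itemId")

val numBuckets = 1 << 20

// Option 1: replace each value by a bounded hash bucket; no fitted model needed.
val hashedCols = df
  .withColumn("userIdx", pmod(hash($"userId"), lit(numBuckets)))
  .withColumn("itemIdx", pmod(hash($"itemId"), lit(numBuckets)))

// Option 2: pack the string columns into one array column and let HashingTF
// turn it into a single sparse term-frequency vector.
val tf = new HashingTF()
  .setInputCol("terms")
  .setOutputCol("features")
  .setNumFeatures(numBuckets)

val combined = tf.transform(df.withColumn("terms", array($"userId", $"itemId")))
combined.show(false)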

Regards,
Filipp.




StringIndexer with high cardinality huge data

2018-04-10 Thread Shahab Yunus
Does the StringIndexer keep all of the mapped label-to-index values in the
memory of the driver machine? It seems to, unless I am missing something.

What if the data that needs to be indexed is huge, the columns to be indexed
are high cardinality (i.e. have lots of categories), and more than one such
column needs to be indexed? Then the mappings wouldn't fit in memory.
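To make the concern concrete, here is a rough sketch of the usage I have in
mind (the column name and data are just placeholders):

import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("StringIndexerSketch").getOrCreate()
import spark.implicits._

// Placeholder high-cardinality column; imagine 15M+ distinct values here.
val df = Seq("a", "b", "c", "a", "b", "a").toDF("category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")

// fit() collects the distinct labels (ordered by frequency) into the model;
// that labels array is what would have to fit in driver memory.
val model = indexer.fit(df)
println(s"labels held by the model: ${model.labels.length}")

model.transform(df).show()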

Thanks.

Regards,
Shahab