Re: StringIndexer with high cardinality huge data

Filipp Zhinkin Tue, 10 Apr 2018 07:00:51 -0700

Hi Shahab,

Do you actually need several columns with that many categories, where each
category's index depends on the frequency of the original value?


If not, you could use each value's hash code as its category, or combine all
the columns into a single vector using HashingTF.
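
A minimal sketch of the hashing idea, in plain Python rather than Spark, to
show why no label-to-index table has to be held in memory. The bucket count,
the function names, and the use of crc32 are all assumptions made for
illustration; Spark's HashingTF uses its own hash function and API.

```python
import zlib

# Hypothetical bucket count: larger means fewer collisions but wider vectors.
NUM_BUCKETS = 1 << 20


def hash_category(value: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a string to a bucket index without storing any lookup table."""
    # crc32 is deterministic across processes, unlike Python's built-in
    # hash() for strings, which is randomized per interpreter run.
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def hash_row(row: dict, num_buckets: int = NUM_BUCKETS) -> dict:
    """Combine several categorical columns into one sparse {index: count} vector."""
    vec = {}
    for column, value in row.items():
        # Prefix with the column name so the same value appearing in two
        # columns usually lands in different buckets.
        idx = hash_category(f"{column}={value}", num_buckets)
        vec[idx] = vec.get(idx, 0) + 1
    return vec
```

The trade-off is that indices no longer reflect value frequency and distinct
values can collide in the same bucket, so the bucket count should be chosen
with the expected cardinality in mind.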

Regards,
Filipp.

On Tue, Apr 10, 2018 at 4:01 PM, Shahab Yunus <[email protected]> wrote:
> Does the StringIndexer keep all of the label-to-index mappings in the memory
> of the driver machine? It seems so, unless I am missing something.
>
> What if the data to be indexed is huge, the columns to be indexed have high
> cardinality (lots of categories), and more than one such column needs to be
> indexed? In that case the mappings would not fit in memory.
>
> Thanks.
>
> Regards,
> Shahab

---------------------------------------------------------------------
To unsubscribe e-mail: [email protected]
