Hi Shahab,

Do you actually need several columns with such a huge number of
categories, where each index depends on the original value's frequency?

If not, you could use each value's hash code as its category, or combine
all the columns into a single vector using HashingTF.
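
To make the hashing idea concrete, here is a minimal pure-Python sketch of it,
not Spark code. The names hash_category, hash_row and NUM_BUCKETS are made up
for illustration, and zlib.crc32 stands in for the MurmurHash3 function that
HashingTF uses, so the bucket indices will not match Spark's. The point is
only that no label-to-index dictionary is ever built or held in memory.

```python
import zlib

# Hypothetical bucket count: larger means fewer collisions but wider vectors.
NUM_BUCKETS = 1 << 18


def hash_category(column: str, value: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a (column, value) pair to a bucket index with no stored dictionary."""
    # Prefixing the column name keeps equal strings in different columns apart.
    key = f"{column}={value}".encode("utf-8")
    # crc32 is an assumption made for this sketch; it is stable across runs.
    return zlib.crc32(key) % num_buckets


def hash_row(row: dict, num_buckets: int = NUM_BUCKETS) -> dict:
    """Combine all categorical columns of one row into a sparse {index: count} vector."""
    vec = {}
    for column, value in row.items():
        idx = hash_category(column, str(value), num_buckets)
        # Colliding values simply add up in the same bucket.
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec


if __name__ == "__main__":
    print(hash_row({"city": "Berlin", "device": "mobile"}, num_buckets=1024))
```

The trade-off is that different values can collide in the same bucket and the
indices no longer reflect frequency, which is why this only fits if you do not
need the frequency-based ordering.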

Regards,
Filipp.

On Tue, Apr 10, 2018 at 4:01 PM, Shahab Yunus <shahab.yu...@gmail.com> wrote:
> Does StringIndexer keep all the label-to-index mappings in the memory of
> the driver machine? It seems to, unless I am missing something.
>
> What if the data to be indexed is huge, the columns to be indexed are
> high cardinality (that is, have lots of categories), and more than one
> such column needs to be indexed? The mappings then wouldn't fit in memory.
>
> Thanks.
>
> Regards,
> Shahab

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
