Also check out FeatureHasher, added in Spark 2.3.0, which is designed to handle
this use case more naturally than HashingTF (and handles multiple
columns at once).
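The idea behind FeatureHasher and HashingTF is the hashing trick: each categorical value is mapped to a fixed-size index by a hash function, so no label-to-index table ever needs to fit in driver memory. A minimal plain-Python sketch of that idea (illustrative only; Spark's implementation uses MurmurHash3, not CRC32, and produces sparse ML vectors):

```python
import zlib

# Fixed output dimension, chosen up front; collisions are accepted
# as the price of never materializing a label-to-index mapping.
NUM_FEATURES = 2 ** 18

def hash_index(column: str, value: str, num_features: int = NUM_FEATURES) -> int:
    # Combine column name and value so the same value appearing in
    # different columns (usually) lands in different buckets.
    key = f"{column}={value}".encode("utf-8")
    return zlib.crc32(key) % num_features

def hash_features(row: dict, num_features: int = NUM_FEATURES) -> dict:
    # Sparse vector as {index: count}; colliding features accumulate.
    vec = {}
    for col, val in row.items():
        idx = hash_index(col, str(val), num_features)
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec
```

The key property is that `hash_index` is a pure function of the input, so it needs no fitted state and works for arbitrarily high-cardinality columns, unlike StringIndexer's fitted label array.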

On Tue, 10 Apr 2018 at 16:00, Filipp Zhinkin <filipp.zhin...@gmail.com>
wrote:

> Hi Shahab,
>
> do you actually need several columns with such a huge number of
> categories, where the index depends on the original value's frequency?
>
> If not, then you may use the value's hash code as a category, or combine all
> columns into a single vector using HashingTF.
>
> Regards,
> Filipp.
>
> On Tue, Apr 10, 2018 at 4:01 PM, Shahab Yunus <shahab.yu...@gmail.com>
> wrote:
> > Does StringIndexer keep all of the mapped label-to-index pairs in the memory
> of
> > the driver machine? It seems to, unless I am missing something.
> >
> > What if the data that needs to be indexed is huge, the columns to be
> indexed
> > are high cardinality (i.e. have lots of categories), and more than one such
> > column needs to be indexed? The mapping wouldn't fit in memory.
> >
> > Thanks.
> >
> > Regards,
> > Shahab
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>