Also check out FeatureHasher in Spark 2.3.0, which is designed to handle this use case more naturally than HashingTF (and handles multiple columns at once).
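The idea behind FeatureHasher and HashingTF is the hashing trick: each (column, value) pair is hashed directly into a fixed-size index space, so no per-category mapping ever has to be built or held on the driver. Here is a minimal pure-Python sketch of that idea; the hash function and `num_features` are illustrative assumptions (Spark actually uses MurmurHash3), not Spark's implementation.

```python
import hashlib

def hash_features(row, num_features=16):
    """Map a dict of column -> categorical value into a sparse vector
    represented as {index: count}, using only a hash function.
    No fitted state, so memory use does not grow with cardinality."""
    vec = {}
    for col, value in row.items():
        # Hash the (column, value) pair to a bucket in [0, num_features).
        key = f"{col}={value}".encode("utf-8")
        idx = int(hashlib.md5(key).hexdigest(), 16) % num_features
        vec[idx] = vec.get(idx, 0) + 1
    return vec

row = {"country": "US", "device": "mobile", "browser": "firefox"}
print(hash_features(row))  # at most three non-zero entries (fewer if buckets collide)
```

The trade-off is that distinct categories can collide into the same bucket, which is why `num_features` is usually set much larger than in this toy sketch.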
On Tue, 10 Apr 2018 at 16:00, Filipp Zhinkin <filipp.zhin...@gmail.com> wrote:
> Hi Shahab,
>
> do you actually need to have a few columns with such a huge number of
> categories, where each category's index depends on the original value's
> frequency?
>
> If not, then you could use the value's hash code as the category, or
> combine all columns into a single vector using HashingTF.
>
> Regards,
> Filipp.
>
> On Tue, Apr 10, 2018 at 4:01 PM, Shahab Yunus <shahab.yu...@gmail.com> wrote:
> > Does the StringIndexer keep all of the mapped label-to-index pairs in
> > the memory of the driver machine? It seems to, unless I am missing
> > something.
> >
> > What if the data that needs to be indexed is huge, the columns to be
> > indexed have high cardinality (lots of categories), and more than one
> > such column needs to be indexed? The mapping wouldn't fit in memory.
> >
> > Thanks.
> >
> > Regards,
> > Shahab
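The contrast discussed above can be sketched in a few lines: a StringIndexer-style fit must collect every distinct label, so driver memory grows with cardinality, while hash-based indexing needs no fitted state at all. The function names below are hypothetical illustrations, not Spark's API, and Python's built-in `hash` stands in for a real feature hash.

```python
def fit_string_indexer(values):
    """Build a label -> index mapping over all distinct values.
    Memory use is proportional to the number of distinct categories,
    which is what makes high-cardinality columns problematic."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
    return mapping

def hash_index(value, num_buckets=1_000_000):
    """Index a value with a hash: constant memory, but distinct
    categories may collide into the same index."""
    return hash(value) % num_buckets

values = [f"user_{i}" for i in range(5)]
mapping = fit_string_indexer(values)
print(len(mapping))  # one entry retained per distinct category
print(hash_index("user_0") == hash_index("user_0"))  # stateless and repeatable
```

Note one behavioral difference: the fitted mapping can order indices by frequency (as StringIndexer does), whereas hashed indices carry no frequency information.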