Hi Shahab,

Do you actually need several columns with such a huge number of categories, where each category's index depends on the frequency of the original value?
If not, you can use each value's hash code as its category, or combine all the columns into a single vector using HashingTF.

Regards,
Filipp

On Tue, Apr 10, 2018 at 4:01 PM, Shahab Yunus <[email protected]> wrote:
> Does the StringIndexer keep all the mapped labels and their indices in the
> memory of the driver machine? It seems to, unless I am missing something.
>
> What if the data that needs to be indexed is huge, the columns to be indexed
> are high cardinality (that is, have lots of categories), and more than one
> such column needs to be indexed? Meaning it wouldn't fit in memory.
>
> Thanks.
>
> Regards,
> Shahab
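
To illustrate the hashing suggestion in the reply above, here is a minimal plain-Python sketch of the idea, not Spark's implementation. The names `hash_bucket`, `hash_row` and `NUM_BUCKETS` are hypothetical, and CRC32 is used only because it is deterministic and in the standard library; Spark's HashingTF uses its own hash function. The point is that a value's index is computed from the value itself, so no label-to-index map has to be held in memory.

```python
import zlib

# Assumed bucket count for illustration; more buckets mean fewer collisions.
NUM_BUCKETS = 1 << 18


def hash_bucket(value: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a category string to a bucket index without any lookup table."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def hash_row(row: dict, num_buckets: int = NUM_BUCKETS) -> dict:
    """Combine several categorical columns into one sparse vector.

    The result maps bucket index -> count. The column name is part of the
    hashed key so equal values in different columns are hashed as distinct
    features (they can still collide by chance).
    """
    vector = {}
    for column, value in row.items():
        index = hash_bucket(f"{column}={value}", num_buckets)
        vector[index] = vector.get(index, 0.0) + 1.0
    return vector


if __name__ == "__main__":
    # Hypothetical row with two high-cardinality columns.
    example = {"user_id": "u-123456", "product_id": "p-987654"}
    print(hash_row(example))
```

The trade-off is that distinct values can collide in the same bucket and the mapping cannot be inverted back to the original label, which is usually acceptable when the indices only feed a model. In Spark itself, HashingTF (or FeatureHasher, available from Spark 2.3) plays the role of `hash_row` here.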
