I'm currently working on various performance tests for large, sparse feature spaces.

For the Criteo DAC data - 45.8 million rows, 34.3 million features (categorical, extremely sparse) - the time per iteration for ml.LogisticRegression is about 20-30s. This is with 4 worker nodes, each with 48 cores & 120GB RAM. I haven't yet tuned the tree aggregation depth, but the number of partitions can make a difference - generally fewer is better, since the cost is mostly communication of the gradient (the gradient computation itself is < 10% of the per-iteration time).

Note that the current implementation forces dense arrays for intermediate data structures, which increases the communication cost significantly. See this PR for details: https://github.com/apache/spark/pull/12761. Once sparse data structures are supported for this, the linear models will be orders of magnitude more scalable for sparse data.

On Wed, 5 Oct 2016 at 23:37 DB Tsai <dbt...@dbtsai.com> wrote:

> With the latest code in the current master, we're successfully
> training LOR using Spark ML's implementation with 14M sparse features.
> You need to tune the depth of aggregation to make it efficient.
>
> Sincerely,
>
> DB Tsai
> ----------------------------------------------------------
> Web: https://www.dbtsai.com
> PGP Key ID: 0x9DCC1DBD7FC7BBB2
>
>
> On Wed, Oct 5, 2016 at 12:00 PM, Yang <teddyyyy...@gmail.com> wrote:
> > anybody had actual experience applying it to real problems of this scale?
> >
> > thanks
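For anyone curious what "tuning the depth of aggregation" buys you: Spark's treeAggregate combines per-partition gradients in multiple rounds, so the driver only receives a handful of already-summed vectors instead of one per partition. Here's a minimal pure-Python sketch of that idea (not Spark's actual code; the fan-in heuristic is my own illustration):

```python
import math

def tree_aggregate(partition_grads, depth=2):
    """Combine per-partition gradient vectors in `depth` rounds,
    mimicking the shape of Spark's treeAggregate: each round groups
    neighbouring partial gradients and sums them element-wise, so the
    final (driver-side) combine only touches a few vectors."""
    level = partition_grads
    while depth > 1 and len(level) > 1:
        # pick a fan-in so that `depth` rounds shrink the list to ~1 vector
        fan_in = max(2, int(math.ceil(len(level) ** (1.0 / depth))))
        groups = [level[i:i + fan_in] for i in range(0, len(level), fan_in)]
        level = [[sum(vals) for vals in zip(*g)] for g in groups]
        depth -= 1
    # final combine, analogous to the reduce at the driver
    return [sum(vals) for vals in zip(*level)]

# 8 partitions, each contributing the same 3-element gradient
grads = [[1.0, 2.0, 3.0]] * 8
print(tree_aggregate(grads, depth=2))  # [8.0, 16.0, 24.0]
```

In real Spark ML you'd set this via the `aggregationDepth` param on `LogisticRegression` (default 2); a deeper tree helps when there are many partitions or very wide gradient vectors.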
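To put a rough number on why the forced dense intermediates hurt: at 34.3 million features, every partial gradient shipped during aggregation is a full float64 array, regardless of how sparse the data is. A back-of-the-envelope comparison (the nonzero count below is a hypothetical illustration, not a measurement; note too that partial sums densify as they're combined, so real sparse savings vary by level):

```python
NUM_FEATURES = 34_300_000     # Criteo DAC feature count from the thread
DENSE_BYTES_PER_VALUE = 8     # float64
SPARSE_BYTES_PER_NNZ = 4 + 8  # int32 index + float64 value

# Dense representation: size is fixed by the feature dimension.
dense_mb = NUM_FEATURES * DENSE_BYTES_PER_VALUE / 1e6

# Sparse representation: size scales with nonzeros in the partial gradient.
nnz = 1_000_000               # hypothetical nonzeros per partial gradient
sparse_mb = nnz * SPARSE_BYTES_PER_NNZ / 1e6

print(f"dense gradient ~ {dense_mb:.0f} MB, sparse ~ {sparse_mb:.0f} MB")
# dense gradient ~ 274 MB, sparse ~ 12 MB
```

That ~274 MB per partial gradient, multiplied across partitions and iterations, is why communication dominates the 20-30s per-iteration time above.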
For the Criteo DAC data - 45.8 million rows, 34.3 million features (categorical, extremely sparse), the time per iteration for ml.LogisticRegression is about 20-30s. This is with 4x worker nodes, 48 cores & 120GB RAM each. I haven't yet tuned the tree aggregation depth. But the number of partitions can make a difference - generally fewer is better since the cost is mostly communication of the gradient (the gradient computation is < 10% of the per-iteration time). Note that the current impl forces dense arrays for intermediate data structures, increasing the communication cost significantly. See this PR for info: https://github.com/apache/spark/pull/12761. Once sparse data structures are supported for this, the linear models will be orders of magnitude more scalable for sparse data. On Wed, 5 Oct 2016 at 23:37 DB Tsai <dbt...@dbtsai.com> wrote: > With the latest code in the current master, we're successfully > training LOR using Spark ML's implementation with 14M sparse features. > You need to tune the depth of aggregation to make it efficient. > > Sincerely, > > DB Tsai > ---------------------------------------------------------- > Web: https://www.dbtsai.com > PGP Key ID: 0x9DCC1DBD7FC7BBB2 > > > On Wed, Oct 5, 2016 at 12:00 PM, Yang <teddyyyy...@gmail.com> wrote: > > anybody had actual experience applying it to real problems of this scale? > > > > thanks > > > > --------------------------------------------------------------------- > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > >