You want to reduce the # of partitions to around the # of executors * cores. Having so many tasks/partitions puts a lot of pressure on treeReduce in LoR, since every pass has tens of thousands of partial aggregates to combine. Let me know if this helps.
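Something like this before fitting should do it (a minimal sketch, assuming the Spark 1.5 DataFrame-based ML API; `fitCompacted` and `training` are placeholder names for your own code):

import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.sql.DataFrame

// Sketch: compact the training set so treeReduce/treeAggregate only has
// ~(executors * cores) partial aggregates to combine instead of 30k-70k.
def fitCompacted(training: DataFrame): LogisticRegressionModel = {
  val numExecutors = 100       // from the setup you describe below
  val coresPerExecutor = 8
  // coalesce lowers the partition count without a full shuffle
  val compacted = training.coalesce(numExecutors * coresPerExecutor).cache()
  new LogisticRegression().setMaxIter(100).fit(compacted)  // maxIter is just an example setting
}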
Sincerely,

DB Tsai
----------------------------------------------------------
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D
<https://pgp.mit.edu/pks/lookup?search=0x59DF55B8AF08DF8D>

On Wed, Sep 23, 2015 at 5:39 PM, Eugene Zhulenev <eugene.zhule...@gmail.com> wrote:
> ~3000 features, pretty sparse, I think about 200-300 non-zero features in
> each row. We have 100 executors x 8 cores. The number of tasks is pretty
> big, 30k-70k, I can't remember the exact number. The training set is the
> result of a pretty big join of multiple data frames, but it's cached.
> However, as I understand it, Spark still keeps the DAG history of the RDD
> to be able to recover it in case one of the nodes fails.
>
> Tomorrow I'll try saving the training set as parquet, loading it back as
> a DataFrame, and running the modeling that way.
>
> On Wed, Sep 23, 2015 at 7:56 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>
>> Your code looks correct to me. How many features do you have in this
>> training? How many tasks are running in the job?
>>
>>
>> Sincerely,
>>
>> DB Tsai
>> ----------------------------------------------------------
>> Blog: https://www.dbtsai.com
>> PGP Key ID: 0xAF08DF8D
>> <https://pgp.mit.edu/pks/lookup?search=0x59DF55B8AF08DF8D>
>>
>> On Wed, Sep 23, 2015 at 4:38 PM, Eugene Zhulenev <
>> eugene.zhule...@gmail.com> wrote:
>>
>>> It's really simple:
>>> https://gist.github.com/ezhulenev/7777886517723ca4a353
>>>
>>> We've seen the same strange heap behavior even for a single model; it
>>> takes ~20 gigs of heap on the driver to build a single model with less
>>> than 1 million rows in the input data frame.
>>>
>>> On Wed, Sep 23, 2015 at 6:31 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>>>
>>>> Could you paste some of your code for diagnosis?
>>>>
>>>>
>>>> Sincerely,
>>>>
>>>> DB Tsai
>>>> ----------------------------------------------------------
>>>> Blog: https://www.dbtsai.com
>>>> PGP Key ID: 0xAF08DF8D
>>>> <https://pgp.mit.edu/pks/lookup?search=0x59DF55B8AF08DF8D>
>>>>
>>>> On Wed, Sep 23, 2015 at 3:19 PM, Eugene Zhulenev <
>>>> eugene.zhule...@gmail.com> wrote:
>>>>
>>>>> We are running Apache Spark 1.5.0 (latest code from the 1.5 branch).
>>>>>
>>>>> We are running 2-3 LogisticRegression models in parallel (we'd love
>>>>> to run 10-20, actually). They are not really big at all, maybe 1-2
>>>>> million rows per model.
>>>>>
>>>>> The cluster itself and all executors look good: enough free memory
>>>>> and no exceptions or errors.
>>>>>
>>>>> However, I see very strange behavior inside the Spark driver. The
>>>>> allocated heap grows constantly, up to 30 gigs in 1.5 hours, and then
>>>>> everything becomes super slow.
>>>>>
>>>>> We don't do any collect, and I really don't understand what is
>>>>> consuming all this memory. It looks like something inside
>>>>> LogisticRegression itself, but I only see treeAggregate, which should
>>>>> not require so much memory to run.
>>>>>
>>>>> Any ideas?
>>>>>
>>>>> Plus, I don't see any GC pauses; it looks like the memory is still
>>>>> being held by someone inside the driver.
>>>>>
>>>>> [image: Inline image 2]
>>>>> [image: Inline image 1]
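For reference, the parquet round-trip Eugene mentions above would look roughly like this (a sketch, not tested; the path and helper name are made up):

import org.apache.spark.sql.{DataFrame, SQLContext}

// Writing the joined training set to parquet and reading it back gives a
// DataFrame whose lineage starts at the parquet files, so the driver no
// longer carries the DAG history of the original multi-frame join.
def roundTripThroughParquet(sqlContext: SQLContext, training: DataFrame, path: String): DataFrame = {
  training.write.parquet(path)   // e.g. "hdfs:///tmp/training.parquet" (placeholder path)
  sqlContext.read.parquet(path)
}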