~3000 features, pretty sparse, I think about 200-300 non-zero features in
each row. We have 100 executors x 8 cores. The number of tasks is pretty
big, 30k-70k, I can't remember the exact number. The training set is the
result of a pretty big join of multiple data frames, but it's cached.
However, as I understand it, Spark still keeps the DAG lineage of the RDD
so it can recover it if one of the nodes fails.
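
A quick way to see how much lineage the driver is holding is to print the
RDD debug string (sketch; trainDF stands in for our cached training frame):

  // Prints the lineage behind the cached DataFrame; a very long output
  // here means the driver keeps a large DAG in memory for recovery.
  println(trainDF.rdd.toDebugString)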

Tomorrow I'll try to save the training set as Parquet, load it back as a
DataFrame, and run the modeling that way.
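
Roughly what I have in mind (sketch; the path and variable names are
placeholders):

  // Write the joined training set out once, then read it back so the
  // DataFrame used for modeling carries no lineage back to the join.
  trainDF.write.parquet("hdfs:///tmp/train_set.parquet")
  val freshTrainDF = sqlContext.read.parquet("hdfs:///tmp/train_set.parquet")
  // ...then fit the models against freshTrainDF instead of the original frame.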

On Wed, Sep 23, 2015 at 7:56 PM, DB Tsai <dbt...@dbtsai.com> wrote:

> Your code looks correct to me. How many features do you have in this
> training set? How many tasks are running in the job?
>
>
> Sincerely,
>
> DB Tsai
> ----------------------------------------------------------
> Blog: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
> <https://pgp.mit.edu/pks/lookup?search=0x59DF55B8AF08DF8D>
>
> On Wed, Sep 23, 2015 at 4:38 PM, Eugene Zhulenev <
> eugene.zhule...@gmail.com> wrote:
>
>> It's really simple:
>> https://gist.github.com/ezhulenev/7777886517723ca4a353
>>
>> We've seen the same strange heap behavior even for a single model: it
>> takes ~20 GB of heap on the driver to build one model with less than 1
>> million rows in the input data frame.
>>
>> On Wed, Sep 23, 2015 at 6:31 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>>
>>> Could you paste some of your code for diagnosis?
>>>
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> ----------------------------------------------------------
>>> Blog: https://www.dbtsai.com
>>> PGP Key ID: 0xAF08DF8D
>>> <https://pgp.mit.edu/pks/lookup?search=0x59DF55B8AF08DF8D>
>>>
>>> On Wed, Sep 23, 2015 at 3:19 PM, Eugene Zhulenev <
>>> eugene.zhule...@gmail.com> wrote:
>>>
>>>> We are running Apache Spark 1.5.0 (latest code from the 1.5 branch).
>>>>
>>>> We are running 2-3 LogisticRegression models in parallel (we'd actually
>>>> love to run 10-20). They are not really big at all, maybe 1-2 million
>>>> rows per model.
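>>>>
>>>> Roughly, running the fits concurrently from the driver looks like this
>>>> (simplified sketch with placeholder names, just to illustrate the setup):
>>>>
>>>>   import scala.concurrent.{Await, Future}
>>>>   import scala.concurrent.ExecutionContext.Implicits.global
>>>>   import scala.concurrent.duration.Duration
>>>>   import org.apache.spark.ml.classification.LogisticRegression
>>>>
>>>>   // Each future fits an independent model on its own DataFrame; all of
>>>>   // them share one SparkContext and are launched from the driver JVM.
>>>>   val futures = trainingFrames.map { df =>
>>>>     Future { new LogisticRegression().setMaxIter(100).fit(df) }
>>>>   }
>>>>   val models = futures.map(f => Await.result(f, Duration.Inf))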
>>>>
>>>> The cluster itself and all executors look good: enough free memory and
>>>> no exceptions or errors.
>>>>
>>>> However, I see very strange behavior inside the Spark driver: the
>>>> allocated heap keeps growing. It grows to about 30 GB in 1.5 hours, and
>>>> then everything becomes extremely slow.
>>>>
>>>> We don't do any collect, and I really don't understand what is consuming
>>>> all this memory. It looks like something inside LogisticRegression
>>>> itself; however, I only see treeAggregate, which should not require that
>>>> much memory to run.
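>>>>
>>>> As far as I understand treeAggregate, the partial results are combined
>>>> on the executors in a tree, so the driver only merges a handful of
>>>> partially combined values instead of one per task. Toy example
>>>> (unrelated to our actual code), assuming a plain SparkContext named sc:
>>>>
>>>>   // 2-level tree aggregation: partition results are merged on executors
>>>>   // first; the driver only does the final small merge.
>>>>   val total = sc.parallelize(1 to 1000000, numSlices = 200)
>>>>     .treeAggregate(0L)(
>>>>       seqOp = (acc, x) => acc + x,
>>>>       combOp = (a, b) => a + b,
>>>>       depth = 2)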
>>>>
>>>> Any ideas?
>>>>
>>>> Also, I don't see any GC pauses; it looks like the memory is still being
>>>> held by something inside the driver.
>>>>
>>>> [image: Inline image 2]
>>>> [image: Inline image 1]
>>>>
>>>
>>>
>>
>
