You want to reduce the number of partitions to around the number of
executors * cores. Since you have so many tasks/partitions, they put a
lot of pressure on treeReduce in LoR. Let me know if this helps.
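
For example, a minimal sketch (assuming your DataFrame is called
"training" and using the 100 executors x 8 cores you mention below):

// A sketch, not your actual code: coalesce down to roughly
// (# executors) * (# cores) partitions before fitting. Assumes the
// DataFrame already has "features" and "label" columns.
import org.apache.spark.ml.classification.LogisticRegression

val numPartitions = 100 * 8  // executors * cores
val coalesced = training.coalesce(numPartitions).cache()
val model = new LogisticRegression().fit(coalesced)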


Sincerely,

DB Tsai
----------------------------------------------------------
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D
<https://pgp.mit.edu/pks/lookup?search=0x59DF55B8AF08DF8D>

On Wed, Sep 23, 2015 at 5:39 PM, Eugene Zhulenev <eugene.zhule...@gmail.com>
wrote:

> ~3000 features, pretty sparse; I think about 200-300 non-zero features in
> each row. We have 100 executors x 8 cores. The number of tasks is pretty
> big, 30k-70k, I can't remember the exact number. The training set is the
> result of a pretty big join across multiple data frames, but it's cached.
> However, as I understand it, Spark still keeps the DAG lineage of the RDD
> so it can recover it if one of the nodes fails.
>
> I'll try tomorrow to save the training set as Parquet, load it back as a
> DataFrame, and run the modeling that way.
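>
> Roughly this sketch (the path is made up); writing out and reading back
> should drop the join lineage:
>
> // A sketch, not the actual code: persist the cached training set and
> // read it back, so the fresh DataFrame carries no join lineage.
> training.write.parquet("/tmp/training.parquet")  // hypothetical path
> val training2 = sqlContext.read.parquet("/tmp/training.parquet")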
>
> On Wed, Sep 23, 2015 at 7:56 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>
>> Your code looks correct to me. How many features do you have in
>> this training set? How many tasks are running in the job?
>>
>>
>> Sincerely,
>>
>> DB Tsai
>> ----------------------------------------------------------
>> Blog: https://www.dbtsai.com
>> PGP Key ID: 0xAF08DF8D
>> <https://pgp.mit.edu/pks/lookup?search=0x59DF55B8AF08DF8D>
>>
>> On Wed, Sep 23, 2015 at 4:38 PM, Eugene Zhulenev <
>> eugene.zhule...@gmail.com> wrote:
>>
>>> It's really simple:
>>> https://gist.github.com/ezhulenev/7777886517723ca4a353
>>>
>>> We've seen the same strange heap behavior even for a single model: it
>>> takes ~20 GB of heap on the driver to build a single model with fewer
>>> than 1 million rows in the input data frame.
>>>
>>> On Wed, Sep 23, 2015 at 6:31 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>>>
>>>> Could you paste some of your code for diagnosis?
>>>>
>>>>
>>>> Sincerely,
>>>>
>>>> DB Tsai
>>>> ----------------------------------------------------------
>>>> Blog: https://www.dbtsai.com
>>>> PGP Key ID: 0xAF08DF8D
>>>> <https://pgp.mit.edu/pks/lookup?search=0x59DF55B8AF08DF8D>
>>>>
>>>> On Wed, Sep 23, 2015 at 3:19 PM, Eugene Zhulenev <
>>>> eugene.zhule...@gmail.com> wrote:
>>>>
>>>>> We are running Apache Spark 1.5.0 (latest code from the 1.5 branch).
>>>>>
>>>>> We are running 2-3 LogisticRegression models in parallel (we'd love to
>>>>> run 10-20, actually). They are not really big at all, maybe 1-2 million
>>>>> rows per model.
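>>>>>
>>>>> For reference, the pattern looks roughly like this sketch (names are
>>>>> made up; the real code is in the gist linked earlier in the thread):
>>>>>
>>>>> // A rough sketch of fitting several models concurrently. trainingSets
>>>>> // is an assumed Seq[DataFrame], each with "features"/"label" columns.
>>>>> import scala.concurrent.{Await, Future}
>>>>> import scala.concurrent.ExecutionContext.Implicits.global
>>>>> import scala.concurrent.duration.Duration
>>>>> import org.apache.spark.ml.classification.LogisticRegression
>>>>>
>>>>> val futures = trainingSets.map { df =>
>>>>>   Future { new LogisticRegression().fit(df) }
>>>>> }
>>>>> val models = Await.result(Future.sequence(futures), Duration.Inf)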
>>>>>
>>>>> The cluster itself and all executors look good: enough free memory and
>>>>> no exceptions or errors.
>>>>>
>>>>> However, I see very strange behavior inside the Spark driver: the
>>>>> allocated heap is constantly growing. It grows to 30 GB in 1.5 hours,
>>>>> and then everything becomes super slow.
>>>>>
>>>>> We don't do any collect, and I really don't understand what is
>>>>> consuming all this memory. It looks like something inside
>>>>> LogisticRegression itself; however, I only see treeAggregate, which
>>>>> should not require so much memory to run.
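>>>>>
>>>>> My understanding of treeAggregate (an illustration with made-up data,
>>>>> not the MLlib internals) is that partials are merged on the executors,
>>>>> so the driver should only receive a few combined results:
>>>>>
>>>>> // sc is an existing SparkContext. With depth = 2, per-partition sums
>>>>> // are combined in an intermediate reduce step before the driver.
>>>>> val total = sc.parallelize(1 to 1000000, 800)
>>>>>   .treeAggregate(0.0)(
>>>>>     seqOp  = (acc, x) => acc + x,  // fold within each partition
>>>>>     combOp = (a, b) => a + b,      // merge partial sums in a tree
>>>>>     depth  = 2)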
>>>>>
>>>>> Any ideas?
>>>>>
>>>>> Plus, I don't see any GC pauses; it looks like the memory is still
>>>>> being held by something inside the driver.
>>>>>
>>>>> [image: Inline image 2]
>>>>> [image: Inline image 1]
>>>>>
>>>>
>>>>
>>>
>>
>
