Re: Training A ML Model on a Huge Dataframe

2017-08-24 Thread Yanbo Liang
Hi Sea,

Could you let us know which ML algorithm you use? What are the number of
instances and the dimensionality of your dataset?
AFAIK, Spark MLlib can train a model with several million features if you
configure it correctly.

Thanks
Yanbo
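
For concreteness, "configuring it correctly" mostly comes down to giving the
driver and executors enough memory. A hypothetical submission could look like
the following (the flags are standard Spark options, but the sizes and the
script name `train_model.py` are placeholders to tune for your own cluster):

```
spark-submit \
  --driver-memory 8g \
  --executor-memory 8g \
  --num-executors 10 \
  --conf spark.driver.maxResultSize=4g \
  train_model.py
```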

On Thu, Aug 24, 2017 at 7:07 AM, Suzen, Mehmet  wrote:

> SGD is supported. I see, I had assumed you were using Scala. It looks like
> you can do streaming regression; I am not sure about the PySpark API, though:
>
> https://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression
>
> On 23 August 2017 at 18:22, Sea aj  wrote:
>
>> Thanks for the reply.
>>
>> As far as I understand, mini-batch is not yet supported in the ML library.
>> As for the MLlib mini-batch, I could not find any PySpark API.
>>
>>
>>
>>
>> On Wed, Aug 23, 2017 at 2:59 PM, Suzen, Mehmet  wrote:
>>
>>> It depends on which model you would like to train, but models requiring
>>> optimisation can use SGD with mini-batches. See:
>>> https://spark.apache.org/docs/latest/mllib-optimization.html#stochastic-gradient-descent-sgd
>>>
>>> On 23 August 2017 at 14:27, Sea aj  wrote:
>>>
 Hi,

 I am trying to feed a huge dataframe to an ML algorithm in Spark, but it
 crashes due to a shortage of memory.

 Is there a way to train the model on a subset of the data in multiple
 steps?

 Thanks




>>>
>>>
>>>
>>> --
>>>
>>> Mehmet Süzen, MSc, PhD
>>> 
>>>
>>> | PRIVILEGED AND CONFIDENTIAL COMMUNICATION This e-mail transmission,
>>> and any documents, files or previous e-mail messages attached to it, may
>>> contain confidential information that is legally privileged. If you are not
>>> the intended recipient or a person responsible for delivering it to the
>>> intended recipient, you are hereby notified that any disclosure, copying,
>>> distribution or use of any of the information contained in or attached to
>>> this transmission is STRICTLY PROHIBITED within the applicable law. If you
>>> have received this transmission in error, please: (1) immediately notify me
>>> by reply e-mail to su...@acm.org,  and (2) destroy the original
>>> transmission and its attachments without reading or saving in any manner. |
>>>
>>
>>
>
>
>


Re: Training A ML Model on a Huge Dataframe

2017-08-23 Thread Suzen, Mehmet
SGD is supported. I see, I had assumed you were using Scala. It looks like you
can do streaming regression; I am not sure about the PySpark API, though:

https://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression
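
The idea behind streaming regression, independent of the Spark API, is that the
model takes an SGD step on each micro-batch as it arrives, so the full dataset
never has to be in memory at once. A plain-Python sketch of that update
(illustrative only, not the MLlib implementation):

```python
def sgd_step(w, b, batch, lr=0.5):
    """One gradient step for y ~ w*x + b on a newly arrived micro-batch."""
    n = len(batch)
    gw = sum((w * x + b - y) * x for x, y in batch) / n
    gb = sum((w * x + b - y) for x, y in batch) / n
    return w - lr * gw, b - lr * gb

# Simulated stream: points on the line y = 3x + 1 arrive in four micro-batches;
# the model is refined batch by batch without ever holding all rows at once.
data = [(x / 20.0, 3.0 * (x / 20.0) + 1.0) for x in range(20)]
stream = [data[i::4] for i in range(4)]  # interleaved slices stand in for arriving batches
w, b = 0.0, 0.0
for batch in stream:
    for _ in range(50):  # a few SGD passes per micro-batch
        w, b = sgd_step(w, b, batch)
```

After the last batch, (w, b) is close to the true (3, 1), even though each
update only ever saw five rows.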





Re: Training A ML Model on a Huge Dataframe

2017-08-23 Thread Sea aj
Thanks for the reply.

As far as I understand, mini-batch is not yet supported in the ML library. As
for the MLlib mini-batch, I could not find any PySpark API.






Re: Training A ML Model on a Huge Dataframe

2017-08-23 Thread Suzen, Mehmet
It depends on which model you would like to train, but models requiring
optimisation can use SGD with mini-batches. See:
https://spark.apache.org/docs/latest/mllib-optimization.html#stochastic-gradient-descent-sgd
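
To make the mini-batch idea concrete without the Spark machinery, here is a
plain-Python sketch: each iteration computes the gradient on only a sampled
fraction of the rows. The `mini_batch_fraction` name mirrors MLlib's
`miniBatchFraction` parameter, but the code itself is illustrative, not the
MLlib implementation:

```python
import random

def minibatch_sgd(data, lr=0.5, iterations=500, mini_batch_fraction=0.3, seed=0):
    """Fit y ~ w*x + b, touching only a sampled subset of rows per step."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    batch_size = max(1, int(len(data) * mini_batch_fraction))
    for _ in range(iterations):
        batch = rng.sample(data, batch_size)  # the memory-friendly part
        gw = sum((w * x + b - y) * x for x, y in batch) / batch_size
        gb = sum((w * x + b - y) for x, y in batch) / batch_size
        w -= lr * gw
        b -= lr * gb
    return w, b

# Noise-free line y = 2x + 1: every sampled gradient vanishes at (2, 1),
# so the mini-batch iterates converge to the same solution as full-batch.
data = [(x / 10.0, 2.0 * (x / 10.0) + 1.0) for x in range(10)]
w, b = minibatch_sgd(data)
```

In Spark the sampling would happen per partition on the cluster; the point here
is only that each step needs a fraction of the data, not all of it.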






Training A ML Model on a Huge Dataframe

2017-08-23 Thread Sea aj
Hi,

I am trying to feed a huge dataframe to an ML algorithm in Spark, but it
crashes due to a shortage of memory.

Is there a way to train the model on a subset of the data in multiple steps?

Thanks


