I have a large DataFrame of 1 billion rows of type LabeledPoint. I tried to
train a linear regression model on it, but training failed for lack of
memory, even though I'm using 9 slaves, each with 100 GB of RAM and 16 CPU
cores.

I decided to split my data into multiple chunks and train the model in
multiple phases, but I learned that the linear regression model in the ml
library has no "setInitialModel" function for passing the model trained on
one chunk on to the remaining chunks. In other words, each time I call the
fit function on a chunk of my data, it overwrites the previous model.
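
To make concrete what I mean by passing the model from one chunk to the
next, here is a minimal sketch in plain Python. It is not Spark code and
uses no Spark API: the helper names (sgd_fit_chunk, make_chunk), the
synthetic data, and the step size and epoch count are all assumptions made
up for illustration. Each call starts from the weights produced by the
previous chunk instead of starting from scratch.

```python
import random


def sgd_fit_chunk(chunk, weights, bias, step=0.05, epochs=20):
    """Run least-squares SGD over one chunk, starting from the given model."""
    w = list(weights)  # copy so the caller's list is not mutated
    b = bias
    for _ in range(epochs):
        for features, label in chunk:
            prediction = sum(wi * xi for wi, xi in zip(w, features)) + b
            error = prediction - label
            # Gradient step for the squared error of a single example.
            w = [wi - step * error * xi for wi, xi in zip(w, features)]
            b -= step * error
    return w, b


def make_chunk(n, rng):
    """Hypothetical noise-free data: label = 2*x1 - 3*x2 + 1."""
    rows = []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        rows.append((x, 2.0 * x[0] - 3.0 * x[1] + 1.0))
    return rows


rng = random.Random(0)
chunks = [make_chunk(200, rng) for _ in range(5)]

# Start from a zero model and carry it forward through every chunk.
w, b = [0.0, 0.0], 0.0
for chunk in chunks:
    w, b = sgd_fit_chunk(chunk, w, b)

print("weights:", w, "bias:", b)
```

Only one chunk has to be in memory at a time, and the model after the last
chunk reflects all of the data. Whether a given Spark trainer accepts
initial weights like this is exactly what I could not find for the ml
linear regression.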

So far the only solution I have found is to use Spark Streaming to split the
data into multiple DataFrames and then train on each one individually to
overcome the memory issue.

Do you know if there's any other solution?




On Mon, Jul 10, 2017 at 7:57 AM, Jayant Shekhar <jayantbaya...@gmail.com>
wrote:

> Hello Mahesh,
>
> We have built one. You can download it from here:
> https://www.sparkflows.io/download
>
> Feel free to ping me for any questions, etc.
>
> Best Regards,
> Jayant
>
>
> On Sun, Jul 9, 2017 at 9:35 PM, Mahesh Sawaiker <
> mahesh_sawai...@persistent.com> wrote:
>
>> Hi,
>>
>>
>> 1) Is anyone aware of any workbench-style tool for running ML jobs in
>> Spark? Specifically, the tool could be something like a web application
>> that is configured to connect to a Spark cluster.
>>
>>
>> The user would be able to select input training sets, probably from HDFS,
>> train, and then run predictions, without having to write any Scala code.
>>
>>
>> 2) If there is no such tool, is there value in having one, and what could
>> be the challenges?
>>
>>
>> Thanks,
>>
>> Mahesh
>>
>>
>>
>
>
