Hi there:
Yeah, I came to the same conclusion after tuning the Spark SQL shuffle
parameter. I also cut out some classes I was using to parse my dataset and
finally created a schema with only the fields needed for my model (before
that I was creating it with 63 fields when I only needed 15).
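The narrowing step above can be done directly in the Spark SQL query; a minimal sketch (table and column names here are made up for illustration):

```scala
import org.apache.spark.sql.SQLContext

// sc is an existing SparkContext.
val sqlContext = new SQLContext(sc)

// Project only the columns the model needs instead of the full 63-field
// schema; this shrinks both the schema and the cached/shuffled data.
val training = sqlContext.sql(
  "SELECT label, f1, f2, f3 FROM samples")  // hypothetical column names
```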
PS: I would recommend compressing the data when you cache the RDD.
There will be some overhead from compression/decompression and
serialization/deserialization, but it helps a lot for iterative
algorithms because you can cache more data.
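A minimal sketch of the compressed-caching suggestion, assuming a `trainingData` RDD already exists:

```scala
import org.apache.spark.storage.StorageLevel

// Cache the RDD in serialized form; with spark.rdd.compress=true the
// serialized blocks are also compressed, so more data fits in memory.
// Corresponding settings in spark-defaults.conf (or on the SparkConf):
//   spark.rdd.compress   true
//   spark.serializer     org.apache.spark.serializer.KryoSerializer
trainingData.persist(StorageLevel.MEMORY_ONLY_SER)
```

Kryo serialization keeps the serialize/deserialize overhead relatively low compared to the default Java serializer.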
Sincerely,
DB Tsai
Is that error actually occurring in LBFGS? It looks like it might be
happening before the data even gets to LBFGS. (Perhaps the outer join
you're trying to do is making the dataset size explode a bit.) Are you
able to call count() (or any RDD action) on the data before you pass it to
LBFGS?
I would recommend caching; if you can't persist, iterative algorithms will
not work well.
I don't think calling count on the dataset is problematic; every LBFGS
iteration passes over the whole dataset and does far more computation
than count().
It would be helpful to see the error messages.
Yeah, I can call count before that and it works. Also, I was over-caching
tables, but I removed those. Now there is no caching, but it gets really slow
since it recomputes my table RDD many times.
I also hacked the LBFGS code to pass in the number of examples, which I
calculated beforehand with Spark SQL.
Hi there:
I'm using the LBFGS optimizer to train a logistic regression model. The code I
implemented follows the pattern shown in
https://spark.apache.org/docs/1.2.0/mllib-linear-methods.html, but the
training data is obtained from a Spark SQL RDD.
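For concreteness, that pattern with Spark SQL as the data source might look like the sketch below. The table layout, column names, and hyperparameter values are hypothetical; `LBFGS.runLBFGS`, `LogisticGradient`, and `SquaredL2Updater` are the MLlib APIs from the linked docs.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}

val numFeatures = 15  // hypothetical

// Build (label, features) pairs from a Spark SQL query; the last vector
// element is the intercept term. Cache before handing to LBFGS, since it
// iterates over the data many times.
val training = sqlContext.sql("SELECT label, f1, f2 FROM samples").map { row =>
  val features = Array(row.getDouble(1), row.getDouble(2), 1.0)
  (row.getDouble(0), Vectors.dense(features))
}.cache()

val initialWeights = Vectors.dense(new Array[Double](numFeatures + 1))

val (weights, lossHistory) = LBFGS.runLBFGS(
  training,
  new LogisticGradient(),
  new SquaredL2Updater(),
  10,    // numCorrections
  1e-4,  // convergenceTol
  50,    // maxNumIterations
  0.1,   // regParam
  initialWeights)
```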
The problem I'm having is that LBFGS tries to count the examples in the
dataset, and that is where it fails.
Can you try increasing your driver memory, reducing the number of executors,
and increasing the executor memory?
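Something along these lines; the numbers, class name, and jar path are purely illustrative, and the point is fewer but larger executors plus a bigger driver:

```shell
spark-submit \
  --class com.example.TrainLR \
  --driver-memory 8g \
  --num-executors 8 \
  --executor-memory 10g \
  target/train-lr.jar
```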
Thanks
Best Regards
On Tue, Mar 3, 2015 at 10:09 AM, Gustavo Enrique Salazar Torres
gsala...@ime.usp.br wrote: