Garbage collection issue on mapPartitions

2016-01-29 Thread rcollich
Hi all,

I currently have a mapPartitions job that flatMaps each value in the
iterator, and I'm running into major GC costs on certain executions. Some
executors take 20 minutes, 15 of which are pure garbage collection, and I
believe much of it comes from the ArrayBuffer I'm building up as output.
Does anyone have suggestions on how I can stream the output instead?
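One common answer on this list is that mapPartitions does not require you to
buffer results at all: the function you pass in can return a lazy iterator,
and Spark will then consume and serialize elements one at a time instead of
holding the whole partition's output on the heap. A minimal sketch in plain
Python (no Spark needed to run it; `expand()` is a hypothetical stand-in for
whatever per-record flat-mapping the job does):

```python
def expand(record):
    # Hypothetical per-record expansion; replace with your real logic.
    return [record, record * 2]

def flat_map_partition(records):
    """Lazily flat-map every record in a partition; nothing is buffered."""
    for record in records:
        for item in expand(record):
            yield item

# With PySpark this generator function is passed directly:
#   rdd.mapPartitions(flat_map_partition)
# The Scala equivalent returns the Iterator without an ArrayBuffer:
#   rdd.mapPartitions(_.flatMap(expand))

print(list(flat_map_partition(iter([1, 2, 3]))))  # → [1, 2, 2, 4, 3, 6]
```

Because the function yields rather than appending to a collection, each
element becomes garbage-collectible as soon as Spark has written it out,
which is usually what cuts the GC pressure described above.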

Also, does anyone have general advice for tracking down and addressing GC
issues in Spark?
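For tracking GC down, the usual first step (per the Spark tuning guide) is to
turn on verbose GC logging in the executor JVMs; collection details then
appear in each executor's stdout log in the web UI. A sketch, assuming a
spark-submit deployment and the standard HotSpot flags for Java 7/8
(`your_job.py` is a placeholder):

```shell
# Log every collection with details and timestamps on each executor.
spark-submit \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  your_job.py
```

The "GC Time" column on the Executors tab of the web UI is also worth
watching: a large GC-time-to-task-time ratio on specific executors is the
same symptom described above and usually points at oversized per-task
allocations.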



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Garbage-collections-issue-on-MapPartitions-tp26104.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Setting up data for columnSimilarities

2016-01-28 Thread rcollich
Hi all,

I need to compute the cosine similarity of a series of vectors (for the sake
of argument, let's say every vector is a tweet). However, I'm having trouble
preparing my data for the columnSimilarities() function: I receive these
vectors in row format, and I can't find any "transpose" function that works
well. Has anyone run into a similar issue?

Thank you
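The reason the transpose matters: columnSimilarities() compares the *columns*
of a RowMatrix, so tweets stored as rows must become columns first. (In
MLlib the usual route is IndexedRowMatrix → toCoordinateMatrix() →
transpose() → toRowMatrix(), though whether that path fits your Spark
version is worth checking.) A plain-Python sketch of the underlying idea,
with made-up toy vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Each row is one tweet's feature vector (toy data).
tweets = [
    [1.0, 2.0, 0.0],
    [0.0, 1.0, 1.0],
    [2.0, 4.0, 0.0],  # parallel to tweet 0
]

# Transpose so each tweet becomes a *column*; a column-vs-column
# similarity on `transposed` is then a tweet-vs-tweet similarity.
transposed = [list(col) for col in zip(*tweets)]

# Column i of the transposed matrix is exactly row i of the original.
col0 = [row[0] for row in transposed]
col2 = [row[2] for row in transposed]
assert col0 == tweets[0]

print(cosine(col0, col2))  # parallel vectors, so ~1.0
```

In other words, once the matrix is transposed, columnSimilarities() produces
the pairwise tweet similarities directly; no per-pair computation like the
cosine() above is needed on the Spark side.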



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Setting-up-data-for-columnsimilarity-tp26098.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
