RE: collecting fails - requirements for collecting (clone, hashCode etc?)

2014-12-03 Thread Ron Ayoub
I didn't realize that I do get a nice stack trace when not running in debug mode. 
Basically, I believe Document has to be Serializable. 
But since the question has already been asked: are there other requirements for 
objects within an RDD that I should be aware of? Serializable is very 
understandable. How about clone, hashCode, etc.?
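
For what it's worth, Serializable seems to be the key requirement for collect(): 
the elements are shipped from the executors back to the driver, so they must be 
serializable (Java serialization by default, or Kryo if the class is registered). 
As far as I know Spark never calls clone(), and hashCode()/equals() only matter 
when the object is used as a key (groupByKey, distinct, joins). A minimal sketch 
of a collectable Document; the fields shown here are just placeholders:

    import java.io.Serializable;
    import java.util.List;

    // Sketch only: any element collected to the driver must be serializable,
    // because it travels from the executors over the network.
    public class Document implements Serializable {
        private static final long serialVersionUID = 1L;

        private long id;               // placeholder field
        private List<String> features; // placeholder field

        public List<String> getFeatures() { return features; }

        // equals()/hashCode() are not needed for collect(); they matter only
        // if Document is used as a key (e.g. in groupByKey or distinct).
    }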

From: ronalday...@live.com
To: user@spark.apache.org
Subject: collecting fails - requirements for collecting (clone, hashCode etc?)
Date: Wed, 3 Dec 2014 07:48:53 -0600




The following code is failing on the collect. If I don't do the collect and just 
keep the JavaRDD, it works fine, except that I really would like to collect. 
At first I was getting an error regarding JDI threads and an index being 0; 
then it just started locking up. I'm running the Spark context locally on 8 
cores. 

long count = documents
    .filter(d -> d.getFeatures().size() > Parameters.MIN_CENTROID_FEATURES)
    .count();

List<Document> sampledDocuments = documents
    .filter(d -> d.getFeatures().size() > Parameters.MIN_CENTROID_FEATURES)
    .sample(false, samplingFraction(count))
    .collect();
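
As a side note, the same filter is evaluated twice above. A small restructuring, 
sketched here on the assumption that documents is a JavaRDD<Document>, filters 
once and caches the result so count() and sample() reuse it:

    // Sketch, assuming documents is a JavaRDD<Document>: filter once,
    // cache, and let count() and sample() reuse the filtered data.
    JavaRDD<Document> filtered = documents
        .filter(d -> d.getFeatures().size() > Parameters.MIN_CENTROID_FEATURES)
        .cache();

    long count = filtered.count();

    List<Document> sampledDocuments = filtered
        .sample(false, samplingFraction(count))
        .collect();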

collecting fails - requirements for collecting (clone, hashCode etc?)

2014-12-03 Thread Ron Ayoub
The following code is failing on the collect. If I don't do the collect and just 
keep the JavaRDD, it works fine, except that I really would like to collect. 
At first I was getting an error regarding JDI threads and an index being 0; 
then it just started locking up. I'm running the Spark context locally on 8 
cores. 

long count = documents
    .filter(d -> d.getFeatures().size() > Parameters.MIN_CENTROID_FEATURES)
    .count();

List<Document> sampledDocuments = documents
    .filter(d -> d.getFeatures().size() > Parameters.MIN_CENTROID_FEATURES)
    .sample(false, samplingFraction(count))
    .collect();