I didn't realize I do get a nice stack trace if not running in debug mode.
Basically, I believe Document has to be serializable.
But since the question has already been asked: are there other requirements for
objects within an RDD that I should be aware of? Serializable is very
understandable, but what about clone, hashCode, etc.?
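To make the Serializable point concrete, here is a minimal sketch of what the Document class might look like. The real class isn't shown in the thread, so the field names and constructor here are assumptions; the essential part is implementing java.io.Serializable, since collect() has to deserialize the objects back on the driver.

```java
import java.io.Serializable;
import java.util.List;

// Hypothetical sketch of the Document class from the thread.
// Implementing Serializable lets Spark ship instances between the
// driver and executors; without it, collect() fails at serialization.
public class Document implements Serializable {
    private static final long serialVersionUID = 1L; // keep stable across class changes

    private final String id;            // assumed field
    private final List<String> features; // assumed field backing getFeatures()

    public Document(String id, List<String> features) {
        this.id = id;
        this.features = features;
    }

    public String getId() {
        return id;
    }

    public List<String> getFeatures() {
        return features;
    }
}
```

clone and hashCode are not required for serialization itself; hashCode/equals only start to matter for key-based operations such as distinct() or groupByKey().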
From: ronalday...@live.com
To: user@spark.apache.org
Subject: collecting fails - requirements for collecting (clone, hashCode etc?)
Date: Wed, 3 Dec 2014 07:48:53 -0600
The following code is failing on the collect. If I don't collect and just keep
it as a JavaRDD, it works fine, except I really would like to collect.
At first I was getting an error regarding JDI threads and an index being 0.
Then it just started locking up. I'm running the Spark context locally on 8
cores.
long count = documents
    .filter(d -> d.getFeatures().size() > Parameters.MIN_CENTROID_FEATURES)
    .count();
List<Document> sampledDocuments = documents
    .filter(d -> d.getFeatures().size() > Parameters.MIN_CENTROID_FEATURES)
    .sample(false, samplingFraction(count))
    .collect();