Good question.  This is something we wanted to fix, but unfortunately I'm
not sure how to do it without changing the RDD API, which is undesirable
now that the 1.0 branch has been cut. We should figure something out for
1.1, though.

I've created https://issues.apache.org/jira/browse/SPARK-1460 to track this.

A few workarounds / hacks:
 - For distinct, you can do it in SQL instead of using the Spark function;
   this will preserve the schema.
 - When getting rows, it may be more concise to use the extractor instead
   of asInstanceOf:
  schemaRDD.map { case Row(key: Int, value: String) => ... }
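
Putting both together, roughly (a minimal sketch against the 1.0 APIs;
the Record case class and the "records" table name are made up for
illustration, and sc is assumed to be a SparkContext as in the shell):

  import org.apache.spark.sql.{Row, SQLContext}

  // Hypothetical schema, just for this example.
  case class Record(key: Int, value: String)

  val sqlContext = new SQLContext(sc)
  import sqlContext.createSchemaRDD

  val records = sc.parallelize(
    Seq(Record(1, "a"), Record(1, "a"), Record(2, "b")))
  records.registerAsTable("records")

  // DISTINCT in SQL returns a SchemaRDD, so the schema is preserved.
  val distinct = sqlContext.sql("SELECT DISTINCT key, value FROM records")

  // The Row extractor avoids explicit asInstanceOf casts on the contents.
  val pairs = distinct.map { case Row(key: Int, value: String) => (key, value) }

Since the SQL query comes back as a SchemaRDD, you can keep chaining SQL or
register it as another table without losing the schema.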

Michael


On Wed, Apr 9, 2014 at 4:05 PM, Jan-Paul Bultmann <janpaulbultm...@me.com> wrote:

> Hey,
> My application requires the use of "classical" RDD methods like `distinct`
> and `subtract` on SchemaRDDs.
> What is the preferred way to turn the resulting regular
> RDD[org.apache.spark.sql.Row] back into SchemaRDDs?
> Calling toSchemaRDD will not work, as the schema information seems lost
> already.
> To make matters even more complicated, the contents of Row are Any-typed.
>
> So to make this work, one has to map over the result RDD, call
> `asInstanceOf` on the contents, and then put that back into case classes,
> which seems like overkill to me.
>
> Is there a better way, one that maybe employs some smart casting or reuse
> of the schema information?
>
> All the best,
> Jan
