Re: Nullable is true for the schema of parquet data

2015-05-10 Thread dsgriffin
Ran into this same issue. The only solution seems to be to coerce the DataFrame's schema back into the right state. It looks like you have to convert the DF to an RDD, which adds some overhead, but otherwise this worked for me: val newDF = sqlContext.createDataFrame(origDF.rdd, new
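
The snippet in the message above is cut off, so what follows is not the poster's code. It is a minimal sketch of the same idea, assuming the goal is to rebuild the DataFrame from its RDD with a schema whose fields carry the nullability you want. The helper name withNullability is hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.types.StructType

// Hypothetical helper (not from the original message): rebuild `df` with
// every top-level field's `nullable` flag set to the given value.
def withNullability(sqlContext: SQLContext, df: DataFrame, nullable: Boolean): DataFrame = {
  // StructType is a Seq[StructField]; copy each field with the new flag.
  val schema = StructType(df.schema.map(_.copy(nullable = nullable)))
  // Going through df.rdd is the conversion overhead mentioned above.
  sqlContext.createDataFrame(df.rdd, schema)
}
```

Usage would look like val newDF = withNullability(sqlContext, origDF, nullable = false). Note that this only rewrites the schema metadata of top-level fields; it does not check that the data actually contains no nulls.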

Re: How to add a column to a spark RDD with many columns?

2015-05-02 Thread dsgriffin
val newRdd = myRdd.map(row => row ++ Array((row(1).toLong * row(199).toLong).toString)) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-a-column-to-a-spark-RDD-with-many-columns-tp22729p22735.html Sent from the Apache Spark User List mailing list
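
To make the one-liner above easier to try out, here is a small sketch that uses a plain Scala Seq as a stand-in for the RDD, assuming each row is an Array[String] (which the ++ Array(...) and .toLong calls suggest). The sample data, the helper name appendProduct, and the column indices 1 and 2 are illustrative assumptions; the post itself uses columns 1 and 199.

```scala
// Stand-in for the RDD: each row is an Array[String]. The same function
// can be passed to RDD.map, since the per-row logic is identical.
val rows: Seq[Array[String]] = Seq(
  Array("a", "3", "4"),
  Array("b", "5", "6")
)

// Hypothetical helper: append the product of columns i and j as a new
// trailing column, keeping everything as strings.
def appendProduct(row: Array[String], i: Int, j: Int): Array[String] =
  row ++ Array((row(i).toLong * row(j).toLong).toString)

// Each row gains one extra column; the original columns are untouched.
val widened: Seq[Array[String]] = rows.map(row => appendProduct(row, 1, 2))
```

With the sample rows above, the first widened row is Array("a", "3", "4", "12"). A value that does not parse as a Long would throw a NumberFormatException, so real data may need validation first.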

Re: Drop a column from the DataFrame.

2015-05-02 Thread dsgriffin
Just use select() to create a new DataFrame with only the columns you want. It is sort of the opposite of a drop, but you can select all of the columns minus the one you don't want. You could even use a filter to remove just the one column you don't want on the fly:
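
The message is truncated before its own example, so the following is only a sketch of the approach described: filter the list of column names and pass the survivors to select(). The helper name dropColumn is hypothetical.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper (not from the original message): return a new
// DataFrame containing every column of `df` except `unwanted`.
def dropColumn(df: DataFrame, unwanted: String): DataFrame = {
  // df.columns is an Array[String] of the column names, in order.
  val kept = df.columns.filter(_ != unwanted)
  // df(name) looks up a Column by name; select takes Column varargs.
  df.select(kept.map(name => df(name)): _*)
}
```

For example, dropColumn(df, "b") on a DataFrame with columns a, b, c would return one with columns a and c. Spark 1.4 and later also offer a built-in drop method on DataFrame, which makes this helper unnecessary there.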

Re: RDD.filter vs. RDD.join--advice please

2015-04-22 Thread dsgriffin
Test it out, but I would be willing to bet the join is going to be a good deal faster.