Inconsistent Persistence of DataFrames in Spark 1.5

2015-10-28 Thread Colin Alstad
e is that there is an order of magnitude difference between the count of the join DataFrame and the count of the persisted join DataFrame. Secondly, persisting the same DataFrame into two different formats yields different results. Does anyone have any idea what could be going on here? -- Colin Alstad

Issue with PySpark UDF on a column of Vectors

2015-06-17 Thread Colin Alstad
I am having trouble using a UDF on a column of Vectors in PySpark which can be illustrated here:

from pyspark import SparkContext
from pyspark.sql import Row
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import udf
from pyspark.mllib.linalg import Vectors
FeatureRow = Row('i