e is that there is an order of magnitude difference
between the count of the join DataFrame and the persisted join DataFrame.
Secondly, persisting the same DataFrame in two different formats yields
different results.
Does anyone have any idea what could be going on here?
--
Colin Alstad
I am having trouble using a UDF on a column of Vectors in PySpark, which can
be illustrated here:
from pyspark import SparkContext
from pyspark.sql import Row
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import udf
from pyspark.mllib.linalg import Vectors
FeatureRow = Row('i