Hi,
I am trying to join two DataFrames, and I am able to display the results in
the console after the join. I am then saving the joined data in CSV format
using the spark-csv API. It is saving just the column names, not the data
at all.

Below is the sample code for reference:

spark-shell   --packages com.databricks:spark-csv_2.10:1.1.0  --master
> yarn-client --driver-memory 512m --executor-memory 512m
>
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.sql.hive.orc._
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> import org.apache.spark.sql.types.{StructType, StructField, StringType,
> IntegerType, FloatType, LongType, TimestampType, NullType}
>
> val firstSchema = StructType(Seq(StructField("COLUMN1", StringType,
> true),StructField("COLUMN2", StringType, true),StructField("COLUMN2",
> StringType, true),StructField("COLUMN3", StringType, true),
> StructField("COLUMN4", StringType, true),StructField("COLUMN5",
> StringType, true)))
> val file1df =
> hiveContext.read.format("com.databricks.spark.csv").option("header",
> "true").schema(firstSchema).load("/tmp/File1.csv")
>
>
> val secondSchema = StructType(Seq(
> StructField("COLUMN1", StringType, true),
> StructField("COLUMN2", NullType  , true),
> StructField("COLUMN3", TimestampType , true),
> StructField("COLUMN4", TimestampType , true),
> StructField("COLUMN5", NullType , true),
> StructField("COLUMN6", StringType, true),
> StructField("COLUMN7", IntegerType, true),
> StructField("COLUMN8", IntegerType, true),
> StructField("COLUMN9", StringType, true),
> StructField("COLUMN10", IntegerType, true),
> StructField("COLUMN11", IntegerType, true),
> StructField("COLUMN12", IntegerType, true)))
>
>
> val file2df =
> hiveContext.read.format("com.databricks.spark.csv").option("header",
> "false").schema(secondSchema).load("/tmp/file2.csv")
> val joineddf = file1df.join(file2df, file1df("COLUMN1") ===
> file2df("COLUMN6"))
> val selecteddata = joineddf.select(file1df("COLUMN2"),file2df("COLUMN10"))
>
// The statement below prints the joined data

> joineddf.collect.foreach(println)
>


> // This statement saves the CSV file, but only the column names mentioned
> // in the select above are being saved
> selecteddata.write.format("com.databricks.spark.csv").option("header",
> "true").save("/tmp/JoinedData.csv")
>


I would really appreciate any pointers or help.

Thanks,
Divya
