I'm trying to build a simple DataFrame that can be used for ML:

                SparkConf conf = new SparkConf().setAppName("Simple prediction from Text File").setMaster("local");
                SparkContext sc = new SparkContext(conf);
                SQLContext sqlContext = new SQLContext(sc);

                sqlContext.udf().register("vectorBuilder", new VectorBuilder(), new VectorUDT());

                String filename = "data/tuple-data-file.csv";
                StructType schema = new StructType(new StructField[] {
                                new StructField("C0", DataTypes.StringType, false, Metadata.empty()),
                                new StructField("C1", DataTypes.IntegerType, false, Metadata.empty()),
                                new StructField("features", new VectorUDT(), false, Metadata.empty()) });

                DataFrame df = sqlContext.read().format("com.databricks.spark.csv")
                                .schema(schema).option("header", "false").load(filename);
                df = df.withColumn("label", df.col("C0")).drop("C0");
                df = df.withColumn("value", df.col("C1")).drop("C1");
                df.printSchema();
Returns:
root
 |-- features: vector (nullable = false)
 |-- label: string (nullable = false)
 |-- value: integer (nullable = false)
                df.show();
Returns:

java.lang.RuntimeException: Unsupported type: vector
        at com.databricks.spark.csv.util.TypeCast$.castTo(TypeCast.scala:76)
        at com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:194)
        at com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:173)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:350)
        at scala.collection.Iterator$class.foreach(Iterator.scala:742)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
        at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
        at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
        at scala.collection.AbstractIterator.to(Iterator.scala:1194)
        at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
        at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
        at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
        at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
16/07/24 12:46:01 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.RuntimeException: Unsupported type: vector
        ... (same stack trace as above)

16/07/24 12:46:01 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
16/07/24 12:46:01 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/07/24 12:46:01 INFO TaskSchedulerImpl: Cancelling stage 0
16/07/24 12:46:01 INFO DAGScheduler: ResultStage 0 (show at SimplePredictionFromTextFile.java:50) failed in 0.979 s
16/07/24 12:46:01 INFO DAGScheduler: Job 0 failed: show at SimplePredictionFromTextFile.java:50, took 1.204903 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.RuntimeException: Unsupported type: vector
        ... (same stack trace as above)

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
        at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:212)
        at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
        at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
        at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
        at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
        at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
        at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
        at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
        at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
        at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
        at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
        at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
        at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
        at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:170)
        at org.apache.spark.sql.DataFrame.show(DataFrame.scala:350)
        at org.apache.spark.sql.DataFrame.show(DataFrame.scala:311)
        at org.apache.spark.sql.DataFrame.show(DataFrame.scala:319)
        at net.jgp.labs.spark.SimplePredictionFromTextFile.start(SimplePredictionFromTextFile.java:50)
        at net.jgp.labs.spark.SimplePredictionFromTextFile.main(SimplePredictionFromTextFile.java:29)
Caused by: java.lang.RuntimeException: Unsupported type: vector
        ... (same stack trace as above)
16/07/24 12:46:01 INFO SparkContext: Invoking stop() from shutdown hook
16/07/24 12:46:01 INFO BlockManagerInfo: Removed broadcast_0_piece0 on localhost:52970 in memory (size: 9.8 KB, free: 1140.4 MB)
16/07/24 12:46:01 INFO BlockManagerInfo: Removed broadcast_1_piece0 on localhost:52970 in memory (size: 5.4 KB, free: 1140.4 MB)
16/07/24 12:46:01 INFO ContextCleaner: Cleaned accumulator 3
16/07/24 12:46:01 INFO ContextCleaner: Cleaned accumulator 2
16/07/24 12:46:01 INFO SparkUI: Stopped Spark web UI at http://10.0.100.24:4040
16/07/24 12:46:01 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/07/24 12:46:01 INFO MemoryStore: MemoryStore cleared
16/07/24 12:46:01 INFO BlockManager: BlockManager stopped
16/07/24 12:46:01 INFO BlockManagerMaster: BlockManagerMaster stopped
16/07/24 12:46:01 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/07/24 12:46:01 INFO SparkContext: Successfully stopped SparkContext
16/07/24 12:46:01 INFO ShutdownHookManager: Shutdown hook called
16/07/24 12:46:01 INFO ShutdownHookManager: Deleting directory /private/var/folders/vs/kl6qlcvx30707d07txrm_xnw0000gn/T/spark-bc6bb3dc-abc5-4db0-b00c-a17f5b8f9637


Any clue as to why it says it can't dump the vector? I've seen other examples where this did not seem to be a problem...
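
In case it helps, here is the kind of workaround I was considering: declare only the plain CSV columns in the schema, load the file, and only then build the "features" column by calling the UDF I registered above. This is just a sketch under my own assumptions (in particular, that my VectorBuilder UDF takes the integer column as its single argument):

                import static org.apache.spark.sql.functions.callUDF;

                // Schema limited to types spark-csv can actually parse from text
                StructType schema = new StructType(new StructField[] {
                                new StructField("C0", DataTypes.StringType, false, Metadata.empty()),
                                new StructField("C1", DataTypes.IntegerType, false, Metadata.empty()) });

                DataFrame df = sqlContext.read().format("com.databricks.spark.csv")
                                .schema(schema).option("header", "false").load(filename);

                // Build the vector column after loading, via the registered UDF
                // (assumes VectorBuilder takes the integer column as its only argument)
                df = df.withColumn("features", callUDF("vectorBuilder", df.col("C1")));
                df = df.withColumnRenamed("C0", "label").withColumnRenamed("C1", "value");
                df.show();

Does that sound like the right direction, or is there a way to have spark-csv build the vector column directly?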

