Re: java.lang.UnsupportedOperationException: empty collection

2015-04-28 Thread Robineast
I've tried running your code through spark-shell on both 1.3.0 (pre-built for
Hadoop 2.4 and above) and a recently built snapshot of master. Both work
fine, running on OS X Yosemite. What's your configuration?
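
For anyone hitting the same thing, a quick sanity check in the same spark-shell session might look like the sketch below (assuming the shell's `sc` and the `training` RDD from the post further down); it just confirms which build is running and that the input is not empty:

// Hedged sketch: confirm the Spark build and that the training RDD is non-empty
// before calling fit. `training` is the RDD defined in the original post below.
println(sc.version)                 // the Spark version the shell is actually running
println(sc.getConf.toDebugString)   // the effective Spark configuration
println(training.count())           // should be 4 for the example data
println(training.toDF().count())    // the DataFrame handed to the estimator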






java.lang.UnsupportedOperationException: empty collection

2015-04-27 Thread xweb
I am running the following code on Spark 1.3.0. It is from
https://spark.apache.org/docs/1.3.0/ml-guide.html
When I run val model1 = lr.fit(training.toDF), I get
java.lang.UnsupportedOperationException: empty collection

What could be the reason?



import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.{Row, SQLContext}

val conf = new SparkConf().setAppName("SimpleParamsExample")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Prepare training data.
// We use LabeledPoint, which is a case class. Spark SQL can convert RDDs of case classes
// into DataFrames, where it uses the case class metadata to infer the schema.
val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.3, 1.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5))))

// Create a LogisticRegression instance.  This instance is an Estimator.
val lr = new LogisticRegression()
// Print out the parameters, documentation, and any default values.
println("LogisticRegression parameters:\n" + lr.explainParams() + "\n")

// We may set parameters using setter methods.
lr.setMaxIter(10)
  .setRegParam(0.01)

// Learn a LogisticRegression model.  This uses the parameters stored in lr.
*val model1 = lr.fit(training.toDF)*


*Some more information:*
scala> training.toDF
res26: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> training.toDF.collect()
res27: Array[org.apache.spark.sql.Row] = Array([1.0,[0.0,1.1,0.1]],
[0.0,[2.0,1.0,-1.0]], [0.0,[2.0,1.3,1.0]], [1.0,[0.0,1.2,-0.5]])
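
For context, the "empty collection" message itself is what Spark actions such as first() or reduce() throw when the underlying RDD has no elements (the stack trace in the left-outer-join thread below shows RDD.first doing exactly that). A minimal, hedged reproduction in spark-shell, independent of the ML pipeline:

// Reproduces the same exception text directly: first() on an empty RDD.
val empty = sc.parallelize(Seq.empty[Int])
empty.first()  // java.lang.UnsupportedOperationException: empty collection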







RE: spark left outer join with java.lang.UnsupportedOperationException: empty collection

2015-02-12 Thread java8964
OK, I think I have to use None instead of null; then it works. I'm still switching
over from Java.
I can also just use the field names, as I assumed.
Great experience.
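
For what it's worth, a minimal sketch of what that fix presumably looks like (same assumptions as the quoted code below: spark-shell, two CSV inputs whose first field is the ID). leftOuterJoin puts an Option on the right-hand side, so the filter should test with isEmpty (or == None) rather than null, and the case-class fields can then be referenced by name:

case class origAsLeft(id: String, name: String)
case class newAsRight(id: String, name: String)

val OrigData = sc.textFile("hdfs://firstfile").map(_.split(",")).map(r => (r(0), origAsLeft(r(0), r(1))))
val NewData  = sc.textFile("hdfs://secondfile").map(_.split(",")).map(r => (r(0), newAsRight(r(0), r(1))))

// IDs present in the first file but missing from the second: the right side of
// a left outer join is an Option, so test with isEmpty (or == None), never null.
val missing = OrigData.leftOuterJoin(NewData).filter { case (_, (_, right)) => right.isEmpty }

// Same ID in both files but a different name: fields are accessed by name.
val changed = OrigData.join(NewData).filter { case (_, (left, right)) => left.name != right.name }

// Note: first() on an empty result still throws "empty collection"; take(1) does not.
missing.take(1).foreach(println)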

From: java8...@hotmail.com
To: user@spark.apache.org
Subject: spark left outer join with java.lang.UnsupportedOperationException: 
empty collection
Date: Thu, 12 Feb 2015 18:06:43 -0500




Hi,
I am using Spark 1.2.0 with Hadoop 2.2. I have 2 CSV files, each with 8
fields. I know that the first field in both files is an ID. I want to find all
the IDs that exist in the first file but NOT in the 2nd file.
I came up with the following code in spark-shell:

case class origAsLeft(id: String)
case class newAsRight(id: String)
val OrigData = sc.textFile("hdfs://firstfile").map(_.split(",")).map( r => (r(0), origAsLeft(r(0))))
val NewData = sc.textFile("hdfs://secondfile").map(_.split(",")).map( r => (r(0), newAsRight(r(0))))
val output = OrigData.leftOuterJoin(NewData).filter{ case (k, v) => v._2 == null }
From what I understand, after OrigData is left-outer-joined with NewData, Spark will
use the id as the key and produce a tuple of (leftObject, rightObject). Since I want
to find all the IDs that exist in the first file but not in the 2nd one, the output
RDD should be the one I want, as the filter keeps only the records for which there is
no newAsRight object from NewData.
Then I run 
output.first
Spark does start to run, but gives me the following error message:

15/02/12 16:43:38 INFO scheduler.DAGScheduler: Job 4 finished: first at <console>:21, took 78.303549 s
java.lang.UnsupportedOperationException: empty collection
    at org.apache.spark.rdd.RDD.first(RDD.scala:1095)
    at $iwC$$iwC$$iwC$$iwC.<init>(<console>:21)
    at $iwC$$iwC$$iwC.<init>(<console>:26)
    at $iwC$$iwC.<init>(<console>:28)
    at $iwC.<init>(<console>:30)
    at <init>(<console>:32)
    at .<init>(<console>:36)
    at .<clinit>(<console>)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:94)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
    at java.lang.reflect.Method.invoke(Method.java:619)
    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
    at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)
    at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669)
    at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828)
    at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:873)
    at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:785)
    at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:628)
    at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:636)
    at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:641)
    at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:968)
    at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)
    at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)
    at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
    at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:916)
    at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1011)
    at org.apache.spark.repl.Main$.main(Main.scala:31)
    at org.apache.spark.repl.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:94)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
    at java.lang.reflect.Method.invoke(Method.java:619)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Did I do anything wrong? What is the right way to find all the IDs that are in the
first file but not in the 2nd file?
My second question is: how can I use the fields of the case classes to do the
comparison in this case? For example, if I define:

case class origAsLeft(id: String, name: String)
case class newAsRight(id: String, name: String)
val OrigData = sc.textFile("hdfs://firstfile").map(_.split(",")).map( r => (r(0), origAsLeft(r(0), r(1))))
val NewData = sc.textFile("hdfs://secondfile").map(_.split(",")).map( r => (r(0), newAsRight(r(0), r(1))))
// In this case I want to list all the data in the first file that has the same ID
// as in the 2nd file but a different value in name, so I want to do something like below:
val output = OrigData.join(NewData).filter{ case (k, v) => v._1.name != v._2.name }

But what is the syntax to use the fields of the case class I defined?
Thanks
Yong
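
(A possible alternative for the first question, sketched under the same assumptions about OrigData and NewData as above: subtractByKey returns the left-side records whose keys never appear in the right-side RDD, without needing a join-and-filter.)

// Hedged alternative: keys in OrigData that do not occur in NewData.
val onlyInFirst = OrigData.subtractByKey(NewData)
onlyInFirst.keys.take(10).foreach(println)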
  
