SparkSQL exception on spark.sql.codegen

2014-11-15 Thread Eric Zhen
Hi all,

We ran SparkSQL on TPC-DS benchmark query Q19 with spark.sql.codegen=true and got
the exception below. Has anyone else seen this before?

java.lang.ExceptionInInitializerError
at org.apache.spark.sql.execution.SparkPlan.newProjection(SparkPlan.scala:92)
at org.apache.spark.sql.execution.Exchange$$anonfun$execute$1$$anonfun$1.apply(Exchange.scala:51)
at org.apache.spark.sql.execution.Exchange$$anonfun$execute$1$$anonfun$1.apply(Exchange.scala:48)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.NullPointerException
at scala.reflect.internal.Types$TypeRef.computeHashCode(Types.scala:2358)
at scala.reflect.internal.Types$UniqueType.<init>(Types.scala:1304)
at scala.reflect.internal.Types$TypeRef.<init>(Types.scala:2341)
at scala.reflect.internal.Types$NoArgsTypeRef.<init>(Types.scala:2137)
at scala.reflect.internal.Types$TypeRef$$anon$6.<init>(Types.scala:2544)
at scala.reflect.internal.Types$TypeRef$.apply(Types.scala:2544)
at scala.reflect.internal.Types$class.typeRef(Types.scala:3615)
at scala.reflect.internal.SymbolTable.typeRef(SymbolTable.scala:13)
at scala.reflect.internal.Symbols$TypeSymbol.newTypeRef(Symbols.scala:2752)
at scala.reflect.internal.Symbols$TypeSymbol.typeConstructor(Symbols.scala:2806)
at scala.reflect.internal.Symbols$SymbolContextApiImpl.toTypeConstructor(Symbols.scala:103)
at scala.reflect.internal.Symbols$TypeSymbol.toTypeConstructor(Symbols.scala:2698)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$typecreator1$1.apply(CodeGenerator.scala:46)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335)
at scala.reflect.api.Universe.typeOf(Universe.scala:59)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.<init>(CodeGenerator.scala:46)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateProjection$.<init>(GenerateProjection.scala:29)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateProjection$.<clinit>(GenerateProjection.scala)
... 15 more
-- 
Best Regards


Re: SparkSQL exception on cached parquet table

2014-11-15 Thread sadhan
Hi Cheng,

Thanks for your response. Here is the stack trace from the yarn logs:





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-exception-on-cached-parquet-table-tp18978p19020.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: repartition combined with zipWithIndex get stuck

2014-11-15 Thread Xiangrui Meng
PR: https://github.com/apache/spark/pull/3291 . For now, here is a workaround:

val a = sc.parallelize(1 to 10).zipWithIndex()
a.partitions // call .partitions explicitly
a.repartition(10).count()

Thanks for reporting the bug! -Xiangrui



On Sat, Nov 15, 2014 at 8:38 PM, Xiangrui Meng  wrote:
> I think I understand where the bug is now. I created a JIRA
> (https://issues.apache.org/jira/browse/SPARK-4433) and will make a PR
> soon. -Xiangrui
>
> On Sat, Nov 15, 2014 at 7:39 PM, Xiangrui Meng  wrote:
>> This is a bug. Could you make a JIRA? -Xiangrui
>>
>> On Sat, Nov 15, 2014 at 3:27 AM, lev  wrote:
>>> Hi,
>>>
>>> I'm having trouble using both zipWithIndex and repartition. When I use them
>>> both, the following action will get stuck and won't return.
>>> I'm using spark 1.1.0.
>>>
>>>
>>> Those 2 lines work as expected:
>>>
>>> scala> sc.parallelize(1 to 10).repartition(10).count()
>>> res0: Long = 10
>>>
>>> scala> sc.parallelize(1 to 10).zipWithIndex.count()
>>> res1: Long = 10
>>>
>>>
>>> But this statement get stuck and doesn't return:
>>>
>>> scala> sc.parallelize(1 to 10).zipWithIndex.repartition(10).count()
>>> 14/11/15 03:18:55 INFO spark.SparkContext: Starting job: apply at
>>> Option.scala:120
>>> 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Got job 3 (apply at
>>> Option.scala:120) with 3 output partitions (allowLocal=false)
>>> 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Final stage: Stage 4(apply at
>>> Option.scala:120)
>>> 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Parents of final stage:
>>> List()
>>> 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Missing parents: List()
>>> 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Submitting Stage 4
>>> (ParallelCollectionRDD[7] at parallelize at :13), which has no
>>> missing parents
>>> 14/11/15 03:18:55 INFO storage.MemoryStore: ensureFreeSpace(1096) called
>>> with curMem=7616, maxMem=138938941
>>> 14/11/15 03:18:55 INFO storage.MemoryStore: Block broadcast_4 stored as
>>> values in memory (estimated size 1096.0 B, free 132.5 MB)
>>>
>>>
>>> Am I doing something wrong here or is it a bug?
>>> Is there some work around?
>>>
>>> Thanks,
>>> Lev.
>>>
>>>
>>>
>>> --
>>> View this message in context: 
>>> http://apache-spark-user-list.1001560.n3.nabble.com/repartition-combined-with-zipWithIndex-get-stuck-tp18999.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>




Re: repartition combined with zipWithIndex get stuck

2014-11-15 Thread Xiangrui Meng
I think I understand where the bug is now. I created a JIRA
(https://issues.apache.org/jira/browse/SPARK-4433) and will make a PR
soon. -Xiangrui

On Sat, Nov 15, 2014 at 7:39 PM, Xiangrui Meng  wrote:
> This is a bug. Could you make a JIRA? -Xiangrui
>
> On Sat, Nov 15, 2014 at 3:27 AM, lev  wrote:
>> Hi,
>>
>> I'm having trouble using both zipWithIndex and repartition. When I use them
>> both, the following action will get stuck and won't return.
>> I'm using spark 1.1.0.
>>
>>
>> Those 2 lines work as expected:
>>
>> scala> sc.parallelize(1 to 10).repartition(10).count()
>> res0: Long = 10
>>
>> scala> sc.parallelize(1 to 10).zipWithIndex.count()
>> res1: Long = 10
>>
>>
>> But this statement get stuck and doesn't return:
>>
>> scala> sc.parallelize(1 to 10).zipWithIndex.repartition(10).count()
>> 14/11/15 03:18:55 INFO spark.SparkContext: Starting job: apply at
>> Option.scala:120
>> 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Got job 3 (apply at
>> Option.scala:120) with 3 output partitions (allowLocal=false)
>> 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Final stage: Stage 4(apply at
>> Option.scala:120)
>> 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Parents of final stage:
>> List()
>> 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Missing parents: List()
>> 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Submitting Stage 4
>> (ParallelCollectionRDD[7] at parallelize at :13), which has no
>> missing parents
>> 14/11/15 03:18:55 INFO storage.MemoryStore: ensureFreeSpace(1096) called
>> with curMem=7616, maxMem=138938941
>> 14/11/15 03:18:55 INFO storage.MemoryStore: Block broadcast_4 stored as
>> values in memory (estimated size 1096.0 B, free 132.5 MB)
>>
>>
>> Am I doing something wrong here or is it a bug?
>> Is there some work around?
>>
>> Thanks,
>> Lev.
>>
>>
>>
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/repartition-combined-with-zipWithIndex-get-stuck-tp18999.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>




Re: repartition combined with zipWithIndex get stuck

2014-11-15 Thread Xiangrui Meng
This is a bug. Could you make a JIRA? -Xiangrui

On Sat, Nov 15, 2014 at 3:27 AM, lev  wrote:
> Hi,
>
> I'm having trouble using both zipWithIndex and repartition. When I use them
> both, the following action will get stuck and won't return.
> I'm using spark 1.1.0.
>
>
> Those 2 lines work as expected:
>
> scala> sc.parallelize(1 to 10).repartition(10).count()
> res0: Long = 10
>
> scala> sc.parallelize(1 to 10).zipWithIndex.count()
> res1: Long = 10
>
>
> But this statement get stuck and doesn't return:
>
> scala> sc.parallelize(1 to 10).zipWithIndex.repartition(10).count()
> 14/11/15 03:18:55 INFO spark.SparkContext: Starting job: apply at
> Option.scala:120
> 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Got job 3 (apply at
> Option.scala:120) with 3 output partitions (allowLocal=false)
> 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Final stage: Stage 4(apply at
> Option.scala:120)
> 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Parents of final stage:
> List()
> 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Missing parents: List()
> 14/11/15 03:18:55 INFO scheduler.DAGScheduler: Submitting Stage 4
> (ParallelCollectionRDD[7] at parallelize at :13), which has no
> missing parents
> 14/11/15 03:18:55 INFO storage.MemoryStore: ensureFreeSpace(1096) called
> with curMem=7616, maxMem=138938941
> 14/11/15 03:18:55 INFO storage.MemoryStore: Block broadcast_4 stored as
> values in memory (estimated size 1096.0 B, free 132.5 MB)
>
>
> Am I doing something wrong here or is it a bug?
> Is there some work around?
>
> Thanks,
> Lev.
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/repartition-combined-with-zipWithIndex-get-stuck-tp18999.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>




Re: Client application that calls Spark and receives an MLlib model Scala Object and then predicts without Spark installed on hadoop

2014-11-15 Thread Xiangrui Meng
If Spark is not installed on the client side, you won't be able to
deserialize the model. Instead of serializing the model object, you
may serialize the model weights array and implement predict on the
client side. -Xiangrui
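
To make that concrete, here is a rough sketch of the kind of client-side code
Xiangrui is suggesting. It is only an illustration: the file layout (the weights
followed by the intercept, one value per line, written by the driver from
model.weights.toArray and model.intercept) and the logistic link are assumptions,
not something stated in the thread.

// Hypothetical client-side scorer with no Spark dependency.
// Assumes the driver saved weights then intercept, one value per line.
object ClientSidePredict {
  def loadWeights(path: String): (Array[Double], Double) = {
    val values = scala.io.Source.fromFile(path).getLines().map(_.toDouble).toArray
    (values.init, values.last) // weights, then intercept
  }

  def predict(features: Array[Double], weights: Array[Double], intercept: Double): Double = {
    val margin = features.zip(weights).map { case (x, w) => x * w }.sum + intercept
    1.0 / (1.0 + math.exp(-margin)) // logistic link; drop this line for a linear model
  }
}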

On Fri, Nov 14, 2014 at 2:54 PM, xiaoyan yu  wrote:
> I had the same need as those documented back to July archived at
> http://qnalist.com/questions/5013193/client-application-that-calls-spark-and-receives-an-mllib-model-scala-object-not-just-result.
>
> I wonder if anyone would like to share any successful stories.
>
>
> Thanks,
> Xiaoyan




Re: How to incrementally compile spark examples using mvn

2014-11-15 Thread Marcelo Vanzin
I haven't tried scala:cc, but you can ask maven to just build a
particular sub-project. For example:

  mvn -pl :spark-examples_2.10 compile

On Sat, Nov 15, 2014 at 5:31 PM, Yiming (John) Zhang  wrote:
> Hi,
>
>
>
> I have already successfully compile and run spark examples. My problem is
> that if I make some modifications (e.g., on SparkPi.scala or LogQuery.scala)
> I have to use “mvn -DskipTests package” to rebuild the whole spark project
> and wait a relatively long time.
>
>
>
> I also tried “mvn scala:cc” as described in
> http://spark.apache.org/docs/latest/building-with-maven.html, but I could
> only get infinite stop like:
>
> [INFO] --- scala-maven-plugin:3.2.0:cc (default-cli) @ spark-parent ---
>
> [INFO] wait for files to compile...
>
>
>
> Is there any method to incrementally compile the examples using mvn? Thank
> you!
>
>
>
> Cheers,
>
> Yiming



-- 
Marcelo




How to incrementally compile spark examples using mvn

2014-11-15 Thread Yiming (John) Zhang
Hi,

 

I have already successfully compiled and run the Spark examples. My problem is
that if I make some modifications (e.g., on SparkPi.scala or LogQuery.scala)
I have to use "mvn -DskipTests package" to rebuild the whole Spark project
and wait a relatively long time.

 

I also tried "mvn scala:cc" as described in
http://spark.apache.org/docs/latest/building-with-maven.html, but I could
only get infinite stop like:

[INFO] --- scala-maven-plugin:3.2.0:cc (default-cli) @ spark-parent ---

[INFO] wait for files to compile...

 

Is there any method to incrementally compile the examples using mvn? Thank
you!

 

Cheers,

Yiming



Pagerank implementation

2014-11-15 Thread tom85
Hi,

I wonder if the PageRank implementation is correct. More specifically, I am
looking at the following function from PageRank.scala, which is given to
Pregel:

def vertexProgram(id: VertexId, attr: (Double, Double), msgSum: Double):
(Double, Double) = {
  val (oldPR, lastDelta) = attr
  val newPR = oldPR + (1.0 - resetProb) * msgSum
  (newPR, newPR - oldPR)
}

This line: val newPR = oldPR + (1.0 - resetProb) * msgSum
makes no sense to me. Should it not be:
val newPR = resetProb/graph.vertices.count() + (1.0 - resetProb) * msgSum 
?

Background: I wanted to implement pagerank with the damping factor (here:
resetProb) divided by the number of nodes in the graph.
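
For reference, the per-vertex update Tom is describing could be sketched as below
(assuming msgSum already carries the sum of rank/outDegree over in-neighbours, as
in the Pregel messages above). GraphX is usually described as computing ranks
scaled by the number of vertices, which is why its reset term is resetProb rather
than resetProb / N; treat that as a hedged observation, not a statement from the
thread.

// Sketch of the textbook-style update for comparison; not GraphX code.
def textbookUpdate(msgSum: Double, resetProb: Double, numVertices: Long): Double =
  resetProb / numVertices + (1.0 - resetProb) * msgSum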

Tom



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Pagerank-implementation-tp19013.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Using data in RDD to specify HDFS directory to write to

2014-11-15 Thread jschindler
UPDATE

I have removed and added things systematically to the job and have figured out
that the inclusion of the construction of the SparkContext object is what is
causing it to fail.

The last run contained the code below.

I keep losing executors and I'm not sure why. Some of the relevant Spark output
is below; I will add more on Monday as I must go participate in weekend
activities.

 14/11/15 14:53:43 INFO SparkDeploySchedulerBackend: Granted executor ID
app-20141115145328-0025/3 on hostPort cloudera01.local.company.com:7078 with
8 cores, 512.0 MB RAM
14/11/15 14:53:43 INFO AppClient$ClientActor: Executor updated:
app-20141115145328-0025/3 is now RUNNING
14/11/15 14:53:46 INFO MemoryStore: ensureFreeSpace(1063) called with
curMem=1063, maxMem=309225062
14/11/15 14:53:46 INFO MemoryStore: Block input-0-1416084826000 stored as
bytes to memory (size 1063.0 B, free 294.9 MB)
14/11/15 14:53:46 INFO BlockManagerInfo: Added input-0-1416084826000 in
memory on cloudera01.local.company.com:49902 (size: 1063.0 B, free: 294.9
MB)
14/11/15 14:53:46 INFO BlockManagerMaster: Updated info of block
input-0-1416084826000
14/11/15 14:53:46 WARN BlockManager: Block input-0-1416084826000 already
exists on this machine; not re-adding it
14/11/15 14:53:46 INFO BlockGenerator: Pushed block input-0-1416084826000
14/11/15 14:53:46 INFO SparkDeploySchedulerBackend: Registered executor:
Actor[akka.tcp://sparkexecu...@cloudera01.local.company.com:52715/user/Executor#-1518587721]
with ID 3
14/11/15 14:53:47 INFO BlockManagerInfo: Registering block manager
cloudera01.local.company.com:46926 with 294.9 MB RAM
14/11/15 14:53:47 INFO SparkDeploySchedulerBackend: Executor 3 disconnected,
so removing it
14/11/15 14:53:47 ERROR TaskSchedulerImpl: Lost an executor 3 (already
removed): remote Akka client disassociated
14/11/15 14:53:47 INFO AppClient$ClientActor: Executor updated:
app-20141115145328-0025/3 is now EXITED (Command exited with code 1)
14/11/15 14:53:47 INFO SparkDeploySchedulerBackend: Executor
app-20141115145328-0025/3 removed: Command exited with code 1
14/11/15 14:53:47 INFO AppClient$ClientActor: Executor added:
app-20141115145328-0025/4 on
worker-20141114114152-cloudera01.local.company.com-7078
(cloudera01.local.company.com:7078) with 8 cores

BLOCK 2 - last block before app fails:

14/11/15 14:54:15 INFO BlockManagerInfo: Registering block manager
cloudera01.local.uship.com:34335 with 294.9 MB RAM
14/11/15 14:54:16 INFO SparkDeploySchedulerBackend: Executor 9 disconnected,
so removing it
14/11/15 14:54:16 ERROR TaskSchedulerImpl: Lost an executor 9 (already
removed): remote Akka client disassociated
14/11/15 14:54:16 INFO AppClient$ClientActor: Executor updated:
app-20141115145328-0025/9 is now EXITED (Command exited with code 1)
14/11/15 14:54:16 INFO SparkDeploySchedulerBackend: Executor
app-20141115145328-0025/9 removed: Command exited with code 1
14/11/15 14:54:16 ERROR SparkDeploySchedulerBackend: Application has been
killed. Reason: Master removed our application: FAILED
14/11/15 14:54:16 ERROR TaskSchedulerImpl: Exiting due to error from cluster
scheduler: Master removed our application: FAILED
[hdfs@cloudera01 root]$



import kafka.producer._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka._
import org.apache.spark.SparkConf
import org.apache.spark._

import org.json4s._
import org.json4s.native.JsonMethods._

import scala.collection.mutable.Map
import scala.collection.mutable.MutableList

case class Event(EventName: String, Payload: org.json4s.JValue)

object App {

  def main(args: Array[String]) {

val ssc = new StreamingContext("local[2]", "Data", Seconds(20))
ssc.checkpoint("checkpoint")


    val conf = new SparkConf().setMaster("spark://cloudera01.local.company.com:7077")
    val sc = new SparkContext(conf)



    val eventMap = scala.collection.immutable.Map("Events" -> 1)
    val pipe = KafkaUtils.createStream(ssc, "dockerrepo,dockerrepo,dockerrepo", "Cons1", eventMap).map(_._2)


val eventStream = pipe.map(data => {
  parse(data)
}).map(json => {


  implicit val formats = DefaultFormats
  val eventName = (json \ "event").extractOpt[String]
  Event(eventName.getOrElse("*** NO EVENT NAME ***"), json)

})


eventStream.foreach(x => {
  var arr = x.toArray
  x.foreachPartition(y => {
y.foreach(z => {print(z)})

  })
})


ssc.start()
ssc.awaitTermination()

  }

} 
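
For what it's worth, one way to avoid standing up a second context next to the
streaming one is to build the StreamingContext from a single SparkConf and reuse
the SparkContext it creates. The sketch below is only an illustration of that
pattern; the master URL and batch interval are copied from the snippet above and
not verified against this cluster.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setMaster("spark://cloudera01.local.company.com:7077")
  .setAppName("Data")
val ssc = new StreamingContext(conf, Seconds(20)) // builds the SparkContext internally
val sc = ssc.sparkContext                         // reuse that context instead of constructing a second one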





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Using-data-in-RDD-to-specify-HDFS-directory-to-write-to-tp18789p19012.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




using zip gets EOFError error

2014-11-15 Thread chocjy
I was trying to zip the rdd with another rdd. I store my matrix in HDFS and
load it as Ab_rdd = sc.textFile('data/Ab.txt', 100)

If I do
idx = sc.parallelize(range(m),100)  #m is the number of records in Ab.txt
print matrix_Ab.matrix.zip(idx).first()

I got the following error:

If I store my matrix (Ab.txt) locally and use sc.parallelize to create the
rdd, this error doesn’t appear. Does anyone know what's going on? Thanks!

Traceback (most recent call last):
  File "/home/jiyan/randomized-matrix-algorithms/spark/src/l2_exp.py", line 51, in <module>
    print test_obj.execute_l2(matrix_Ab,A,b,x_opt,f_opt)
  File "/home/jiyan/randomized-matrix-algorithms/spark/src/test_l2.py", line 35, in execute_l2
    ls.fit()
  File "/home/jiyan/randomized-matrix-algorithms/spark/src/least_squares.py", line 23, in fit
    x = self.projection.execute(self.matrix_Ab, 'solve')
  File "/home/jiyan/randomized-matrix-algorithms/spark/src/projections.py", line 26, in execute
    PA = self.__project(matrix, lim)
  File "/home/jiyan/randomized-matrix-algorithms/spark/src/projections.py", line 50, in __project
    print matrix.zip_with_index(self.sc).first()
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/pyspark/rdd.py", line 881, in first
    return self.take(1)[0]
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/pyspark/rdd.py", line 868, in take
    iterator = mapped._jrdd.collectPartitions(partitionsToTake)[0].iterator()
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o37.collectPartitions.
: java.lang.ClassCastException: [B cannot be cast to java.lang.String
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$4.apply(PythonRDD.scala:321)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$4.apply(PythonRDD.scala:319)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:319)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:203)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:178)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:178)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:177)


PySpark worker failed with exception:
Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/pyspark/worker.py", line 73, in main
    command = pickleSer._read_with_length(infile)
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/pyspark/serializers.py", line 142, in _read_with_length
    length = read_int(stream)
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/pyspark/serializers.py", line 337, in read_int
    raise EOFError
EOFError

14/11/15 00:36:17 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/pyspark/worker.py", line 73, in main
    command = pickleSer._read_with_length(infile)
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/pyspark/serializers.py", line 142, in _read_with_length
    length = read_int(stream)
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/pyspark/serializers.py", line 337, in read_int
    raise EOFError
EOFError

at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:118)
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:148)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:81)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:574)
at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:559)
Caused by: java.lang.ClassCastException: [B cannot be cast to java.lang.String
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$4.apply(PythonRDD.scala:321)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$4.apply(PythonRDD.scala:319)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.f

RE: Submitting Python Applications from Remote to Master

2014-11-15 Thread Ashic Mahtab
Hi Ognen,

Currently,
"Note that cluster mode is currently not supported for standalone clusters,
Mesos clusters, or python applications."

So it seems like Yarn + Scala is the only option for fire and forget. It
shouldn't be too hard to create a "proxy" submitter, but yes, that does involve
another process (potentially a server) on that side. I've heard good things about
Ooyala's server, but haven't got around to trying to set it up. As such, can't
really comment.

Regards,
Ashic.
> Date: Sat, 15 Nov 2014 09:50:14 -0600
> From: ognen.duzlev...@gmail.com
> To: as...@live.com
> CC: quasi...@gmail.com; user@spark.apache.org
> Subject: Re: Submitting Python Applications from Remote to Master
> 
> Ashic,
> 
> Thanks for your email.
> 
> Two things:
> 
> 1. I think a whole lot of data scientists and other people would love
> it if they could just fire off jobs from their laptops. It is, in my
> opinion, a common desired use case.
> 
> 2. Did anyone actually get the Ooyala job server to work? I asked that
> question 6 months ago and never got a straight answer. I ended up
> writing a middle-layer using Scalatra and actors to submit jobs via an
> API and receive results back in JSON. In that I ran into the inability
> to share the SparkContext "feature" and it took a lot of finagling to
> make things work (but it never felt "production ready").
> 
> Ognen
> 
> On Sat, Nov 15, 2014 at 03:36:43PM +, Ashic Mahtab wrote:
> > Hi Ben,I haven't tried it with Python, but the instructions are the same as 
> > for Scala compiled (jar) apps. What it's saying is that it's not possible 
> > to offload the entire work to the master (ala hadoop) in a fire and forget 
> > (or rather submit-and-forget) manner when running on stand alone. There are 
> > two deployment modes - client and cluster. For standalone, only client is 
> > supported. What this means is that the "submitting process" will be the 
> > driver process (not to be confused with "master"). It should very well be 
> > possible to submit from you laptop to a standalone cluster, but the process 
> > running spark-submit will be alive until the job finishes. If you terminate 
> > the process (via kill-9 or otherwise), then the job will be terminated as 
> > well. The driver process will submit the work to the spark master, which 
> > will do the usually divvying up of tasks, distribution, fault tolerance, 
> > etc. and the results will get reported back to the driver process. 
> > Often it's not possible to have arbitrary access to the spark master, and 
> > if jobs take hours to complete, it's not feasible to have the process 
> > running on the laptop without interruptions, disconnects, etc. As such, a 
> > "gateway" machine is used closer to the spark master that's used to submit 
> > jobs from. That way, the process on the gateway machine lives for the 
> > duration of the job, and no connection from the laptop, etc. is needed. 
> > It's not uncommon to actually have an api to the gateway machine. For 
> > example, Ooyala's job server https://github.com/ooyala/spark-jobserver 
> > provides a restful interface to submit jobs.
> > Does that help?
> > Regards,Ashic.
> > Date: Fri, 14 Nov 2014 13:40:43 -0600
> > Subject: Submitting Python Applications from Remote to Master
> > From: quasi...@gmail.com
> > To: user@spark.apache.org
> > 
> > Hi All,
> > I'm not quite clear on whether submitting a python application to spark 
> > standalone on ec2 is possible. 
> > Am I reading this correctly:
> > *A common deployment strategy is to submit your application from a gateway 
> > machine that is physically co-located with your worker machines (e.g. 
> > Master node in a standalone EC2 cluster). In this setup, client mode is 
> > appropriate. In client mode, the driver is launched directly within the 
> > client spark-submit process, with the input and output of the application 
> > attached to the console. Thus, this mode is especially suitable for 
> > applications that involve the REPL (e.g. Spark shell).Alternatively, if 
> > your application is submitted from a machine far from the worker machines 
> > (e.g. locally on your laptop), it is common to usecluster mode to minimize 
> > network latency between the drivers and the executors. Note that cluster 
> > mode is currently not supported for standalone clusters, Mesos clusters, or 
> > python applications.
> > So I shouldn't be able to do something like:./bin/spark-submit  --master 
> > spark:/x.compute-1.amazonaws.com:7077  examples/src/main/python/pi.py 
> > From a laptop connecting to a previously launched spark cluster using the 
> > default spark-ec2 script, correct?
> > If I am not mistaken about this then docs are slightly confusing -- the 
> > above example is more or less the example here: 
> > https://spark.apache.org/docs/1.1.0/submitting-applications.html
> > If I am mistaken, apologies, can you help me figure out where I went 
> > wrong?I've also taken to opening port 7077 to 0.0.0.0/0
> > --Ben
> >

Re: Submitting Python Applications from Remote to Master

2014-11-15 Thread Ognen Duzlevski
Ashic,

Thanks for your email.

Two things:

1. I think a whole lot of data scientists and other people would love
it if they could just fire off jobs from their laptops. It is, in my
opinion, a common desired use case.

2. Did anyone actually get the Ooyala job server to work? I asked that
question 6 months ago and never got a straight answer. I ended up
writing a middle-layer using Scalatra and actors to submit jobs via an
API and receive results back in JSON. In that I ran into the inability
to share the SparkContext "feature" and it took a lot of finagling to
make things work (but it never felt "production ready").

Ognen

On Sat, Nov 15, 2014 at 03:36:43PM +, Ashic Mahtab wrote:
> Hi Ben,I haven't tried it with Python, but the instructions are the same as 
> for Scala compiled (jar) apps. What it's saying is that it's not possible to 
> offload the entire work to the master (ala hadoop) in a fire and forget (or 
> rather submit-and-forget) manner when running on stand alone. There are two 
> deployment modes - client and cluster. For standalone, only client is 
> supported. What this means is that the "submitting process" will be the 
> driver process (not to be confused with "master"). It should very well be 
> possible to submit from you laptop to a standalone cluster, but the process 
> running spark-submit will be alive until the job finishes. If you terminate 
> the process (via kill-9 or otherwise), then the job will be terminated as 
> well. The driver process will submit the work to the spark master, which will 
> do the usually divvying up of tasks, distribution, fault tolerance, etc. and 
> the results will get reported back to the driver process. 
> Often it's not possible to have arbitrary access to the spark master, and if 
> jobs take hours to complete, it's not feasible to have the process running on 
> the laptop without interruptions, disconnects, etc. As such, a "gateway" 
> machine is used closer to the spark master that's used to submit jobs from. 
> That way, the process on the gateway machine lives for the duration of the 
> job, and no connection from the laptop, etc. is needed. It's not uncommon to 
> actually have an api to the gateway machine. For example, Ooyala's job server 
> https://github.com/ooyala/spark-jobserver provides a restful interface to 
> submit jobs.
> Does that help?
> Regards,Ashic.
> Date: Fri, 14 Nov 2014 13:40:43 -0600
> Subject: Submitting Python Applications from Remote to Master
> From: quasi...@gmail.com
> To: user@spark.apache.org
> 
> Hi All,
> I'm not quite clear on whether submitting a python application to spark 
> standalone on ec2 is possible. 
> Am I reading this correctly:
> *A common deployment strategy is to submit your application from a gateway 
> machine that is physically co-located with your worker machines (e.g. Master 
> node in a standalone EC2 cluster). In this setup, client mode is appropriate. 
> In client mode, the driver is launched directly within the client 
> spark-submit process, with the input and output of the application attached 
> to the console. Thus, this mode is especially suitable for applications that 
> involve the REPL (e.g. Spark shell).Alternatively, if your application is 
> submitted from a machine far from the worker machines (e.g. locally on your 
> laptop), it is common to usecluster mode to minimize network latency between 
> the drivers and the executors. Note that cluster mode is currently not 
> supported for standalone clusters, Mesos clusters, or python applications.
> So I shouldn't be able to do something like:./bin/spark-submit  --master 
> spark:/x.compute-1.amazonaws.com:7077  examples/src/main/python/pi.py 
> From a laptop connecting to a previously launched spark cluster using the 
> default spark-ec2 script, correct?
> If I am not mistaken about this then docs are slightly confusing -- the above 
> example is more or less the example here: 
> https://spark.apache.org/docs/1.1.0/submitting-applications.html
> If I am mistaken, apologies, can you help me figure out where I went 
> wrong?I've also taken to opening port 7077 to 0.0.0.0/0
> --Ben
> 
> 
> 

-- 
"Convictions are more dangerous enemies of truth than lies." - Friedrich 
Nietzsche




RE: Submitting Python Applications from Remote to Master

2014-11-15 Thread Ashic Mahtab
Hi Ben,

I haven't tried it with Python, but the instructions are the same as for
Scala compiled (jar) apps. What it's saying is that it's not possible to
offload the entire work to the master (ala hadoop) in a fire and forget (or
rather submit-and-forget) manner when running on stand alone. There are two
deployment modes - client and cluster. For standalone, only client is
supported. What this means is that the "submitting process" will be the driver
process (not to be confused with "master"). It should very well be possible to
submit from your laptop to a standalone cluster, but the process running
spark-submit will be alive until the job finishes. If you terminate the process
(via kill -9 or otherwise), then the job will be terminated as well. The driver
process will submit the work to the spark master, which will do the usual
divvying up of tasks, distribution, fault tolerance, etc., and the results will
get reported back to the driver process.

Often it's not possible to have arbitrary access to the spark master, and if
jobs take hours to complete, it's not feasible to have the process running on
the laptop without interruptions, disconnects, etc. As such, a "gateway"
machine is used closer to the spark master that's used to submit jobs from.
That way, the process on the gateway machine lives for the duration of the job,
and no connection from the laptop, etc. is needed. It's not uncommon to
actually have an API to the gateway machine. For example, Ooyala's job server
https://github.com/ooyala/spark-jobserver provides a restful interface to
submit jobs.

Does that help?

Regards,
Ashic.
Date: Fri, 14 Nov 2014 13:40:43 -0600
Subject: Submitting Python Applications from Remote to Master
From: quasi...@gmail.com
To: user@spark.apache.org

Hi All,

I'm not quite clear on whether submitting a python application to spark
standalone on ec2 is possible.

Am I reading this correctly:

"A common deployment strategy is to submit your application from a gateway
machine that is physically co-located with your worker machines (e.g. Master
node in a standalone EC2 cluster). In this setup, client mode is appropriate.
In client mode, the driver is launched directly within the client spark-submit
process, with the input and output of the application attached to the console.
Thus, this mode is especially suitable for applications that involve the REPL
(e.g. Spark shell). Alternatively, if your application is submitted from a
machine far from the worker machines (e.g. locally on your laptop), it is
common to use cluster mode to minimize network latency between the drivers and
the executors. Note that cluster mode is currently not supported for standalone
clusters, Mesos clusters, or python applications."

So I shouldn't be able to do something like:

./bin/spark-submit  --master spark:/x.compute-1.amazonaws.com:7077  examples/src/main/python/pi.py

From a laptop connecting to a previously launched spark cluster using the
default spark-ec2 script, correct?

If I am not mistaken about this then the docs are slightly confusing -- the above
example is more or less the example here:
https://spark.apache.org/docs/1.1.0/submitting-applications.html

If I am mistaken, apologies, can you help me figure out where I went wrong?
I've also taken to opening port 7077 to 0.0.0.0/0

--Ben


  

Re: saveAsTextFile error

2014-11-15 Thread Prannoy
Hi Niko, 

Have you tried running it while keeping the wordCounts.print()? Possibly the
import of the package *org.apache.spark.streaming._* is missing, so during
sbt package it is unable to locate the saveAsTextFile API.

Go to
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala
to check if all the needed packages are there. 
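
For reference only, here is a minimal self-contained sketch along the lines of
the linked NetworkWordCount example; the host, port and output path are
placeholders, and this is not Niko's actual code. Note that on a DStream the
save method is saveAsTextFiles (plural), and that without the
StreamingContext._ import the pair-DStream operations such as reduceByKey will
not resolve.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // brings in the pair-DStream implicits

object SaveWordCounts {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    wordCounts.print()
    wordCounts.saveAsTextFiles("hdfs:///tmp/wordCounts") // one output directory per batch; path is a placeholder
    ssc.start()
    ssc.awaitTermination()
  }
}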

Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-error-tp18960p19006.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: SparkSQL exception on cached parquet table

2014-11-15 Thread Cheng Lian

Hi Sadhan,

Could you please provide the stack trace of the
ArrayIndexOutOfBoundsException (if any)? The reason why the first
query succeeds is that Spark SQL doesn’t bother reading all data from
the table to give COUNT(*). In the second case, however, the whole
table is asked to be cached lazily via the cacheTable call, thus it’s
scanned to build the in-memory columnar cache. Then something went wrong
while scanning this LZO compressed Parquet file. But unfortunately the
stack trace at hand doesn’t indicate the root cause.


Cheng

On 11/15/14 5:28 AM, Sadhan Sood wrote:

While testing SparkSQL on a bunch of parquet files (basically used to 
be a partition for one of our hive tables), I encountered this error:


import org.apache.spark.sql.SchemaRDD
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val parquetFileRDD = sqlContext.parquetFile(parquetFile)
parquetFileRDD.registerTempTable("xyz_20141109")
sqlContext.sql("SELECT count(*)  FROM xyz_20141109").collect() <-- 
works fine

sqlContext.cacheTable("xyz_20141109")
sqlContext.sql("SELECT count(*)  FROM xyz_20141109").collect() <-- 
fails with an exception


parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://::9000/event_logs/xyz/20141109/part-9359b87ae-a949-3ded-ac3e-3a6bda3a4f3a-r-9.lzo.parquet
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
at org.apache.spark.rdd.NewHadoopRDD$anon$1.hasNext(NewHadoopRDD.scala:145)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$anon$14.hasNext(Iterator.scala:388)
at org.apache.spark.sql.columnar.InMemoryRelation$anonfun$3$anon$1.hasNext(InMemoryColumnarTableScan.scala:136)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:248)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException


repartition combined with zipWithIndex get stuck

2014-11-15 Thread lev
Hi,

I'm having trouble using both zipWithIndex and repartition. When I use them
both, the following action will get stuck and won't return.
I'm using spark 1.1.0.


Those 2 lines work as expected:

scala> sc.parallelize(1 to 10).repartition(10).count()
res0: Long = 10

scala> sc.parallelize(1 to 10).zipWithIndex.count()
res1: Long = 10


But this statement gets stuck and doesn't return:

scala> sc.parallelize(1 to 10).zipWithIndex.repartition(10).count()
14/11/15 03:18:55 INFO spark.SparkContext: Starting job: apply at
Option.scala:120
14/11/15 03:18:55 INFO scheduler.DAGScheduler: Got job 3 (apply at
Option.scala:120) with 3 output partitions (allowLocal=false)
14/11/15 03:18:55 INFO scheduler.DAGScheduler: Final stage: Stage 4(apply at
Option.scala:120)
14/11/15 03:18:55 INFO scheduler.DAGScheduler: Parents of final stage:
List()
14/11/15 03:18:55 INFO scheduler.DAGScheduler: Missing parents: List()
14/11/15 03:18:55 INFO scheduler.DAGScheduler: Submitting Stage 4
(ParallelCollectionRDD[7] at parallelize at <console>:13), which has no
missing parents
14/11/15 03:18:55 INFO storage.MemoryStore: ensureFreeSpace(1096) called
with curMem=7616, maxMem=138938941
14/11/15 03:18:55 INFO storage.MemoryStore: Block broadcast_4 stored as
values in memory (estimated size 1096.0 B, free 132.5 MB)


Am I doing something wrong here or is it a bug?
Is there some workaround?

Thanks,
Lev.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/repartition-combined-with-zipWithIndex-get-stuck-tp18999.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Help with Spark Streaming

2014-11-15 Thread Bahubali Jain
Hi,
I am trying to use Spark Streaming, but I am struggling with word count :(
I want to consolidate the output of the word count (not on a per-window basis),
so I am using updateStateByKey(), but for some reason this is not working.
The function itself is not being invoked (I do not see the sysout output on the
console).


public final class WordCount {
  private static final Pattern SPACE = Pattern.compile(" ");

  public static void main(String[] args) {
    if (args.length < 2) {
      System.err.println("Usage: JavaNetworkWordCount <hostname> <port>");
      System.exit(1);
    }

    // Create the context with a 1 second batch size
    SparkConf sparkConf = new SparkConf().setAppName("JavaNetworkWordCount");
    JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, new Duration(1000));
    ssc.checkpoint("/tmp/worcount");
    // Create a JavaReceiverInputDStream on target ip:port and count the
    // words in input stream of \n delimited text (eg. generated by 'nc')
    // Note that no duplication in storage level only for running locally.
    // Replication necessary in distributed scenario for fault tolerance.
    JavaReceiverInputDStream<String> lines = ssc.socketTextStream(
        args[0], Integer.parseInt(args[1]), StorageLevels.MEMORY_AND_DISK_SER);
    JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
      @Override
      public Iterable<String> call(String x) {
        return Lists.newArrayList(SPACE.split(x));
      }
    });

    JavaPairDStream<String, Integer> wordCounts = words.mapToPair(
      new PairFunction<String, String, Integer>() {
        @Override
        public Tuple2<String, Integer> call(String s) {
          System.err.println("Got " + s);
          return new Tuple2<String, Integer>(s, 1);
        }
      }).reduceByKey(new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer i1, Integer i2) {
          return i1 + i2;
        }
      });

    wordCounts.print();

    *wordCounts.updateStateByKey(new updateFunction());*
    ssc.start();
    ssc.awaitTermination();
  }
}

class updateFunction implements Function2<List<Integer>, Optional<Integer>, Optional<Integer>> {

  @Override
  public Optional<Integer> call(List<Integer> values, Optional<Integer> state) {
    Integer x = new Integer(0);
    for (Integer i : values)
      x = x + i;
    Integer newSum = state.or(0) + x;  // add the new values with the previous running count to get the new count
    System.out.println("Newsum is " + newSum);
    return Optional.of(newSum);
  };
}
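
For comparison, here is a minimal sketch (in Scala, and only a guess at the
intent) of how updateStateByKey is usually wired up: the call returns a new
DStream, and it is that returned stream, not the original wordCounts, that
needs an output operation for the update function to ever be invoked.

// Sketch only; `wordCounts` stands for the DStream[(String, Int)] built in the example above.
val runningCounts = wordCounts.updateStateByKey[Int] { (values: Seq[Int], state: Option[Int]) =>
  Some(values.sum + state.getOrElse(0))
}
runningCounts.print() // without an output operation on the returned stream, the update function never runs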