[jira] [Commented] (SPARK-6772) spark sql error when running code on large number of records

2015-04-08 Thread Aditya Parmar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486853#comment-14486853
 ] 

Aditya Parmar commented on SPARK-6772:
--

Please find the code below

{code}
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.api.java.DataType;
import org.apache.spark.sql.api.java.JavaSchemaRDD;
import org.apache.spark.sql.api.java.Row;
import org.apache.spark.sql.api.java.StructField;
import org.apache.spark.sql.api.java.StructType;
import org.apache.spark.sql.hive.api.java.JavaHiveContext;

public class engineshow {

  public static void main(String[] args) {

    SparkConf conf = new SparkConf().setAppName("Engine");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaHiveContext hContext = new JavaHiveContext(sc);

    String sch;
    List<StructField> fields;
    StructType schema;
    JavaRDD<Row> rowRDD;
    JavaRDD<String> input;

    JavaSchemaRDD[] inputs = new JavaSchemaRDD[2];
    sch = "a b c d e f g h i"; // input file schema
    input = sc.textFile("/home/aditya/stocks1.csv");
    fields = new ArrayList<StructField>();
    for (String fieldName : sch.split(" ")) {
      fields.add(DataType.createStructField(fieldName, DataType.StringType, true));
    }
    schema = DataType.createStructType(fields);

    // Convert each CSV line into a Row; every field is kept as a String.
    rowRDD = input.map(new Function<String, Row>() {
      public Row call(String record) throws Exception {
        String[] fields = record.split(",");
        Object[] fields_converted = fields;
        return Row.create(fields_converted);
      }
    });

    inputs[0] = hContext.applySchema(rowRDD, schema);
    inputs[0].registerTempTable("comp1");

    sch = "a b";
    fields = new ArrayList<StructField>();
    for (String fieldName : sch.split(" ")) {
      fields.add(DataType.createStructField(fieldName, DataType.StringType, true));
    }
    schema = DataType.createStructType(fields);

    inputs[1] = hContext.sql("select a,b from comp1");
    inputs[1].saveAsTextFile("/home/aditya/outputog");
  }
}
{code}
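
A likely cause of the ArrayIndexOutOfBoundsException reported below is that some
of the ~250,000 input lines split into fewer than the nine declared columns (for
example blank or malformed lines, or trailing empty fields that
{{String.split(",")}} drops), so the resulting Rows are shorter than the schema
and the projection of columns a and b fails at runtime. A minimal Scala sketch of
a guard, purely hypothetical and not part of the original report:
{code}
// Hypothetical sketch (not from the original report): normalize each CSV line
// to exactly the number of columns the schema declares, so no Row ends up
// shorter than the schema. split(",", -1) keeps trailing empty fields.
object CsvGuard {
  val expectedCols = 9

  def normalize(line: String): Array[String] =
    line.split(",", -1).padTo(expectedCols, "").take(expectedCols)

  def main(args: Array[String]): Unit =
    // A short line is padded to 9 fields instead of producing a short Row.
    println(normalize("x,y,z").mkString("|"))   // prints x|y|z||||||
}
{code}
Applying the same normalization inside the {{call}} method above (or filtering
out lines with the wrong column count) keeps the data consistent with the schema.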


> spark sql error when running code on large number of records
> 
>
> Key: SPARK-6772
> URL: https://issues.apache.org/jira/browse/SPARK-6772
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Aditya Parmar
>
> Hi all,
> I am getting an ArrayIndexOutOfBoundsException when I try to run a simple
> column-filtering query on a file with 2.5 lakh (250,000) records. It runs fine
> on a file with 2k records.
> {code}
> 15/04/08 16:54:06 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 3, 
> blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1350 bytes)
> 15/04/08 16:54:06 INFO TaskSetManager: Lost task 1.1 in stage 0.0 (TID 2) on 
> executor blrwfl11189.igatecorp.com: java.lang.ArrayIndexOutOfBoundsException 
> (null) [duplicate 1]
> 15/04/08 16:54:06 INFO TaskSetManager: Starting task 1.2 in stage 0.0 (TID 4, 
> blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1350 bytes)
> 15/04/08 16:54:06 INFO TaskSetManager: Lost task 1.2 in stage 0.0 (TID 4) on 
> executor blrwfl11189.igatecorp.com: java.lang.ArrayIndexOutOfBoundsException 
> (null) [duplicate 2]
> 15/04/08 16:54:06 INFO TaskSetManager: Starting task 1.3 in stage 0.0 (TID 5, 
> blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1350 bytes)
> 15/04/08 16:54:06 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 3) on 
> executor blrwfl11189.igatecorp.com: java.lang.ArrayIndexOutOfBoundsException 
> (null) [duplicate 3]
> 15/04/08 16:54:06 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 6, 
> blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1350 bytes)
> 15/04/08 16:54:06 INFO TaskSetManager: Lost task 1.3 in stage 0.0 (TID 5) on 
> executor blrwfl11189.igatecorp.com: java.lang.ArrayIndexOutOfBoundsException 
> (null) [duplicate 4]
> 15/04/08 16:54:06 ERROR TaskSetManager: Task 1 in stage 0.0 failed 4 times; 
> aborting job
> 15/04/08 16:54:06 INFO TaskSchedulerImpl: Cancelling stage 0
> 15/04/08 16:54:06 INFO TaskSchedulerImpl: Stage 0 was cancelled
> 15/04/08 16:54:06 INFO DAGScheduler: Job 0 failed: saveAsTextFile at 
> JavaSchemaRDD.scala:42, took 1.914477 s
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure:

[jira] [Created] (SPARK-6799) Add dataframe examples for SparkR

2015-04-08 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-6799:


 Summary: Add dataframe examples for SparkR
 Key: SPARK-6799
 URL: https://issues.apache.org/jira/browse/SPARK-6799
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Reporter: Shivaram Venkataraman
Priority: Critical


We should add more DataFrame usage examples for SparkR. These can be similar
to the Python examples at
https://github.com/apache/spark/blob/1b2aab8d5b9cc2ff702506038bd71aa8debe7ca0/examples/src/main/python/sql.py






[jira] [Commented] (SPARK-3937) Unsafe memory access inside of Snappy library

2015-04-08 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486810#comment-14486810
 ] 

Guoqiang Li commented on SPARK-3937:


The bug seems to be caused by {{spark.storage.memoryFraction 0.2}}; with
{{spark.storage.memoryFraction 0.4}} the bug does not appear. This may be
related to the size of the RDD.
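
As a possible workaround while the root cause is investigated, the fraction can
be raised when building the SparkConf. A hypothetical sketch, mirroring the
observation above rather than a confirmed fix:
{code}
// Hypothetical sketch: raise spark.storage.memoryFraction from 0.2 to 0.4,
// since 0.2 reportedly triggers the fault on this workload.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("snappy-workaround")
  .set("spark.storage.memoryFraction", "0.4")
val sc = new SparkContext(conf)
{code}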



> Unsafe memory access inside of Snappy library
> -
>
> Key: SPARK-3937
> URL: https://issues.apache.org/jira/browse/SPARK-3937
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Patrick Wendell
>
> This was observed on master between Spark 1.1 and 1.2. Unfortunately I don't 
> have much information about this other than the stack trace. However, it was 
> concerning enough I figured I should post it.
> {code}
> java.lang.InternalError: a fault occurred in a recent unsafe memory access 
> operation in compiled Java code
> org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
> org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444)
> org.xerial.snappy.Snappy.uncompress(Snappy.java:480)
> 
> org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:355)
> 
> org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:159)
> org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142)
> 
> java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310)
> 
> java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2712)
> 
> java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2742)
> java.io.ObjectInputStream.readArray(ObjectInputStream.java:1687)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
> java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
> 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
> 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
> org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
> 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
> scala.collection.Iterator$class.foreach(Iterator.scala:727)
> scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> scala.collection.AbstractIterator.to(Iterator.scala:1157)
> 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> 
> org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140)
> 
> org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140)
> 
> org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118)
> 
> org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> org.apache.spark.scheduler.Task.run(Task.scala:56)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)

[jira] [Commented] (SPARK-6770) DirectKafkaInputDStream has not been initialized when recovery from checkpoint

2015-04-08 Thread yangping wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486794#comment-14486794
 ] 

yangping wu commented on SPARK-6770:


OK, thank you very much for your reply. I will try to use a pure Spark Streaming
program and plain Scala JDBC to write the data to MySQL.

> DirectKafkaInputDStream has not been initialized when recovery from checkpoint
> --
>
> Key: SPARK-6770
> URL: https://issues.apache.org/jira/browse/SPARK-6770
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: yangping wu
>
> I read data from Kafka using the createDirectStream method and save the
> received log to MySQL; the code snippet is as follows:
> {code}
> def functionToCreateContext(): StreamingContext = {
>   val sparkConf = new SparkConf()
>   val sc = new SparkContext(sparkConf)
>   val ssc = new StreamingContext(sc, Seconds(10))
>   ssc.checkpoint("/tmp/kafka/channel/offset") // set checkpoint directory
>   ssc
> }
> val struct = StructType(StructField("log", StringType) ::Nil)
> // Get StreamingContext from checkpoint data or create a new one
> val ssc = StreamingContext.getOrCreate("/tmp/kafka/channel/offset", 
> functionToCreateContext)
> val SDB = KafkaUtils.createDirectStream[String, String, StringDecoder, 
> StringDecoder](ssc, kafkaParams, topics)
> val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext)
> SDB.foreachRDD(rdd => {
>   val result = rdd.map(item => {
> println(item)
> val result = item._2 match {
>   case e: String => Row.apply(e)
>   case _ => Row.apply("")
> }
> result
>   })
>   println(result.count())
>   val df = sqlContext.createDataFrame(result, struct)
>   df.insertIntoJDBC(url, "test", overwrite = false)
> })
> ssc.start()
> ssc.awaitTermination()
> ssc.stop()
> {code}
> But when I recovered the program from the checkpoint, I encountered an exception:
> {code}
> Exception in thread "main" org.apache.spark.SparkException: 
> org.apache.spark.streaming.kafka.DirectKafkaInputDStream@41a80e5a has not 
> been initialized
>   at 
> org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:266)
>   at 
> org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:51)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
>   at scala.Option.orElse(Option.scala:257)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
>   at 
> org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:223)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:218)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:218)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:89)
>   at 
> org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:67)
>   at 
> org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:512)
>   at logstatstreaming.UserChannelTodb$.main(UserChannelTodb.scala:57)
>   at logstatstreaming.UserChannelTodb.main(UserChannelTodb.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethod

[jira] [Commented] (SPARK-6770) DirectKafkaInputDStream has not been initialized when recovery from checkpoint

2015-04-08 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486782#comment-14486782
 ] 

Saisai Shao commented on SPARK-6770:


From my understanding, the basic scenario of your code is to put the Kafka data
into a database using JDBC, and you want to leverage Spark SQL for an easy
implementation. If you want to use the checkpoint file to recover from a driver
failure, it would be better to write a pure Spark Streaming program: Spark
Streaming's checkpointing mechanism only guarantees that streaming-related
metadata is written and recovered. The more you rely on third-party tools, the
less the current mechanism can recover.
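
For reference, a minimal Scala sketch of that approach, writing each partition to
MySQL with plain JDBC inside {{foreachRDD}} so that no SQL-related state has to be
recovered from the checkpoint. This is hypothetical and not from this thread; the
JDBC URL and table name are placeholders:
{code}
// Hypothetical sketch: persist Kafka records with plain JDBC inside foreachRDD,
// so only streaming metadata needs to survive a checkpoint recovery.
import java.sql.DriverManager
import org.apache.spark.streaming.dstream.DStream

def saveToMysql(stream: DStream[(String, String)], jdbcUrl: String): Unit = {
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      // One connection per partition; executors simply re-open it after a restart.
      val conn = DriverManager.getConnection(jdbcUrl)
      val stmt = conn.prepareStatement("INSERT INTO test (log) VALUES (?)")
      try {
        partition.foreach { case (_, value) =>
          stmt.setString(1, value)
          stmt.executeUpdate()
        }
      } finally {
        stmt.close()
        conn.close()
      }
    }
  }
}
{code}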

> DirectKafkaInputDStream has not been initialized when recovery from checkpoint
> --
>
> Key: SPARK-6770
> URL: https://issues.apache.org/jira/browse/SPARK-6770
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: yangping wu
>
> I read data from Kafka using the createDirectStream method and save the
> received log to MySQL; the code snippet is as follows:
> {code}
> def functionToCreateContext(): StreamingContext = {
>   val sparkConf = new SparkConf()
>   val sc = new SparkContext(sparkConf)
>   val ssc = new StreamingContext(sc, Seconds(10))
>   ssc.checkpoint("/tmp/kafka/channel/offset") // set checkpoint directory
>   ssc
> }
> val struct = StructType(StructField("log", StringType) ::Nil)
> // Get StreamingContext from checkpoint data or create a new one
> val ssc = StreamingContext.getOrCreate("/tmp/kafka/channel/offset", 
> functionToCreateContext)
> val SDB = KafkaUtils.createDirectStream[String, String, StringDecoder, 
> StringDecoder](ssc, kafkaParams, topics)
> val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext)
> SDB.foreachRDD(rdd => {
>   val result = rdd.map(item => {
> println(item)
> val result = item._2 match {
>   case e: String => Row.apply(e)
>   case _ => Row.apply("")
> }
> result
>   })
>   println(result.count())
>   val df = sqlContext.createDataFrame(result, struct)
>   df.insertIntoJDBC(url, "test", overwrite = false)
> })
> ssc.start()
> ssc.awaitTermination()
> ssc.stop()
> {code}
> But when I recovered the program from the checkpoint, I encountered an exception:
> {code}
> Exception in thread "main" org.apache.spark.SparkException: 
> org.apache.spark.streaming.kafka.DirectKafkaInputDStream@41a80e5a has not 
> been initialized
>   at 
> org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:266)
>   at 
> org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:51)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
>   at scala.Option.orElse(Option.scala:257)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
>   at 
> org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:223)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:218)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:218)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:89)
>   at 
> org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:67)
>   at 
> org.apache.spark.streaming.StreamingContext.start(S

[jira] [Commented] (SPARK-6772) spark sql error when running code on large number of records

2015-04-08 Thread Aditya Parmar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486771#comment-14486771
 ] 

Aditya Parmar commented on SPARK-6772:
--

Hi Michael,

The same code works fine for a CSV file with a thousand records.

> spark sql error when running code on large number of records
> 
>
> Key: SPARK-6772
> URL: https://issues.apache.org/jira/browse/SPARK-6772
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Aditya Parmar
>
> Hi all,
> I am getting an ArrayIndexOutOfBoundsException when I try to run a simple
> column-filtering query on a file with 2.5 lakh (250,000) records. It runs fine
> on a file with 2k records.
> {code}
> 15/04/08 16:54:06 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 3, 
> blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1350 bytes)
> 15/04/08 16:54:06 INFO TaskSetManager: Lost task 1.1 in stage 0.0 (TID 2) on 
> executor blrwfl11189.igatecorp.com: java.lang.ArrayIndexOutOfBoundsException 
> (null) [duplicate 1]
> 15/04/08 16:54:06 INFO TaskSetManager: Starting task 1.2 in stage 0.0 (TID 4, 
> blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1350 bytes)
> 15/04/08 16:54:06 INFO TaskSetManager: Lost task 1.2 in stage 0.0 (TID 4) on 
> executor blrwfl11189.igatecorp.com: java.lang.ArrayIndexOutOfBoundsException 
> (null) [duplicate 2]
> 15/04/08 16:54:06 INFO TaskSetManager: Starting task 1.3 in stage 0.0 (TID 5, 
> blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1350 bytes)
> 15/04/08 16:54:06 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 3) on 
> executor blrwfl11189.igatecorp.com: java.lang.ArrayIndexOutOfBoundsException 
> (null) [duplicate 3]
> 15/04/08 16:54:06 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 6, 
> blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1350 bytes)
> 15/04/08 16:54:06 INFO TaskSetManager: Lost task 1.3 in stage 0.0 (TID 5) on 
> executor blrwfl11189.igatecorp.com: java.lang.ArrayIndexOutOfBoundsException 
> (null) [duplicate 4]
> 15/04/08 16:54:06 ERROR TaskSetManager: Task 1 in stage 0.0 failed 4 times; 
> aborting job
> 15/04/08 16:54:06 INFO TaskSchedulerImpl: Cancelling stage 0
> 15/04/08 16:54:06 INFO TaskSchedulerImpl: Stage 0 was cancelled
> 15/04/08 16:54:06 INFO DAGScheduler: Job 0 failed: saveAsTextFile at 
> JavaSchemaRDD.scala:42, took 1.914477 s
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: 
> Lost task 1.3 in stage 0.0 (TID 5, blrwfl11189.igatecorp.com): 
> java.lang.ArrayIndexOutOfBoundsException
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
> at scala.Option.foreach(Option.scala:236)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
> at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
> at akka.actor.ActorCell.invoke(ActorCell.scala:487)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
> at akka.dispatch.Mailbox.run(Mailbox.scala:220)
> at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
> at 
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> [aditya@blrwfl11189 ~]$
>  {code}




[jira] [Commented] (SPARK-6704) integrate SparkR docs build tool into Spark doc build

2015-04-08 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486766#comment-14486766
 ] 

Shivaram Venkataraman commented on SPARK-6704:
--

This was actually fixed in https://github.com/apache/spark/pull/5096

> integrate SparkR docs build tool into Spark doc build
> -
>
> Key: SPARK-6704
> URL: https://issues.apache.org/jira/browse/SPARK-6704
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Davies Liu
>Priority: Blocker
> Fix For: 1.4.0
>
>
> We should integrate the SparkR docs build tool into the Spark one.






[jira] [Resolved] (SPARK-6704) integrate SparkR docs build tool into Spark doc build

2015-04-08 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-6704.
--
   Resolution: Fixed
Fix Version/s: 1.4.0
 Assignee: Shivaram Venkataraman

> integrate SparkR docs build tool into Spark doc build
> -
>
> Key: SPARK-6704
> URL: https://issues.apache.org/jira/browse/SPARK-6704
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Davies Liu
>Assignee: Shivaram Venkataraman
>Priority: Blocker
> Fix For: 1.4.0
>
>
> We should integrate the SparkR docs build tool into the Spark one.






[jira] [Created] (SPARK-6798) Fix Date serialization in SparkR

2015-04-08 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-6798:


 Summary: Fix Date serialization in SparkR
 Key: SPARK-6798
 URL: https://issues.apache.org/jira/browse/SPARK-6798
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Reporter: Shivaram Venkataraman
Assignee: Davies Liu


SparkR's date serialization currently sends strings from R to the JVM. We should
convert this to integers and also account for timezones correctly by using
DateUtils.
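
As an illustration of the idea (a hypothetical sketch, not the actual SparkR or
DateUtils implementation), a date can be encoded as the number of days since
1970-01-01 in UTC, so the value crossing the R/JVM boundary is a plain,
timezone-independent integer rather than a formatted string:
{code}
// Hypothetical sketch: encode a calendar date as days since the Unix epoch (UTC).
import java.util.{Calendar, TimeZone}

object DateAsInt {
  val MillisPerDay = 24L * 60 * 60 * 1000

  def daysSinceEpochUtc(year: Int, month: Int, day: Int): Int = {
    val cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"))
    cal.clear()
    cal.set(year, month - 1, day)        // Calendar months are 0-based
    (cal.getTimeInMillis / MillisPerDay).toInt
  }

  def main(args: Array[String]): Unit =
    println(daysSinceEpochUtc(1970, 1, 2)) // prints 1: one day after the epoch
}
{code}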






[jira] [Commented] (SPARK-3937) Unsafe memory access inside of Snappy library

2015-04-08 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486763#comment-14486763
 ] 

Guoqiang Li commented on SPARK-3937:


I have encountered this issue in Spark 1.3.

My spark-defaults.conf configuration is:
{code:none}
spark.akka.frameSize 20
spark.akka.askTimeout 120
spark.akka.timeout 120
spark.default.parallelism 72
spark.locality.wait 1
spark.storage.blockManagerTimeoutIntervalMs  600
#spark.yarn.max.executor.failures 100
spark.core.connection.ack.wait.timeout 360
spark.storage.memoryFraction 0.2
spark.broadcast.factory org.apache.spark.broadcast.TorrentBroadcastFactory
#spark.broadcast.blockSize 8192
spark.driver.maxResultSize  4000
#spark.shuffle.blockTransferService nio
#spark.akka.heartbeat.interval 100
spark.kryoserializer.buffer.max.mb 256
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator org.apache.spark.graphx.GraphKryoRegistrator
#spark.kryo.registrator org.apache.spark.mllib.clustering.LDAKryoRegistrator
{code}
{code:none}
java.lang.InternalError: a fault occurred in a recent unsafe memory access 
operation in compiled Java code
at org.xerial.snappy.SnappyNative.uncompressedLength(Native Method)
at org.xerial.snappy.Snappy.uncompressedLength(Snappy.java:594)
at 
org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:358)
at 
org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:167)
at org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:150)
at com.esotericsoftware.kryo.io.Input.fill(Input.java:140)
at com.esotericsoftware.kryo.io.Input.require(Input.java:169)
at com.esotericsoftware.kryo.io.Input.readInt(Input.java:337)
at 
com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:109)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721)
at 
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:138)
at 
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:202)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
{code}

> Unsafe memory access inside of Snappy library
> -
>
> Key: SPARK-3937
> URL: https://issues.apache.org/jira/browse/SPARK-3937
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Patrick Wendell
>
> This was observed on master between Spark 1.1 and 1.2. Unfortunately I don't 
> have much information about this other than the stack trace. However, it was 
> concerning enough I figured I should post it.
> {code}
> java.lang.InternalError: a fault occurred in a recent unsafe memory access 
> operation in compiled Java code
> org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
> org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444)
> org.xerial.snappy.Snappy.uncompress(Snappy.java:480)
> 
> org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:355)
> 
> org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:159)
> org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142)
> 
> java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310)
> 
> java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2712)
> 
> java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2742)
> java.io.ObjectInputStream.readArray(ObjectInputStream.java:1687)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
> java.io.Obje

[jira] [Comment Edited] (SPARK-6770) DirectKafkaInputDStream has not been initialized when recovery from checkpoint

2015-04-08 Thread yangping wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486736#comment-14486736
 ] 

yangping wu edited comment on SPARK-6770 at 4/9/15 5:51 AM:


Hi [~jerryshao], thank you for your reply. I've tried to put my streaming-related
logic into the function *functionToCreateContext*, as follows:
{code}
def functionToCreateContext() = {
  val sparkConf = new SparkConf().setAppName("channelAnalyser")
  val sc = new SparkContext(sparkConf)
  val ssc = new StreamingContext(sc, Seconds(10))
  ssc.checkpoint("/tmp/kafka/test/offset")
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  val test = Set("test")
  val struct = StructType(StructField("log", StringType) ::Nil)
  val kafkaParams = Map[String, String]("metadata.broker.list" -> 
"192.168.100.11:9092,192.168.100.12:9092,192.168.100.13:9092",
"group.id" -> "test-consumer-group111")
  val url = 
"jdbc:mysql://192.168.100.10:3306/spark?user=admin&password=123456&useUnicode=true&characterEncoding=utf8&autoReconnect=true"

  val SDB = KafkaUtils.createDirectStream[String, String, StringDecoder, 
StringDecoder](ssc, kafkaParams, test)

  SDB.foreachRDD(rdd => {
val result = rdd.map(item => {
item._2 match {
  case e: String => Row.apply(e)
  case _ => Row.apply("")
}
})
try {
  println(result.count())
  val df = sqlContext.createDataFrame(result, struct)
  df.insertIntoJDBC(url, "testTable", overwrite = false)
} catch {
  case e: Exception => e.printStackTrace()
}
  })

  ssc
}

val ssc = StreamingContext.getOrCreate("/tmp/kafka/test/offset", 
functionToCreateContext)
ssc.start()
{code}
But when I recovered the program from the checkpoint, I encountered an exception:
{code}
java.lang.NullPointerException
at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:217)
at 
org.apache.spark.sql.SQLConf.dataFrameEagerAnalysis(SQLConf.scala:191)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:381)
at 
logstatstreaming.FlightSearchTodb$$anonfun$logstatstreaming$FlightSearchTodb$$functionToCreateContext$1$1.apply(FlightSearchTodb.scala:57)
at 
logstatstreaming.FlightSearchTodb$$anonfun$logstatstreaming$FlightSearchTodb$$functionToCreateContext$1$1.apply(FlightSearchTodb.scala:40)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:534)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:534)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:42)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:176)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:176)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:176)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:175)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
{code}

It seems that the SQLContext has not been initialized, so *settings* is not
initialized in *org.apache.spark.sql.SQLConf*. Then
{code}
private[spark] def dataFrameEagerAnalysis: Boolean =
  getConf(DATAFRAME_EAGER_ANALYSIS, "true").toBoolean
{code}
throws a java.lang.NullPointerException.
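
A commonly used way around this (a hypothetical sketch, not a fix confirmed in
this thread) is to avoid capturing the SQLContext in the checkpointed closure and
instead obtain it lazily from the RDD's SparkContext inside {{foreachRDD}}, so a
fully initialized instance is rebuilt after recovery:
{code}
// Hypothetical sketch: lazily instantiated singleton SQLContext, created from
// the SparkContext of the incoming RDD rather than captured in the checkpoint.
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object SQLContextSingleton {
  @transient private var instance: SQLContext = _

  def getInstance(sc: SparkContext): SQLContext = synchronized {
    if (instance == null) instance = new SQLContext(sc)
    instance
  }
}

// Inside foreachRDD:
//   val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
//   val df = sqlContext.createDataFrame(result, struct)
{code}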



was (Author: 397090770):
Hi [~jerryshao],  Thank you for you reply. I've tried to put my streaming 
related logic into the function *functionToCreateContext*, as follow:
{code}
def functionToCreateContext() = {
  val sparkConf = new SparkConf().setAppName("channelAnalyser")
  val sc = new SparkContext(sparkConf)
  val ssc = new StreamingContext(sc, Seconds(10))
  ssc.checkpoint("/tmp/kafka/test/offset")
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  val test = Set("test")
  val struct = StructType(StructField("log", St

[jira] [Commented] (SPARK-6770) DirectKafkaInputDStream has not been initialized when recovery from checkpoint

2015-04-08 Thread yangping wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486760#comment-14486760
 ] 

yangping wu commented on SPARK-6770:


Yes, I think so. 
What other approach could I use to solve it?

> DirectKafkaInputDStream has not been initialized when recovery from checkpoint
> --
>
> Key: SPARK-6770
> URL: https://issues.apache.org/jira/browse/SPARK-6770
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: yangping wu
>
> I read data from Kafka using the createDirectStream method and save the
> received log to MySQL; the code snippet is as follows:
> {code}
> def functionToCreateContext(): StreamingContext = {
>   val sparkConf = new SparkConf()
>   val sc = new SparkContext(sparkConf)
>   val ssc = new StreamingContext(sc, Seconds(10))
>   ssc.checkpoint("/tmp/kafka/channel/offset") // set checkpoint directory
>   ssc
> }
> val struct = StructType(StructField("log", StringType) ::Nil)
> // Get StreamingContext from checkpoint data or create a new one
> val ssc = StreamingContext.getOrCreate("/tmp/kafka/channel/offset", 
> functionToCreateContext)
> val SDB = KafkaUtils.createDirectStream[String, String, StringDecoder, 
> StringDecoder](ssc, kafkaParams, topics)
> val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext)
> SDB.foreachRDD(rdd => {
>   val result = rdd.map(item => {
> println(item)
> val result = item._2 match {
>   case e: String => Row.apply(e)
>   case _ => Row.apply("")
> }
> result
>   })
>   println(result.count())
>   val df = sqlContext.createDataFrame(result, struct)
>   df.insertIntoJDBC(url, "test", overwrite = false)
> })
> ssc.start()
> ssc.awaitTermination()
> ssc.stop()
> {code}
> But when I recovered the program from the checkpoint, I encountered an exception:
> {code}
> Exception in thread "main" org.apache.spark.SparkException: 
> org.apache.spark.streaming.kafka.DirectKafkaInputDStream@41a80e5a has not 
> been initialized
>   at 
> org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:266)
>   at 
> org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:51)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
>   at scala.Option.orElse(Option.scala:257)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
>   at 
> org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:223)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:218)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:218)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:89)
>   at 
> org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:67)
>   at 
> org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:512)
>   at logstatstreaming.UserChannelTodb$.main(UserChannelTodb.scala:57)
>   at logstatstreaming.UserChannelTodb.main(UserChannelTodb.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.r

[jira] [Created] (SPARK-6797) Add support for YARN cluster mode

2015-04-08 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-6797:


 Summary: Add support for YARN cluster mode
 Key: SPARK-6797
 URL: https://issues.apache.org/jira/browse/SPARK-6797
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Reporter: Shivaram Venkataraman
Priority: Critical


SparkR currently does not work in YARN cluster mode as the R package is not 
shipped along with the assembly jar to the YARN AM. We could try to use the 
support for archives in YARN to send out the R package as a zip file.






[jira] [Assigned] (SPARK-5654) Integrate SparkR into Apache Spark

2015-04-08 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman reassigned SPARK-5654:


Assignee: Shivaram Venkataraman

> Integrate SparkR into Apache Spark
> --
>
> Key: SPARK-5654
> URL: https://issues.apache.org/jira/browse/SPARK-5654
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
> Fix For: 1.4.0
>
>
> The SparkR project [1] provides a light-weight frontend to launch Spark jobs 
> from R. The project was started at the AMPLab around a year ago and has been 
> incubated as its own project to make sure it can be easily merged into 
> upstream Spark, i.e. not introduce any external dependencies etc. SparkR’s 
> goals are similar to PySpark's, and it shares a similar design pattern, as described
> in our meetup talk [2] and Spark Summit presentation [3].
> Integrating SparkR into the Apache project will enable R users to use Spark 
> out of the box and given R’s large user base, it will help the Spark project 
> reach more users.  Additionally, work in progress features like providing R 
> integration with ML Pipelines and Dataframes can be better achieved by 
> development in a unified code base.
> SparkR is available under the Apache 2.0 License and does not have any 
> external dependencies other than requiring users to have R and Java installed 
> on their machines.  SparkR’s developers come from many organizations 
> including UC Berkeley, Alteryx, Intel and we will support future development, 
> maintenance after the integration.
> [1] https://github.com/amplab-extras/SparkR-pkg
> [2] http://files.meetup.com/3138542/SparkR-meetup.pdf
> [3] http://spark-summit.org/2014/talk/sparkr-interactive-r-programs-at-scale-2






[jira] [Commented] (SPARK-6770) DirectKafkaInputDStream has not been initialized when recovery from checkpoint

2015-04-08 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486754#comment-14486754
 ] 

Saisai Shao commented on SPARK-6770:


At first glance, I guess SQLContext may not fully support the streaming checkpoint
mechanism; it looks like SQLConf cannot be recovered from the checkpoint file.

> DirectKafkaInputDStream has not been initialized when recovery from checkpoint
> --
>
> Key: SPARK-6770
> URL: https://issues.apache.org/jira/browse/SPARK-6770
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: yangping wu
>
> I read data from Kafka using the createDirectStream method and save the
> received log to MySQL; the code snippet is as follows:
> {code}
> def functionToCreateContext(): StreamingContext = {
>   val sparkConf = new SparkConf()
>   val sc = new SparkContext(sparkConf)
>   val ssc = new StreamingContext(sc, Seconds(10))
>   ssc.checkpoint("/tmp/kafka/channel/offset") // set checkpoint directory
>   ssc
> }
> val struct = StructType(StructField("log", StringType) ::Nil)
> // Get StreamingContext from checkpoint data or create a new one
> val ssc = StreamingContext.getOrCreate("/tmp/kafka/channel/offset", 
> functionToCreateContext)
> val SDB = KafkaUtils.createDirectStream[String, String, StringDecoder, 
> StringDecoder](ssc, kafkaParams, topics)
> val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext)
> SDB.foreachRDD(rdd => {
>   val result = rdd.map(item => {
> println(item)
> val result = item._2 match {
>   case e: String => Row.apply(e)
>   case _ => Row.apply("")
> }
> result
>   })
>   println(result.count())
>   val df = sqlContext.createDataFrame(result, struct)
>   df.insertIntoJDBC(url, "test", overwrite = false)
> })
> ssc.start()
> ssc.awaitTermination()
> ssc.stop()
> {code}
> But when I recovered the program from the checkpoint, I encountered an exception:
> {code}
> Exception in thread "main" org.apache.spark.SparkException: 
> org.apache.spark.streaming.kafka.DirectKafkaInputDStream@41a80e5a has not 
> been initialized
>   at 
> org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:266)
>   at 
> org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:51)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
>   at scala.Option.orElse(Option.scala:257)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
>   at 
> org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:223)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:218)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:218)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:89)
>   at 
> org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:67)
>   at 
> org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:512)
>   at logstatstreaming.UserChannelTodb$.main(UserChannelTodb.scala:57)
>   at logstatstreaming.UserChannelTodb.main(UserChannelTodb.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun

[jira] [Resolved] (SPARK-5654) Integrate SparkR into Apache Spark

2015-04-08 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-5654.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5096
[https://github.com/apache/spark/pull/5096]

> Integrate SparkR into Apache Spark
> --
>
> Key: SPARK-5654
> URL: https://issues.apache.org/jira/browse/SPARK-5654
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Shivaram Venkataraman
> Fix For: 1.4.0
>
>
> The SparkR project [1] provides a light-weight frontend to launch Spark jobs 
> from R. The project was started at the AMPLab around a year ago and has been 
> incubated as its own project to make sure it can be easily merged into 
> upstream Spark, i.e. not introduce any external dependencies etc. SparkR’s 
> goals are similar to PySpark's, and it shares a similar design pattern, as described
> in our meetup talk [2] and Spark Summit presentation [3].
> Integrating SparkR into the Apache project will enable R users to use Spark 
> out of the box and given R’s large user base, it will help the Spark project 
> reach more users.  Additionally, work in progress features like providing R 
> integration with ML Pipelines and Dataframes can be better achieved by 
> development in a unified code base.
> SparkR is available under the Apache 2.0 License and does not have any 
> external dependencies other than requiring users to have R and Java installed 
> on their machines.  SparkR’s developers come from many organizations 
> including UC Berkeley, Alteryx, Intel and we will support future development, 
> maintenance after the integration.
> [1] https://github.com/amplab-extras/SparkR-pkg
> [2] http://files.meetup.com/3138542/SparkR-meetup.pdf
> [3] http://spark-summit.org/2014/talk/sparkr-interactive-r-programs-at-scale-2






[jira] [Comment Edited] (SPARK-6770) DirectKafkaInputDStream has not been initialized when recovery from checkpoint

2015-04-08 Thread yangping wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486736#comment-14486736
 ] 

yangping wu edited comment on SPARK-6770 at 4/9/15 5:36 AM:


Hi [~jerryshao], thank you for your reply. I've tried to put my streaming-related
logic into the function *functionToCreateContext*, as follows:
{code}
def functionToCreateContext() = {
  val sparkConf = new SparkConf().setAppName("channelAnalyser")
  val sc = new SparkContext(sparkConf)
  val ssc = new StreamingContext(sc, Seconds(10))
  ssc.checkpoint("/tmp/kafka/test/offset")
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  val test = Set("test")
  val struct = StructType(StructField("log", StringType) ::Nil)
  val kafkaParams = Map[String, String]("metadata.broker.list" -> 
"192.168.100.11:9092,192.168.100.12:9092,192.168.100.13:9092",
"group.id" -> "test-consumer-group111")
  val url = 
"jdbc:mysql://192.168.100.10:3306/spark?user=admin&password=123456&useUnicode=true&characterEncoding=utf8&autoReconnect=true"

  val SDB = KafkaUtils.createDirectStream[String, String, StringDecoder, 
StringDecoder](ssc, kafkaParams, test)

  SDB.foreachRDD(rdd => {
val result = rdd.map(item => {
item._2 match {
  case e: String => Row.apply(e)
  case _ => Row.apply("")
}
})
try {
  println(result.count())
  val df = sqlContext.createDataFrame(result, struct)
  df.insertIntoJDBC(url, "testTable", overwrite = false)
} catch {
  case e: Exception => e.printStackTrace()
}
  })

  ssc
}

val ssc = StreamingContext.getOrCreate("/tmp/kafka/test/offset", 
functionToCreateContext)
ssc.start()
{code}
But when I recovered the program from the checkpoint, I encountered an exception:
{code}
java.lang.NullPointerException
at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:217)
at 
org.apache.spark.sql.SQLConf.dataFrameEagerAnalysis(SQLConf.scala:191)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:381)
at 
logstatstreaming.FlightSearchTodb$$anonfun$logstatstreaming$FlightSearchTodb$$functionToCreateContext$1$1.apply(FlightSearchTodb.scala:57)
at 
logstatstreaming.FlightSearchTodb$$anonfun$logstatstreaming$FlightSearchTodb$$functionToCreateContext$1$1.apply(FlightSearchTodb.scala:40)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:534)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:534)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:42)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:176)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:176)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:176)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:175)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
{code}

It seems that the SQLContext has not been initialized, so *settings* is not
initialized in *org.apache.spark.sql.SQLConf*. Then
{code}
private[spark] def dataFrameEagerAnalysis: Boolean =
  getConf(DATAFRAME_EAGER_ANALYSIS, "true").toBoolean
{code}
throws a java.lang.NullPointerException.



was (Author: 397090770):
Hi Saisai Shao,  Thank you for you reply. I've tried to put my streaming 
related logic into the function functionToCreateContext, as follow:
{code}
def functionToCreateContext() = {
  val sparkConf = new SparkConf().setAppName("channelAnalyser")
  val sc = new SparkContext(sparkConf)
  val ssc = new StreamingContext(sc, Seconds(10))
  ssc.checkpoint("/tmp/kafka/test/offset")
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  val test = Set("test")
  val struct = StructType(StructField("log", String

[jira] [Commented] (SPARK-6770) DirectKafkaInputDStream has not been initialized when recovery from checkpoint

2015-04-08 Thread yangping wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486737#comment-14486737
 ] 

yangping wu commented on SPARK-6770:


Hi Saisai Shao, thank you for your reply. I've tried to put my streaming-related
logic into the function functionToCreateContext, as follows:
{code}
def functionToCreateContext() = {
  val sparkConf = new SparkConf().setAppName("channelAnalyser")
  val sc = new SparkContext(sparkConf)
  val ssc = new StreamingContext(sc, Seconds(10))
  ssc.checkpoint("/tmp/kafka/test/offset")
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  val test = Set("test")
  val struct = StructType(StructField("log", StringType) ::Nil)
  val kafkaParams = Map[String, String]("metadata.broker.list" -> 
"192.168.100.11:9092,192.168.100.12:9092,192.168.100.13:9092",
"group.id" -> "test-consumer-group111")
  val url = 
"jdbc:mysql://192.168.100.10:3306/spark?user=admin&password=123456&useUnicode=true&characterEncoding=utf8&autoReconnect=true"

  val SDB = KafkaUtils.createDirectStream[String, String, StringDecoder, 
StringDecoder](ssc, kafkaParams, test)

  SDB.foreachRDD(rdd => {
val result = rdd.map(item => {
item._2 match {
  case e: String => Row.apply(e)
  case _ => Row.apply("")
}
})
try {
  println(result.count())
  val df = sqlContext.createDataFrame(result, struct)
  df.insertIntoJDBC(url, "testTable", overwrite = false)
} catch {
  case e: Exception => e.printStackTrace()
}
  })

  ssc
}

val ssc = StreamingContext.getOrCreate("/tmp/kafka/test/offset", 
functionToCreateContext)
ssc.start()
{code}
But when I recovered the program from the checkpoint, I encountered an exception:
{code}
java.lang.NullPointerException
at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:217)
at 
org.apache.spark.sql.SQLConf.dataFrameEagerAnalysis(SQLConf.scala:191)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:381)
at 
logstatstreaming.FlightSearchTodb$$anonfun$logstatstreaming$FlightSearchTodb$$functionToCreateContext$1$1.apply(FlightSearchTodb.scala:57)
at 
logstatstreaming.FlightSearchTodb$$anonfun$logstatstreaming$FlightSearchTodb$$functionToCreateContext$1$1.apply(FlightSearchTodb.scala:40)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:534)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:534)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:42)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:176)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:176)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:176)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:175)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
{code}

It seems that the SQLContext has not been initialized, so the settings field 
is not initialized in org.apache.spark.sql.SQLConf. Then
{code}
private[spark] def dataFrameEagerAnalysis: Boolean =
getConf(DATAFRAME_EAGER_ANALYSIS, "true").toBoolean
{code}
throws a java.lang.NullPointerException.
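
A common way to avoid this kind of failure (a minimal sketch, not part of the original report; the singleton object name is illustrative, and it reuses SDB, struct and url from the snippet above) is to hold the SQLContext in a lazily initialized singleton and look it up from the RDD's SparkContext inside foreachRDD, so a fresh instance exists after the StreamingContext is restored from the checkpoint:
{code}
import org.apache.spark.SparkContext
import org.apache.spark.sql.{Row, SQLContext}

// Lazily initialized singleton, so a fresh SQLContext is created on the driver
// after the StreamingContext is restored from the checkpoint instead of relying
// on an instance captured in the checkpointed closure.
object SQLContextSingleton {
  @transient private var instance: SQLContext = _

  def getInstance(sparkContext: SparkContext): SQLContext = synchronized {
    if (instance == null) {
      instance = new SQLContext(sparkContext)
    }
    instance
  }
}

// Inside functionToCreateContext, reusing SDB, struct and url from the snippet above:
SDB.foreachRDD { rdd =>
  val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
  val result = rdd.map { case (_, v) => Row(if (v != null) v else "") }
  val df = sqlContext.createDataFrame(result, struct)
  df.insertIntoJDBC(url, "testTable", overwrite = false)
}
{code}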


> DirectKafkaInputDStream has not been initialized when recovery from checkpoint
> --
>
> Key: SPARK-6770
> URL: https://issues.apache.org/jira/browse/SPARK-6770
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: yangping wu
>
> I am reading data from Kafka using the createDirectStream method and saving the 
> received log to MySQL; the code snippet is as follows
> {code}
> def functionToCreateContext(): StreamingContext = {
>

[jira] [Commented] (SPARK-6770) DirectKafkaInputDStream has not been initialized when recovery from checkpoint

2015-04-08 Thread yangping wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486736#comment-14486736
 ] 

yangping wu commented on SPARK-6770:


Hi Saisai Shao, thank you for your reply. I've tried to put my streaming-related 
logic into the function functionToCreateContext, as follows:
{code}
def functionToCreateContext() = {
  val sparkConf = new SparkConf().setAppName("channelAnalyser")
  val sc = new SparkContext(sparkConf)
  val ssc = new StreamingContext(sc, Seconds(10))
  ssc.checkpoint("/tmp/kafka/test/offset")
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  val test = Set("test")
  val struct = StructType(StructField("log", StringType) ::Nil)
  val kafkaParams = Map[String, String]("metadata.broker.list" -> 
"192.168.100.11:9092,192.168.100.12:9092,192.168.100.13:9092",
"group.id" -> "test-consumer-group111")
  val url = 
"jdbc:mysql://192.168.100.10:3306/spark?user=admin&password=123456&useUnicode=true&characterEncoding=utf8&autoReconnect=true"

  val SDB = KafkaUtils.createDirectStream[String, String, StringDecoder, 
StringDecoder](ssc, kafkaParams, test)

  SDB.foreachRDD(rdd => {
val result = rdd.map(item => {
item._2 match {
  case e: String => Row.apply(e)
  case _ => Row.apply("")
}
})
try {
  println(result.count())
  val df = sqlContext.createDataFrame(result, struct)
  df.insertIntoJDBC(url, "testTable", overwrite = false)
} catch {
  case e: Exception => e.printStackTrace()
}
  })

  ssc
}

val ssc = StreamingContext.getOrCreate("/tmp/kafka/test/offset", 
functionToCreateContext)
ssc.start()
{code}
But when I recover the program from the checkpoint, I encounter an exception:
{code}
java.lang.NullPointerException
at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:217)
at 
org.apache.spark.sql.SQLConf.dataFrameEagerAnalysis(SQLConf.scala:191)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:381)
at 
logstatstreaming.FlightSearchTodb$$anonfun$logstatstreaming$FlightSearchTodb$$functionToCreateContext$1$1.apply(FlightSearchTodb.scala:57)
at 
logstatstreaming.FlightSearchTodb$$anonfun$logstatstreaming$FlightSearchTodb$$functionToCreateContext$1$1.apply(FlightSearchTodb.scala:40)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:534)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:534)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:42)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:176)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:176)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:176)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:175)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
{code}

It seems that the SQLContext has not been initialized, so the settings field 
is not initialized in org.apache.spark.sql.SQLConf. Then
{code}
private[spark] def dataFrameEagerAnalysis: Boolean =
getConf(DATAFRAME_EAGER_ANALYSIS, "true").toBoolean
{code}
throws a java.lang.NullPointerException.


> DirectKafkaInputDStream has not been initialized when recovery from checkpoint
> --
>
> Key: SPARK-6770
> URL: https://issues.apache.org/jira/browse/SPARK-6770
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: yangping wu
>
> I am reading data from Kafka using the createDirectStream method and saving the 
> received log to MySQL; the code snippet is as follows
> {code}
> def functionToCreateContext(): StreamingContext = {
>

[jira] [Issue Comment Deleted] (SPARK-6770) DirectKafkaInputDStream has not been initialized when recovery from checkpoint

2015-04-08 Thread yangping wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yangping wu updated SPARK-6770:
---
Comment: was deleted

(was: Hi Saisai Shao, thank you for your reply. I've tried to put my streaming-related 
logic into the function functionToCreateContext, as follows:
{code}
def functionToCreateContext() = {
  val sparkConf = new SparkConf().setAppName("channelAnalyser")
  val sc = new SparkContext(sparkConf)
  val ssc = new StreamingContext(sc, Seconds(10))
  ssc.checkpoint("/tmp/kafka/test/offset")
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  val test = Set("test")
  val struct = StructType(StructField("log", StringType) ::Nil)
  val kafkaParams = Map[String, String]("metadata.broker.list" -> 
"192.168.100.11:9092,192.168.100.12:9092,192.168.100.13:9092",
"group.id" -> "test-consumer-group111")
  val url = 
"jdbc:mysql://192.168.100.10:3306/spark?user=admin&password=123456&useUnicode=true&characterEncoding=utf8&autoReconnect=true"

  val SDB = KafkaUtils.createDirectStream[String, String, StringDecoder, 
StringDecoder](ssc, kafkaParams, test)

  SDB.foreachRDD(rdd => {
val result = rdd.map(item => {
item._2 match {
  case e: String => Row.apply(e)
  case _ => Row.apply("")
}
})
try {
  println(result.count())
  val df = sqlContext.createDataFrame(result, struct)
  df.insertIntoJDBC(url, "testTable", overwrite = false)
} catch {
  case e: Exception => e.printStackTrace()
}
  })

  ssc
}

val ssc = StreamingContext.getOrCreate("/tmp/kafka/test/offset", 
functionToCreateContext)
ssc.start()
{code}
But when I recover the program from the checkpoint, I encounter an exception:
{code}
java.lang.NullPointerException
at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:217)
at 
org.apache.spark.sql.SQLConf.dataFrameEagerAnalysis(SQLConf.scala:191)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:381)
at 
logstatstreaming.FlightSearchTodb$$anonfun$logstatstreaming$FlightSearchTodb$$functionToCreateContext$1$1.apply(FlightSearchTodb.scala:57)
at 
logstatstreaming.FlightSearchTodb$$anonfun$logstatstreaming$FlightSearchTodb$$functionToCreateContext$1$1.apply(FlightSearchTodb.scala:40)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:534)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:534)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:42)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:176)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:176)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:176)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:175)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
{code}

It seems that the SQLContext has not been initialized, so the settings field 
is not initialized in org.apache.spark.sql.SQLConf. Then
{code}
private[spark] def dataFrameEagerAnalysis: Boolean =
getConf(DATAFRAME_EAGER_ANALYSIS, "true").toBoolean
{code}
throws a java.lang.NullPointerException.
)

> DirectKafkaInputDStream has not been initialized when recovery from checkpoint
> --
>
> Key: SPARK-6770
> URL: https://issues.apache.org/jira/browse/SPARK-6770
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: yangping wu
>
> I am reading data from Kafka using the createDirectStream method and saving the 
> received log to MySQL; the code snippet is as follows
> {code}
> def functionToCreateContext(): StreamingContext = {
>   val sparkConf = new Sp

[jira] [Commented] (SPARK-6770) DirectKafkaInputDStream has not been initialized when recovery from checkpoint

2015-04-08 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486712#comment-14486712
 ] 

Saisai Shao commented on SPARK-6770:


I think you have to put your streaming-related logic into the function 
{{functionToCreateContext}}; you could refer to the related Spark Streaming 
example {{RecoverableNetworkWordCount}} to change your code.

I don't think this is a bug; please try again.
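
For reference, a minimal skeleton of that pattern (illustrative names and a socket source instead of Kafka; not the reporter's code): everything that defines DStreams and output operations lives inside the factory function, and only getOrCreate, start and awaitTermination stay outside:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(checkpointDir: String): StreamingContext = {
  val conf = new SparkConf().setAppName("RecoverableExample")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)

  // All DStream creation and output operations go here, so they can be
  // reconstructed from the checkpoint on recovery.
  val lines = ssc.socketTextStream("localhost", 9999)
  lines.foreachRDD(rdd => println(rdd.count()))

  ssc
}

val checkpointDir = "/tmp/streaming/checkpoint"
val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext(checkpointDir))
ssc.start()
ssc.awaitTermination()
{code}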

> DirectKafkaInputDStream has not been initialized when recovery from checkpoint
> --
>
> Key: SPARK-6770
> URL: https://issues.apache.org/jira/browse/SPARK-6770
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: yangping wu
>
> I am reading data from Kafka using the createDirectStream method and saving the 
> received log to MySQL; the code snippet is as follows
> {code}
> def functionToCreateContext(): StreamingContext = {
>   val sparkConf = new SparkConf()
>   val sc = new SparkContext(sparkConf)
>   val ssc = new StreamingContext(sc, Seconds(10))
>   ssc.checkpoint("/tmp/kafka/channel/offset") // set checkpoint directory
>   ssc
> }
> val struct = StructType(StructField("log", StringType) ::Nil)
> // Get StreamingContext from checkpoint data or create a new one
> val ssc = StreamingContext.getOrCreate("/tmp/kafka/channel/offset", 
> functionToCreateContext)
> val SDB = KafkaUtils.createDirectStream[String, String, StringDecoder, 
> StringDecoder](ssc, kafkaParams, topics)
> val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext)
> SDB.foreachRDD(rdd => {
>   val result = rdd.map(item => {
> println(item)
> val result = item._2 match {
>   case e: String => Row.apply(e)
>   case _ => Row.apply("")
> }
> result
>   })
>   println(result.count())
>   val df = sqlContext.createDataFrame(result, struct)
>   df.insertIntoJDBC(url, "test", overwrite = false)
> })
> ssc.start()
> ssc.awaitTermination()
> ssc.stop()
> {code}
> But when I recover the program from the checkpoint, I encounter an exception:
> {code}
> Exception in thread "main" org.apache.spark.SparkException: 
> org.apache.spark.streaming.kafka.DirectKafkaInputDStream@41a80e5a has not 
> been initialized
>   at 
> org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:266)
>   at 
> org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:51)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
>   at scala.Option.orElse(Option.scala:257)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
>   at 
> org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:223)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:218)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:218)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:89)
>   at 
> org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:67)
>   at 
> org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:512)
>   at logstatstreaming.UserChannelTodb$.main(UserChannelTodb.scala:57)
>   at logstatstreaming.UserChannelTodb.main(UserChannelTodb.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   a

[jira] [Assigned] (SPARK-6796) Add the batch list to StreamingPage

2015-04-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6796:
---

Assignee: Apache Spark

> Add the batch list to StreamingPage
> ---
>
> Key: SPARK-6796
> URL: https://issues.apache.org/jira/browse/SPARK-6796
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming, Web UI
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> Show the list of active and completed batches on the StreamingPage, as 
> proposed in Task 1 in 
> https://docs.google.com/document/d/1-ZjvQ_2thWEQkTxRMHrVdnEI57XTi3wZEBUoqrrDg5c/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6796) Add the batch list to StreamingPage

2015-04-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6796:
---

Assignee: (was: Apache Spark)

> Add the batch list to StreamingPage
> ---
>
> Key: SPARK-6796
> URL: https://issues.apache.org/jira/browse/SPARK-6796
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming, Web UI
>Reporter: Shixiong Zhu
>
> Show the list of active and completed batches on the StreamingPage, as 
> proposed in Task 1 in 
> https://docs.google.com/document/d/1-ZjvQ_2thWEQkTxRMHrVdnEI57XTi3wZEBUoqrrDg5c/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6796) Add the batch list to StreamingPage

2015-04-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486668#comment-14486668
 ] 

Apache Spark commented on SPARK-6796:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/5434

> Add the batch list to StreamingPage
> ---
>
> Key: SPARK-6796
> URL: https://issues.apache.org/jira/browse/SPARK-6796
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming, Web UI
>Reporter: Shixiong Zhu
>
> Show the list of active and completed batches on the StreamingPage, as 
> proposed in Task 1 in 
> https://docs.google.com/document/d/1-ZjvQ_2thWEQkTxRMHrVdnEI57XTi3wZEBUoqrrDg5c/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6796) Add the batch list to StreamingPage

2015-04-08 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-6796:
---

 Summary: Add the batch list to StreamingPage
 Key: SPARK-6796
 URL: https://issues.apache.org/jira/browse/SPARK-6796
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming, Web UI
Reporter: Shixiong Zhu


Show the list of active and completed batches on the StreamingPage, as 
proposed in Task 1 in 
https://docs.google.com/document/d/1-ZjvQ_2thWEQkTxRMHrVdnEI57XTi3wZEBUoqrrDg5c/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6767) Documentation error in Spark SQL Readme file

2015-04-08 Thread Tijo Thomas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486650#comment-14486650
 ] 

Tijo Thomas commented on SPARK-6767:


Could you please change the status of this issue and assign it to me?

> Documentation error in Spark SQL Readme file
> 
>
> Key: SPARK-6767
> URL: https://issues.apache.org/jira/browse/SPARK-6767
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 1.3.0
>Reporter: Tijo Thomas
>Priority: Trivial
>
> Error in the Spark SQL documentation file. The sample script for the SQL DSL 
> throws the error below:
> scala> query.where('key > 30).select(avg('key)).collect()
> <console>:43: error: value > is not a member of Symbol
>   query.where('key > 30).select(avg('key)).collect()
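
For comparison, an equivalent form of the failing line that does not depend on an implicit Symbol-to-Column conversion being in scope (a minimal sketch, assuming a SQLContext named sqlContext and a registered table named src; not necessarily the exact fix applied to the README):
{code}
import org.apache.spark.sql.functions.{avg, col}

// Same query as above, but with explicit Column expressions instead of Symbols.
val query = sqlContext.table("src")
query.where(col("key") > 30).select(avg(col("key"))).collect()
{code}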



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6795) Avoid reading Parquet footers on driver side when a global arbitrative schema is available

2015-04-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6795:
--
  Description: 
With the help of [Parquet MR PR 
#91|https://github.com/apache/incubator-parquet-mr/pull/91] which will be 
included in the official release of Parquet MR 1.6.0, now it's possible to 
avoid reading footers on the driver side completely when a global arbitrative 
schema is available.

Currently, the global schema can be either the Hive metastore schema or a schema 
specified via the data sources DDL. All tasks should verify Parquet data files and reconcile 
possible schema conflicts locally against this global schema.

However, when no global schema is available and schema merging is enabled, we 
still need to read schemas from all data files to infer a valid global schema.
 Target Version/s: 1.4.0
Affects Version/s: 1.3.1
   1.1.1
   1.2.1
 Assignee: Cheng Lian

> Avoid reading Parquet footers on driver side when a global arbitrative 
> schema is available
> ---
>
> Key: SPARK-6795
> URL: https://issues.apache.org/jira/browse/SPARK-6795
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Critical
>
> With the help of [Parquet MR PR 
> #91|https://github.com/apache/incubator-parquet-mr/pull/91] which will be 
> included in the official release of Parquet MR 1.6.0, now it's possible to 
> avoid reading footers on the driver side completely when a global 
> arbitrative schema is available.
> Currently, the global schema can be either the Hive metastore schema or a schema 
> specified via the data sources DDL. All tasks should verify Parquet data files and 
> reconcile possible schema conflicts locally against this global schema.
> However, when no global schema is available and schema merging is enabled, we 
> still need to read schemas from all data files to infer a valid global schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6794) Speed up default BroadcastHashJoin performance by using kryo-based SparkSerializer

2015-04-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6794:
---

Assignee: Apache Spark

> Speed up default BroadcastHashJoin performance by using kryo-based 
> SparkSerializer
> --
>
> Key: SPARK-6794
> URL: https://issues.apache.org/jira/browse/SPARK-6794
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Volodymyr Lyubinets
>Assignee: Apache Spark
>
> This won't matter if Kryo is already used, but it will provide a speedup if it's 
> not. I'll submit a PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6794) Speed up default BroadcastHashJoin performance by using kryo-based SparkSerializer

2015-04-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6794:
---

Assignee: (was: Apache Spark)

> Speed up default BroadcastHashJoin performance by using kryo-based 
> SparkSerializer
> --
>
> Key: SPARK-6794
> URL: https://issues.apache.org/jira/browse/SPARK-6794
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Volodymyr Lyubinets
>
> This won't matter if Kryo is already used, but it will provide a speedup if it's 
> not. I'll submit a PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6794) Speed up default BroadcastHashJoin performance by using kryo-based SparkSerializer

2015-04-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486631#comment-14486631
 ] 

Apache Spark commented on SPARK-6794:
-

User 'vlyubin' has created a pull request for this issue:
https://github.com/apache/spark/pull/5433

> Speed up default BroadcastHashJoin performance by using kryo-based 
> SparkSerializer
> --
>
> Key: SPARK-6794
> URL: https://issues.apache.org/jira/browse/SPARK-6794
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Volodymyr Lyubinets
>
> This won't matter if Kryo is already used, but it will provide a speedup if it's 
> not. I'll submit a PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6795) Avoid reading Parquet footers on driver side when a global arbitrative schema is available

2015-04-08 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-6795:
-

 Summary: Avoid reading Parquet footers on driver side when a 
global arbitrative schema is available
 Key: SPARK-6795
 URL: https://issues.apache.org/jira/browse/SPARK-6795
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.2
Reporter: Cheng Lian
Priority: Critical






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-04-08 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486601#comment-14486601
 ] 

Josh Rosen commented on SPARK-5081:
---

I wrote a microbenchmark for snappy-java which shows an increase in compressed 
data sizes between the pre- and post-1.2 Snappy versions. I ran a bisect across 
published snappy-java releases and think that I've narrowed the problem down to 
a single patch.

I've opened https://github.com/xerial/snappy-java/issues/100 to investigate 
this upstream.

Note that this may not end up fully explaining this issue, since there could be 
multiple contributors to the shuffle file write size increase, but it seems 
suspicious.
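
For context, a rough sketch of the kind of size comparison described above (not the actual benchmark; the payload and class name are illustrative, and it assumes the org.xerial.snappy snappy-java artifact on the classpath):
{code}
import java.nio.charset.StandardCharsets
import org.xerial.snappy.Snappy

object SnappySizeCheck {
  def main(args: Array[String]): Unit = {
    // Shuffle-like, highly compressible payload: many short repeated records.
    val payload = Seq.fill(100000)("key_123\tsome_repeated_value").mkString("\n")
    val input = payload.getBytes(StandardCharsets.UTF_8)
    val compressed = Snappy.compress(input)
    // Run once per pinned snappy-java version and compare the compressed size.
    println(s"input=${input.length} bytes, compressed=${compressed.length} bytes")
  }
}
{code}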

> Shuffle write increases
> ---
>
> Key: SPARK-5081
> URL: https://issues.apache.org/jira/browse/SPARK-5081
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.2.0
>Reporter: Kevin Jung
>Priority: Critical
> Attachments: Spark_Debug.pdf, diff.txt
>
>
> The size of the shuffle write shown in the Spark web UI is very different when I 
> execute the same Spark job with the same input data in Spark 1.1 and Spark 1.2. 
> At the sortBy stage, the size of the shuffle write is 98.1MB in Spark 1.1 but 146.9MB 
> in Spark 1.2. 
> I set the spark.shuffle.manager option to hash because its default value has 
> changed, but Spark 1.2 still writes more shuffle output than Spark 1.1.
> This can increase disk I/O overhead significantly as the input file gets bigger, 
> and it causes the jobs to take more time to complete. 
> For about 100GB of input, for example, the size of the shuffle write is 
> 39.7GB in Spark 1.1 but 91.0GB in Spark 1.2.
> spark 1.1
> ||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
> |9|saveAsTextFile| |1169.4KB| |
> |12|combineByKey| |1265.4KB|1275.0KB|
> |6|sortByKey| |1276.5KB| |
> |8|mapPartitions| |91.0MB|1383.1KB|
> |4|apply| |89.4MB| |
> |5|sortBy|155.6MB| |98.1MB|
> |3|sortBy|155.6MB| | |
> |1|collect| |2.1MB| |
> |2|mapValues|155.6MB| |2.2MB|
> |0|first|184.4KB| | |
> spark 1.2
> ||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
> |12|saveAsTextFile| |1170.2KB| |
> |11|combineByKey| |1264.5KB|1275.0KB|
> |8|sortByKey| |1273.6KB| |
> |7|mapPartitions| |134.5MB|1383.1KB|
> |5|zipWithIndex| |132.5MB| |
> |4|sortBy|155.6MB| |146.9MB|
> |3|sortBy|155.6MB| | |
> |2|collect| |2.0MB| |
> |1|mapValues|155.6MB| |2.2MB|
> |0|first|184.4KB| | |



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6765) Turn scalastyle on for test code

2015-04-08 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6765:
---
Component/s: Tests
 Project Infra

> Turn scalastyle on for test code
> 
>
> Key: SPARK-6765
> URL: https://issues.apache.org/jira/browse/SPARK-6765
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra, Tests
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> We should turn scalastyle on for test code. Test code should be as important 
> as main code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6783) Add timing and test output for PR tests

2015-04-08 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6783:
---
Component/s: Project Infra

> Add timing and test output for PR tests
> ---
>
> Key: SPARK-6783
> URL: https://issues.apache.org/jira/browse/SPARK-6783
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Affects Versions: 1.3.0
>Reporter: Brennon York
>
> Currently the PR tests that run under {{dev/tests/*}} do not provide any 
> output within the actual Jenkins run. It would be nice to not only have error 
> output, but also timing results from each test and have those surfaced within 
> the Jenkins output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6794) Speed up default BroadcastHashJoin performance by using kryo-based SparkSerializer

2015-04-08 Thread Volodymyr Lyubinets (JIRA)
Volodymyr Lyubinets created SPARK-6794:
--

 Summary: Speed up default BroadcastHashJoin performance by using 
kryo-based SparkSerializer
 Key: SPARK-6794
 URL: https://issues.apache.org/jira/browse/SPARK-6794
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Volodymyr Lyubinets


This won't matter if Kryo is already used, but it will provide a speedup if it's not. 
I'll submit a PR shortly.
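
For reference, opting into Kryo explicitly looks like the following (a minimal sketch using the standard configuration key; the app name is illustrative), which is the case where this change would not matter:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Jobs that already set this serializer are unaffected by the proposed change.
val conf = new SparkConf()
  .setAppName("kryo-opt-in")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)
{code}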



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6399) Code compiled against 1.3.0 may not run against older Spark versions

2015-04-08 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486595#comment-14486595
 ] 

Patrick Wendell commented on SPARK-6399:


It would be good to document more clearly what compatibility we intend to 
provide. I am not so sure that forward compatibility is a stated or necessary 
goal for binary interfaces. I think we should just provide backwards 
compatibility for those interfaces (though in practice these will almost always 
be the same except for some issues like this with implicits).

The main area we've had really strict enforcement of forward compatibility has 
been around the serialization format of JSON logs, since we want it to be easy 
for people to use the Spark history server with newer versions of Spark in a 
multi-tenant cluster.

> Code compiled against 1.3.0 may not run against older Spark versions
> 
>
> Key: SPARK-6399
> URL: https://issues.apache.org/jira/browse/SPARK-6399
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 1.3.0
>Reporter: Marcelo Vanzin
>
> Commit 65b987c3 re-organized the implicit conversions of RDDs so that they're 
> easier to use. The problem is that scalac now generates code that will not 
> run on older Spark versions if those conversions are used.
> Basically, even if you explicitly import {{SparkContext._}}, scalac will 
> generate references to the new methods in the {{RDD}} object instead. So the 
> compiled code will reference code that doesn't exist in older versions of 
> Spark.
> You can work around this by explicitly calling the methods in the 
> {{SparkContext}} object, although that's a little ugly.
> We should at least document this limitation (if there's no way to fix it), 
> since I believe forwards compatibility in the API was also a goal.
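
For illustration, a minimal sketch of the workaround described above (it assumes an existing SparkContext named sc and an illustrative input path):
{code}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val pairs = sc.textFile("/tmp/input.txt").map(line => (line, 1))

// Relying on the implicit conversion makes scalac reference the methods on
// the RDD companion object in 1.3.0, which older releases do not have:
// pairs.reduceByKey(_ + _)

// Calling the conversion on the SparkContext object explicitly keeps the
// generated reference to a symbol that also exists in older versions:
SparkContext.rddToPairRDDFunctions(pairs).reduceByKey(_ + _).collect()
{code}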



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6778) SQL contexts in spark-shell and pyspark should both be called sqlContext

2015-04-08 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-6778.

Resolution: Duplicate

> SQL contexts in spark-shell and pyspark should both be called sqlContext
> 
>
> Key: SPARK-6778
> URL: https://issues.apache.org/jira/browse/SPARK-6778
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Shell
>Reporter: Matei Zaharia
>
> For some reason the Python one is only called sqlCtx. This is pretty 
> confusing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6784) Clean up all the inbound/outbound conversions for DateType

2015-04-08 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6784:
---
Component/s: SQL

> Clean up all the inbound/outbound conversions for DateType
> --
>
> Key: SPARK-6784
> URL: https://issues.apache.org/jira/browse/SPARK-6784
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Priority: Blocker
>
> We changed the JvmType of DateType to Int, but there are still some places 
> putting java.sql.Date into Row.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6785) DateUtils can not handle date before 1970/01/01 correctly

2015-04-08 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6785:
---
Component/s: SQL

> DateUtils can not handle date before 1970/01/01 correctly
> -
>
> Key: SPARK-6785
> URL: https://issues.apache.org/jira/browse/SPARK-6785
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>
> {code}
> scala> val d = new Date(100)
> d: java.sql.Date = 1969-12-31
> scala> DateUtils.toJavaDate(DateUtils.fromJavaDate(d))
> res1: java.sql.Date = 1970-01-01
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6792) pySpark groupByKey returns rows with the same key

2015-04-08 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-6792.

Resolution: Not A Problem

Resolving per Josh's comment.

> pySpark groupByKey returns rows with the same key
> -
>
> Key: SPARK-6792
> URL: https://issues.apache.org/jira/browse/SPARK-6792
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.3.0
>Reporter: Charles Hayden
>
> Under some circumstances, pySpark groupByKey returns two or more rows with 
> the same groupby key.
> It is not reproducible by a short example, but it can be seen in the 
> following program.
> The preservesPartitioning argument is required to see the failure.
> I ran this  with cluster_url=local[4], but I think it will also show up with 
> cluster_url=local.
> =
> {noformat}
> # The RDD.groupByKey sometimes gives two results with the same   key 
> value.  This is incorrect: all results with a single key need to be grouped 
> together.
> # Report the spark version
> from pyspark import SparkContext
> import StringIO
> import csv
> sc = SparkContext()
> print sc.version
> def loadRecord(line):
> input = StringIO.StringIO(line)
> reader = csv.reader(input, delimiter='\t')
> return reader.next()
> # Read data from movielens dataset
> # This can be obtained from 
> http://files.grouplens.org/datasets/movielens/ml-100k.zip
> inputFile = 'u.data'
> input = sc.textFile(inputFile)
> data = input.map(loadRecord)
> # Trim off unneeded fields
> data = data.map(lambda row: row[0:2])
> print 'Data Sample'
> print data.take(10)
> # Use join to filter the data
> #
> # map builds left key
> # map builds right key
> # join
> # map throws away the key and gets result
> # pick a couple of users
> j = sc.parallelize([789, 939])
> # key left
> # conversion to str is required to show the error
> keyed_j = j.map(lambda row: (str(row), None))
> # key right
> keyed_rdd = data.map(lambda row: (str(row[0]), row))
> # join
> joined = keyed_rdd.join(keyed_j)
> # throw away key
> # preservesPartitioning is required to show the error
> res = joined.map(lambda row: row[1][0], preservesPartitioning=True)
> #res = joined.map(lambda row: row[1][0])  # no error
> print 'Filtered Sample'
> print res.take(10)
> #print res.count()
> # Do the groupby
> # There should be fewer rows
> keyed_rdd = res.map(lambda row: (row[1], row), 
> preservesPartitioning=True)
> print 'Input Count', keyed_rdd.count()
> grouped_rdd = keyed_rdd.groupByKey()
> print 'Grouped Count', grouped_rdd.count()
> # There are two rows with the same key !
> print 'Group Output Sample'
> print grouped_rdd.filter(lambda row: row[0] == '508').take(10)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6696) HiveContext.refreshTable is missing in PySpark

2015-04-08 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-6696.
-
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5349
[https://github.com/apache/spark/pull/5349]

> HiveContext.refreshTable is missing in PySpark
> --
>
> Key: SPARK-6696
> URL: https://issues.apache.org/jira/browse/SPARK-6696
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 1.4.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6451) Support CombineSum in Code Gen

2015-04-08 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-6451.
-
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5138
[https://github.com/apache/spark/pull/5138]

> Support CombineSum in Code Gen
> --
>
> Key: SPARK-6451
> URL: https://issues.apache.org/jira/browse/SPARK-6451
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Venkata Ramana G
>Priority: Blocker
> Fix For: 1.4.0
>
>
> Since we are using CombineSum at the reducer side for the SUM function, we 
> need to make it work in code gen. Otherwise, code gen will not convert 
> Aggregates with a SUM function to GeneratedAggregates (the code gen version).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6479) Create off-heap block storage API (internal)

2015-04-08 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated SPARK-6479:
--
Attachment: spark-6479-tachyon.patch

Patch with the Tachyon migration. Not a complete patch, as it would add too much noise 
just to rename tachyon to offheap.

> Create off-heap block storage API (internal)
> 
>
> Key: SPARK-6479
> URL: https://issues.apache.org/jira/browse/SPARK-6479
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Reporter: Reynold Xin
> Attachments: SPARK-6479.pdf, SPARK-6479OffheapAPIdesign (1).pdf, 
> SPARK-6479OffheapAPIdesign.pdf, SparkOffheapsupportbyHDFS.pdf, 
> spark-6479-tachyon.patch
>
>
> Would be great to create APIs for off-heap block stores, rather than doing a 
> bunch of if statements everywhere.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6766) StreamingListenerBatchSubmitted isn't sent and StreamingListenerBatchStarted.batchInfo.processingStartTime is a wrong value

2015-04-08 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-6766:
-
Assignee: Shixiong Zhu

> StreamingListenerBatchSubmitted isn't sent and 
> StreamingListenerBatchStarted.batchInfo.processingStartTime is a wrong value
> ---
>
> Key: SPARK-6766
> URL: https://issues.apache.org/jira/browse/SPARK-6766
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> 1. Currently there is no place that posts StreamingListenerBatchSubmitted. It should 
> be posted when a JobSet is submitted.
> 2. Calling JobSet.handleJobStart before posting StreamingListenerBatchStarted 
> sets StreamingListenerBatchStarted.batchInfo.processingStartTime to None, 
> when it should have been set to a correct value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4705) Driver retries in cluster mode always fail if event logging is enabled

2015-04-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4705:
---

Assignee: (was: Apache Spark)

> Driver retries in cluster mode always fail if event logging is enabled
> --
>
> Key: SPARK-4705
> URL: https://issues.apache.org/jira/browse/SPARK-4705
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.2.0
>Reporter: Marcelo Vanzin
> Attachments: Screen Shot 2015-02-10 at 6.27.49 pm.png, Updated UI - 
> II.png
>
>
> yarn-cluster mode will retry running the driver in certain failure modes. If 
> event logging is enabled, this will most probably fail, because:
> {noformat}
> Exception in thread "Driver" java.io.IOException: Log directory 
> hdfs://vanzin-krb-1.vpc.cloudera.com:8020/user/spark/applicationHistory/application_1417554558066_0003
>  already exists!
> at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:129)
> at org.apache.spark.util.FileLogger.start(FileLogger.scala:115)
> at 
> org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:353)
> {noformat}
> The event log path should be "more unique". Or perhaps retries of the same app 
> should clean up the old logs first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4705) Driver retries in cluster mode always fail if event logging is enabled

2015-04-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4705:
---

Assignee: Apache Spark

> Driver retries in cluster mode always fail if event logging is enabled
> --
>
> Key: SPARK-4705
> URL: https://issues.apache.org/jira/browse/SPARK-4705
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.2.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
> Attachments: Screen Shot 2015-02-10 at 6.27.49 pm.png, Updated UI - 
> II.png
>
>
> yarn-cluster mode will retry running the driver in certain failure modes. If 
> event logging is enabled, this will most probably fail, because:
> {noformat}
> Exception in thread "Driver" java.io.IOException: Log directory 
> hdfs://vanzin-krb-1.vpc.cloudera.com:8020/user/spark/applicationHistory/application_1417554558066_0003
>  already exists!
> at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:129)
> at org.apache.spark.util.FileLogger.start(FileLogger.scala:115)
> at 
> org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:353)
> {noformat}
> The event log path should be "more unique". Or perhaps retries of the same app 
> should clean up the old logs first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4705) Driver retries in cluster mode always fail if event logging is enabled

2015-04-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486434#comment-14486434
 ] 

Apache Spark commented on SPARK-4705:
-

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/5432

> Driver retries in cluster mode always fail if event logging is enabled
> --
>
> Key: SPARK-4705
> URL: https://issues.apache.org/jira/browse/SPARK-4705
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.2.0
>Reporter: Marcelo Vanzin
> Attachments: Screen Shot 2015-02-10 at 6.27.49 pm.png, Updated UI - 
> II.png
>
>
> yarn-cluster mode will retry running the driver in certain failure modes. If 
> event logging is enabled, this will most probably fail, because:
> {noformat}
> Exception in thread "Driver" java.io.IOException: Log directory 
> hdfs://vanzin-krb-1.vpc.cloudera.com:8020/user/spark/applicationHistory/application_1417554558066_0003
>  already exists!
> at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:129)
> at org.apache.spark.util.FileLogger.start(FileLogger.scala:115)
> at 
> org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:353)
> {noformat}
> The event log path should be "more unique". Or perhaps retries of the same app 
> should clean up the old logs first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5957) Better handling of default parameter values.

2015-04-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486432#comment-14486432
 ] 

Apache Spark commented on SPARK-5957:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/5431

> Better handling of default parameter values.
> 
>
> Key: SPARK-5957
> URL: https://issues.apache.org/jira/browse/SPARK-5957
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> We store the default value of a parameter in the Param instance. In many 
> cases, the default value depends on the algorithm and other parameters 
> defined in the same algorithm. We need to think a better approach to handle 
> default parameter values.
> The design doc was posted in the parent JIRA: 
> https://issues.apache.org/jira/browse/SPARK-5874



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6792) pySpark groupByKey returns rows with the same key

2015-04-08 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486398#comment-14486398
 ] 

Josh Rosen commented on SPARK-6792:
---

I think that this is a misuse of {{preservesPartitioning}} and not a Spark bug. 
 I think the problem occurs in this line:

{code}
keyed_rdd = res.map(lambda row: (row[1], row), preservesPartitioning=True)
{code}

Up until this point, you have RDDs that were partitioned based on the first 
column of the record, but here I think you're taking the second column of the 
row and using it as a key.  I think that this example will work correctly if 
you omit this final {{preservesPartitioning}} call, since this map is changing 
the key column.
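
The same effect can be sketched with the Scala API (illustrative numbers and app name; not the reporter's program), where claiming preservesPartitioning while changing the key lets groupByKey skip the shuffle against a stale partitioner:
{code}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._

val sc = new SparkContext(
  new SparkConf().setAppName("preserves-partitioning-demo").setMaster("local[4]"))

// Hash-partition the data by its original string key.
val byOriginalKey = sc.parallelize(1 to 1000)
  .map(i => (i.toString, i))
  .partitionBy(new HashPartitioner(4))

// Change the key but (incorrectly) claim that the old partitioning still holds.
val rekeyed = byOriginalKey.mapPartitions(
  iter => iter.map { case (_, v) => ((v % 10).toString, v) },
  preservesPartitioning = true)

// groupByKey trusts the stale partitioner and skips the shuffle, so records that
// share a new key can stay in different partitions and come out as separate
// groups; without the flag there are exactly 10 groups here.
println(rekeyed.groupByKey().count())
{code}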

> pySpark groupByKey returns rows with the same key
> -
>
> Key: SPARK-6792
> URL: https://issues.apache.org/jira/browse/SPARK-6792
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.3.0
>Reporter: Charles Hayden
>
> Under some circumstances, pySpark groupByKey returns two or more rows with 
> the same groupby key.
> It is not reproducible by a short example, but it can be seen in the 
> following program.
> The preservesPartitioning argument is required to see the failure.
> I ran this  with cluster_url=local[4], but I think it will also show up with 
> cluster_url=local.
> =
> {noformat}
> # The RDD.groupByKey sometimes gives two results with the same   key 
> value.  This is incorrect: all results with a single key need to be grouped 
> together.
> # Report the spark version
> from pyspark import SparkContext
> import StringIO
> import csv
> sc = SparkContext()
> print sc.version
> def loadRecord(line):
> input = StringIO.StringIO(line)
> reader = csv.reader(input, delimiter='\t')
> return reader.next()
> # Read data from movielens dataset
> # This can be obtained from 
> http://files.grouplens.org/datasets/movielens/ml-100k.zip
> inputFile = 'u.data'
> input = sc.textFile(inputFile)
> data = input.map(loadRecord)
> # Trim off unneeded fields
> data = data.map(lambda row: row[0:2])
> print 'Data Sample'
> print data.take(10)
> # Use join to filter the data
> #
> # map builds left key
> # map builds right key
> # join
> # map throws away the key and gets result
> # pick a couple of users
> j = sc.parallelize([789, 939])
> # key left
> # conversion to str is required to show the error
> keyed_j = j.map(lambda row: (str(row), None))
> # key right
> keyed_rdd = data.map(lambda row: (str(row[0]), row))
> # join
> joined = keyed_rdd.join(keyed_j)
> # throw away key
> # preservesPartitioning is required to show the error
> res = joined.map(lambda row: row[1][0], preservesPartitioning=True)
> #res = joined.map(lambda row: row[1][0])  # no error
> print 'Filtered Sample'
> print res.take(10)
> #print res.count()
> # Do the groupby
> # There should be fewer rows
> keyed_rdd = res.map(lambda row: (row[1], row), 
> preservesPartitioning=True)
> print 'Input Count', keyed_rdd.count()
> grouped_rdd = keyed_rdd.groupByKey()
> print 'Grouped Count', grouped_rdd.count()
> # There are two rows with the same key !
> print 'Group Output Sample'
> print grouped_rdd.filter(lambda row: row[0] == '508').take(10)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6793) Implement perplexity for LDA

2015-04-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6793:
-
Description: LDA should be able to compute perplexity.  This JIRA is for 
computing it on the training dataset.  See the linked JIRA for computing it on 
a new corpus: [SPARK-5567]  (was: LDA does not currently support prediction on 
new datasets.  It should.  The prediction methods should include:
* Computing topic distributions for new documents
* Computing data metrics: log likelihood, perplexity

This task should probably be split up into sub-tasks for each prediction 
method, though we should think about whether code should be shared (and whether 
the return type should be able to produce all of these results since they 
require similar computation).)

> Implement perplexity for LDA
> 
>
> Key: SPARK-6793
> URL: https://issues.apache.org/jira/browse/SPARK-6793
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> LDA should be able to compute perplexity.  This JIRA is for computing it on 
> the training dataset.  See the linked JIRA for computing it on a new corpus: 
> [SPARK-5567]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6793) Implement perplexity for LDA

2015-04-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6793:
-
Summary: Implement perplexity for LDA  (was: Implement prediction methods 
for LDA)

> Implement perplexity for LDA
> 
>
> Key: SPARK-6793
> URL: https://issues.apache.org/jira/browse/SPARK-6793
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> LDA does not currently support prediction on new datasets.  It should.  The 
> prediction methods should include:
> * Computing topic distributions for new documents
> * Computing data metrics: log likelihood, perplexity
> This task should probably be split up into sub-tasks for each prediction 
> method, though we should think about whether code should be shared (and 
> whether the return type should be able to produce all of these results since 
> they require similar computation).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6793) Implement prediction methods for LDA

2015-04-08 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-6793:


 Summary: Implement prediction methods for LDA
 Key: SPARK-6793
 URL: https://issues.apache.org/jira/browse/SPARK-6793
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley


LDA does not currently support prediction on new datasets.  It should.  The 
prediction methods should include:
* Computing topic distributions for new documents
* Computing data metrics: log likelihood, perplexity

This task should probably be split up into sub-tasks for each prediction 
method, though we should think about whether code should be shared (and whether 
the return type should be able to produce all of these results since they 
require similar computation).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6792) pySpark groupByKey returns rows with the same key

2015-04-08 Thread Charles Hayden (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Hayden updated SPARK-6792:
--
Description: 
Under some circumstances, pySpark groupByKey returns two or more rows with the 
same groupby key.

It is not reproducible by a short example, but it can be seen in the following 
program.
The preservesPartitioning argument is required to see the failure.
I ran this  with cluster_url=local[4], but I think it will also show up with 
cluster_url=local.
=
{noformat}
# The RDD.groupByKey sometimes gives two results with the same   key value. 
 This is incorrect: all results with a single key need to be grouped together.

# Report the spark version
from pyspark import SparkContext
import StringIO
import csv
sc = SparkContext()
print sc.version


def loadRecord(line):
input = StringIO.StringIO(line)
reader = csv.reader(input, delimiter='\t')
return reader.next()

# Read data from movielens dataset
# This can be obtained from 
http://files.grouplens.org/datasets/movielens/ml-100k.zip
inputFile = 'u.data'
input = sc.textFile(inputFile)
data = input.map(loadRecord)
# Trim off unneeded fields
data = data.map(lambda row: row[0:2])
print 'Data Sample'
print data.take(10)


# Use join to filter the data
#
# map builds left key
# map builds right key
# join
# map throws away the key and gets result

# pick a couple of users
j = sc.parallelize([789, 939])
# key left
# conversion to str is required to show the error
keyed_j = j.map(lambda row: (str(row), None))
# key right
keyed_rdd = data.map(lambda row: (str(row[0]), row))
# join
joined = keyed_rdd.join(keyed_j)
# throw away key
# preservesPartitioning is required to show the error
res = joined.map(lambda row: row[1][0], preservesPartitioning=True)
#res = joined.map(lambda row: row[1][0])  # no error

print 'Filtered Sample'
print res.take(10)
#print res.count()


# Do the groupby
# There should be fewer rows
keyed_rdd = res.map(lambda row: (row[1], row), 
preservesPartitioning=True)
print 'Input Count', keyed_rdd.count()
grouped_rdd = keyed_rdd.groupByKey()
print 'Grouped Count', grouped_rdd.count()


# There are two rows with the same key !
print 'Group Output Sample'
print grouped_rdd.filter(lambda row: row[0] == '508').take(10)
{noformat}

  was:
Under some circumstances, pySpark groupByKey returns two or more rows with the 
same groupby key.

It is not reproducible by a short example, but it can be seen in the following 
program.
The preservesPartitioning argument is required to see the failure.
I ran this  with cluster_url=local[4], but I think it will also show up with 
cluster_url=local.
=
{quote}
# The RDD.groupByKey sometimes gives two results with the same   key value. 
 This is incorrect: all results with a single key need to be grouped together.

# Report the spark version
from pyspark import SparkContext
import StringIO
import csv
sc = SparkContext()
print sc.version


def loadRecord(line):
input = StringIO.StringIO(line)
reader = csv.reader(input, delimiter='\t')
return reader.next()

# Read data from movielens dataset
# This can be obtained from 
http://files.grouplens.org/datasets/movielens/ml-100k.zip
inputFile = 'u.data'
input = sc.textFile(inputFile)
data = input.map(loadRecord)
# Trim off unneeded fields
data = data.map(lambda row: row[0:2])
print 'Data Sample'
print data.take(10)


# Use join to filter the data
#
# map builds left key
# map builds right key
# join
# map throws away the key and gets result

# pick a couple of users
j = sc.parallelize([789, 939])
# key left
# conversion to str is required to show the error
keyed_j = j.map(lambda row: (str(row), None))
# key right
keyed_rdd = data.map(lambda row: (str(row[0]), row))
# join
joined = keyed_rdd.join(keyed_j)
# throw away key
# preservesPartitioning is required to show the error
res = joined.map(lambda row: row[1][0], preservesPartitioning=True)
#res = joined.map(lambda row: row[1][0])  # no error

print 'Filtered Sample'
print res.take(10)
#print res.count()


# Do the groupby
# There should be fewer rows
keyed_rdd = res.map(lambda row: (row[1], row), 
preservesPartitioning=True)
print 'Input Count', keyed_rdd.count()
grouped_rdd = keyed_rdd.groupByKey()
print 'Grouped Count', grouped_rdd.count()


# There are two rows with the same key !
print 'Group Output Sample'
print grouped_rdd.filter(lambda row: row[0] == '508').take(10)
{quote}


> pySpark group

[jira] [Updated] (SPARK-6792) pySpark groupByKey returns rows with the same key

2015-04-08 Thread Charles Hayden (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Hayden updated SPARK-6792:
--
Description: 
Under some circumstances, pySpark groupByKey returns two or more rows with the 
same groupby key.

It is not reproducible by a short example, but it can be seen in the following 
program.
The preservesPartitioning argument is required to see the failure.
I ran this  with cluster_url=local[4], but I think it will also show up with 
cluster_url=local.
=
{noformat}
# The RDD.groupByKey sometimes gives two results with the same   key value. 
 This is incorrect: all results with a single key need to be grouped together.

# Report the spark version
from pyspark import SparkContext
import StringIO
import csv
sc = SparkContext()
print sc.version


def loadRecord(line):
input = StringIO.StringIO(line)
reader = csv.reader(input, delimiter='\t')
return reader.next()

# Read data from movielens dataset
# This can be obtained from 
http://files.grouplens.org/datasets/movielens/ml-100k.zip
inputFile = 'u.data'
input = sc.textFile(inputFile)
data = input.map(loadRecord)
# Trim off unneeded fields
data = data.map(lambda row: row[0:2])
print 'Data Sample'
print data.take(10)


# Use join to filter the data
#
# map builds left key
# map builds right key
# join
# map throws away the key and gets result

# pick a couple of users
j = sc.parallelize([789, 939])
# key left
# conversion to str is required to show the error
keyed_j = j.map(lambda row: (str(row), None))
# key right
keyed_rdd = data.map(lambda row: (str(row[0]), row))
# join
joined = keyed_rdd.join(keyed_j)
# throw away key
# preservesPartitioning is required to show the error
res = joined.map(lambda row: row[1][0], preservesPartitioning=True)
#res = joined.map(lambda row: row[1][0])  # no error

print 'Filtered Sample'
print res.take(10)
#print res.count()


# Do the groupby
# There should be fewer rows
keyed_rdd = res.map(lambda row: (row[1], row), 
preservesPartitioning=True)
print 'Input Count', keyed_rdd.count()
grouped_rdd = keyed_rdd.groupByKey()
print 'Grouped Count', grouped_rdd.count()


# There are two rows with the same key !
print 'Group Output Sample'
print grouped_rdd.filter(lambda row: row[0] == '508').take(10)
{noformat}

  was:
Under some circumstances, pySpark groupByKey returns two or more rows with the 
same groupby key.

It is not reproducible by a short example, but it can be seen in the following 
program.
The preservesPartitioning argument is required to see the failure.
I ran this  with cluster_url=local[4], but I think it will also show up with 
cluster_url=local.
=
{noformat}
# The RDD.groupByKey sometimes gives two results with the same   key value. 
 This is incorrect: all results with a single key need to be grouped together.

# Report the spark version
from pyspark import SparkContext
import StringIO
import csv
sc = SparkContext()
print sc.version


def loadRecord(line):
input = StringIO.StringIO(line)
reader = csv.reader(input, delimiter='\t')
return reader.next()

# Read data from movielens dataset
# This can be obtained from 
http://files.grouplens.org/datasets/movielens/ml-100k.zip
inputFile = 'u.data'
input = sc.textFile(inputFile)
data = input.map(loadRecord)
# Trim off unneeded fields
data = data.map(lambda row: row[0:2])
print 'Data Sample'
print data.take(10)


# Use join to filter the data
#
# map builds left key
# map builds right key
# join
# map throws away the key and gets result

# pick a couple of users
j = sc.parallelize([789, 939])
# key left
# conversion to str is required to show the error
keyed_j = j.map(lambda row: (str(row), None))
# key right
keyed_rdd = data.map(lambda row: (str(row[0]), row))
# join
joined = keyed_rdd.join(keyed_j)
# throw away key
# preservesPartitioning is required to show the error
res = joined.map(lambda row: row[1][0], preservesPartitioning=True)
#res = joined.map(lambda row: row[1][0])  # no error

print 'Filtered Sample'
print res.take(10)
#print res.count()


# Do the groupby
# There should be fewer rows
keyed_rdd = res.map(lambda row: (row[1], row), 
preservesPartitioning=True)
print 'Input Count', keyed_rdd.count()
grouped_rdd = keyed_rdd.groupByKey()
print 'Grouped Count', grouped_rdd.count()


# There are two rows with the same key !
print 'Group Output Sample'
print grouped_rdd.filter(lambda row: row[0] == '508').take(10)
{noformat}

[jira] [Assigned] (SPARK-6479) Create off-heap block storage API (internal)

2015-04-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6479:
---

Assignee: (was: Apache Spark)

> Create off-heap block storage API (internal)
> 
>
> Key: SPARK-6479
> URL: https://issues.apache.org/jira/browse/SPARK-6479
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Reporter: Reynold Xin
> Attachments: SPARK-6479.pdf, SPARK-6479OffheapAPIdesign (1).pdf, 
> SPARK-6479OffheapAPIdesign.pdf, SparkOffheapsupportbyHDFS.pdf
>
>
> Would be great to create APIs for off-heap block stores, rather than doing a 
> bunch of if statements everywhere.
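To make the "API instead of if statements" idea concrete, a pluggable block-store contract might look roughly like the sketch below. It is illustrative only and written in Python for brevity; the class and method names (OffHeapBlockStore, put/get/remove) are assumptions, not Spark's internals.
{code}
# Illustrative contract for an external/off-heap block store backend.
class OffHeapBlockStore(object):
    def put(self, block_id, data):
        """Write a serialized block to the external store."""
        raise NotImplementedError

    def get(self, block_id):
        """Read a block back, or return None if it is absent."""
        raise NotImplementedError

    def remove(self, block_id):
        """Delete a block; return True if it existed."""
        raise NotImplementedError

# The block manager would dispatch to whichever backend is configured
# (e.g. Tachyon or HDFS) instead of branching on storage type inline.
{code}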



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6479) Create off-heap block storage API (internal)

2015-04-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486335#comment-14486335
 ] 

Apache Spark commented on SPARK-6479:
-

User 'zhzhan' has created a pull request for this issue:
https://github.com/apache/spark/pull/5430

> Create off-heap block storage API (internal)
> 
>
> Key: SPARK-6479
> URL: https://issues.apache.org/jira/browse/SPARK-6479
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Reporter: Reynold Xin
> Attachments: SPARK-6479.pdf, SPARK-6479OffheapAPIdesign (1).pdf, 
> SPARK-6479OffheapAPIdesign.pdf, SparkOffheapsupportbyHDFS.pdf
>
>
> Would be great to create APIs for off-heap block stores, rather than doing a 
> bunch of if statements everywhere.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6479) Create off-heap block storage API (internal)

2015-04-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6479:
---

Assignee: Apache Spark

> Create off-heap block storage API (internal)
> 
>
> Key: SPARK-6479
> URL: https://issues.apache.org/jira/browse/SPARK-6479
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Reporter: Reynold Xin
>Assignee: Apache Spark
> Attachments: SPARK-6479.pdf, SPARK-6479OffheapAPIdesign (1).pdf, 
> SPARK-6479OffheapAPIdesign.pdf, SparkOffheapsupportbyHDFS.pdf
>
>
> Would be great to create APIs for off-heap block stores, rather than doing a 
> bunch of if statements everywhere.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6792) pySpark groupByKey returns rows with the same key

2015-04-08 Thread Charles Hayden (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Hayden updated SPARK-6792:
--
Description: 
Under some circumstances, pySpark groupByKey returns two or more rows with the 
same groupby key.

It is not reproducible by a short example, but it can be seen in the following 
program.
The preservesPartitioning argument is required to see the failure.
I ran this  with cluster_url=local[4], but I think it will also show up with 
cluster_url=local.
=
{quote}
# The RDD.groupByKey sometimes gives two results with the same   key value. 
 This is incorrect: all results with a single key need to be grouped together.

# Report the spark version
from pyspark import SparkContext
import StringIO
import csv
sc = SparkContext()
print sc.version


def loadRecord(line):
input = StringIO.StringIO(line)
reader = csv.reader(input, delimiter='\t')
return reader.next()

# Read data from movielens dataset
# This can be obtained from 
http://files.grouplens.org/datasets/movielens/ml-100k.zip
inputFile = 'u.data'
input = sc.textFile(inputFile)
data = input.map(loadRecord)
# Trim off unneeded fields
data = data.map(lambda row: row[0:2])
print 'Data Sample'
print data.take(10)


# Use join to filter the data
#
# map builds left key
# map builds right key
# join
# map throws away the key and gets result

# pick a couple of users
j = sc.parallelize([789, 939])
# key left
# conversion to str is required to show the error
keyed_j = j.map(lambda row: (str(row), None))
# key right
keyed_rdd = data.map(lambda row: (str(row[0]), row))
# join
joined = keyed_rdd.join(keyed_j)
# throw away key
# preservesPartitioning is required to show the error
res = joined.map(lambda row: row[1][0], preservesPartitioning=True)
#res = joined.map(lambda row: row[1][0])  # no error

print 'Filtered Sample'
print res.take(10)
#print res.count()


# Do the groupby
# There should be fewer rows
keyed_rdd = res.map(lambda row: (row[1], row), 
preservesPartitioning=True)
print 'Input Count', keyed_rdd.count()
grouped_rdd = keyed_rdd.groupByKey()
print 'Grouped Count', grouped_rdd.count()


# There are two rows with the same key !
print 'Group Output Sample'
print grouped_rdd.filter(lambda row: row[0] == '508').take(10)
{quote}

  was:
Under some circumstances, pySpark groupByKey returns two or more rows with the 
same groupby key.

It is not reproducible by a short example, but it can be seen in the following 
program.
The preservesPartitioning argument is required to see the failure.
I ran this  with cluster_url=local[4], but I think it will also show up with 
cluster_url=local.
=
{quote}
# The RDD.groupByKey sometimes gives two results with the same   key value. 
 This is incorrect: all results with a single key need to be grouped together.

# Report the spark version
from pyspark import SparkContext
import StringIO
import csv
sc = SparkContext()
print sc.version


def loadRecord(line):
input = StringIO.StringIO(line)
reader = csv.reader(input, delimiter='\t')
return reader.next()

# Read data from movielens dataset
# This can be obtained from 
http://files.grouplens.org/datasets/movielens/ml-100k.zip
inputFile = 'u.data'
input = sc.textFile(inputFile)
data = input.map(loadRecord)
# Trim off unneeded fields
data = data.map(lambda row: row[0:2])
print 'Data Sample'
print data.take(10)


# Use join to filter the data
#
# map builds left key
# map builds right key
# join
# map throws away the key and gets result

# pick a couple of users
j = sc.parallelize([789, 939])
# key left
# conversion to str is required to show the error
keyed_j = j.map(lambda row: (str(row), None))
# key right
keyed_rdd = data.map(lambda row: (str(row[0]), row))
# join
joined = keyed_rdd.join(keyed_j)
# throw away key
# preservesPartitioning is required to show the error
res = joined.map(lambda row: row[1][0], preservesPartitioning=True)
#res = joined.map(lambda row: row[1][0])  # no error

print 'Filtered Sample'
print res.take(10)
#print res.count()


# Do the groupby
# There should be fewer rows
keyed_rdd = res.map(lambda row: (row[1], row), 
preservesPartitioning=True)
print 'Input Count', keyed_rdd.count()
grouped_rdd = keyed_rdd.groupByKey()
print 'Grouped Count', grouped_rdd.count()


# There are two rows with the same key !
print 'Group Output Sample'
print grouped_rdd.filter(lambda row: row[0] == '508').take(10)
{quote}


> pySpark groupByKey retu

[jira] [Updated] (SPARK-6792) pySpark groupByKey returns rows with the same key

2015-04-08 Thread Charles Hayden (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Hayden updated SPARK-6792:
--
Description: 
Under some circumstances, pySpark groupByKey returns two or more rows with the 
same groupby key.

It is not reproducible by a short example, but it can be seen in the following 
program.
The preservesPartitioning argument is required to see the failure.
I ran this  with cluster_url=local[4], but I think it will also show up with 
cluster_url=local.
=
{quote}
# The RDD.groupByKey sometimes gives two results with the same   key value. 
 This is incorrect: all results with a single key need to be grouped together.

# Report the spark version
from pyspark import SparkContext
import StringIO
import csv
sc = SparkContext()
print sc.version


def loadRecord(line):
input = StringIO.StringIO(line)
reader = csv.reader(input, delimiter='\t')
return reader.next()

# Read data from movielens dataset
# This can be obtained from 
http://files.grouplens.org/datasets/movielens/ml-100k.zip
inputFile = 'u.data'
input = sc.textFile(inputFile)
data = input.map(loadRecord)
# Trim off unneeded fields
data = data.map(lambda row: row[0:2])
print 'Data Sample'
print data.take(10)


# Use join to filter the data
#
# map builds left key
# map builds right key
# join
# map throws away the key and gets result

# pick a couple of users
j = sc.parallelize([789, 939])
# key left
# conversion to str is required to show the error
keyed_j = j.map(lambda row: (str(row), None))
# key right
keyed_rdd = data.map(lambda row: (str(row[0]), row))
# join
joined = keyed_rdd.join(keyed_j)
# throw away key
# preservesPartitioning is required to show the error
res = joined.map(lambda row: row[1][0], preservesPartitioning=True)
#res = joined.map(lambda row: row[1][0])  # no error

print 'Filtered Sample'
print res.take(10)
#print res.count()


# Do the groupby
# There should be fewer rows
keyed_rdd = res.map(lambda row: (row[1], row), 
preservesPartitioning=True)
print 'Input Count', keyed_rdd.count()
grouped_rdd = keyed_rdd.groupByKey()
print 'Grouped Count', grouped_rdd.count()


# There are two rows with the same key !
print 'Group Output Sample'
print grouped_rdd.filter(lambda row: row[0] == '508').take(10)
{quote}

  was:
Under some circumstances, pySpark groupByKey returns two or more rows with the 
same groupby key.

It is not reproducible by a short example, but it can be seen in the following 
program.
The preservesPartitioning argument is required to see the failure.
I ran this  with cluster_url=local[4], but I think it will also show up with 
cluster_url=local.
=

# The RDD.groupByKey sometimes gives two results with the same   key value. 
 This is incorrect: all results with a single key need to be grouped together.

# Report the spark version
from pyspark import SparkContext
import StringIO
import csv
sc = SparkContext()
print sc.version


def loadRecord(line):
input = StringIO.StringIO(line)
reader = csv.reader(input, delimiter='\t')
return reader.next()

# Read data from movielens dataset
# This can be obtained from 
http://files.grouplens.org/datasets/movielens/ml-100k.zip
inputFile = 'u.data'
input = sc.textFile(inputFile)
data = input.map(loadRecord)
# Trim off unneeded fields
data = data.map(lambda row: row[0:2])
print 'Data Sample'
print data.take(10)


# Use join to filter the data
#
# map builds left key
# map builds right key
# join
# map throws away the key and gets result

# pick a couple of users
j = sc.parallelize([789, 939])
# key left
# conversion to str is required to show the error
keyed_j = j.map(lambda row: (str(row), None))
# key right
keyed_rdd = data.map(lambda row: (str(row[0]), row))
# join
joined = keyed_rdd.join(keyed_j)
# throw away key
# preservesPartitioning is required to show the error
res = joined.map(lambda row: row[1][0], preservesPartitioning=True)
#res = joined.map(lambda row: row[1][0])  # no error

print 'Filtered Sample'
print res.take(10)
#print res.count()


# Do the groupby
# There should be fewer rows
keyed_rdd = res.map(lambda row: (row[1], row), 
preservesPartitioning=True)
print 'Input Count', keyed_rdd.count()
grouped_rdd = keyed_rdd.groupByKey()
print 'Grouped Count', grouped_rdd.count()


# There are two rows with the same key !
print 'Group Output Sample'
print grouped_rdd.filter(lambda row: row[0] == '508').take(10)



> pySpark groupByKey returns rows with the 

[jira] [Commented] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact

2015-04-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486332#comment-14486332
 ] 

Apache Spark commented on SPARK-4925:
-

User 'chernetsov' has created a pull request for this issue:
https://github.com/apache/spark/pull/5429

> Publish Spark SQL hive-thriftserver maven artifact 
> ---
>
> Key: SPARK-4925
> URL: https://issues.apache.org/jira/browse/SPARK-4925
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 1.2.1, 1.3.0
>Reporter: Alex Liu
>Assignee: Patrick Wendell
>Priority: Critical
>
> The hive-thriftserver maven artifact is needed for integrating Spark SQL with 
> Cassandra.
> Can we publish it to maven?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6792) pySpark groupByKey returns rows with the same key

2015-04-08 Thread Charles Hayden (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Hayden updated SPARK-6792:
--
Description: 
Under some circumstances, pySpark groupByKey returns two or more rows with the 
same groupby key.

It is not reproducible by a short example, but it can be seen in the following 
program.
The preservesPartitioning argument is required to see the failure.
I ran this  with cluster_url=local[4], but I think it will also show up with 
cluster_url=local.
=

# The RDD.groupByKey sometimes gives two results with the same   key value. 
 This is incorrect: all results with a single key need to be grouped together.

# Report the spark version
from pyspark import SparkContext
import StringIO
import csv
sc = SparkContext()
print sc.version


def loadRecord(line):
input = StringIO.StringIO(line)
reader = csv.reader(input, delimiter='\t')
return reader.next()

# Read data from movielens dataset
# This can be obtained from 
http://files.grouplens.org/datasets/movielens/ml-100k.zip
inputFile = 'u.data'
input = sc.textFile(inputFile)
data = input.map(loadRecord)
# Trim off unneeded fields
data = data.map(lambda row: row[0:2])
print 'Data Sample'
print data.take(10)


# Use join to filter the data
#
# map builds left key
# map builds right key
# join
# map throws away the key and gets result

# pick a couple of users
j = sc.parallelize([789, 939])
# key left
# conversion to str is required to show the error
keyed_j = j.map(lambda row: (str(row), None))
# key right
keyed_rdd = data.map(lambda row: (str(row[0]), row))
# join
joined = keyed_rdd.join(keyed_j)
# throw away key
# preservesPartitioning is required to show the error
res = joined.map(lambda row: row[1][0], preservesPartitioning=True)
#res = joined.map(lambda row: row[1][0])  # no error

print 'Filtered Sample'
print res.take(10)
#print res.count()


# Do the groupby
# There should be fewer rows
keyed_rdd = res.map(lambda row: (row[1], row), 
preservesPartitioning=True)
print 'Input Count', keyed_rdd.count()
grouped_rdd = keyed_rdd.groupByKey()
print 'Grouped Count', grouped_rdd.count()


# There are two rows with the same key !
print 'Group Output Sample'
print grouped_rdd.filter(lambda row: row[0] == '508').take(10)


  was:
Under some circumstances, pySpark groupByKey returns two or more rows with the 
same groupby key.

It is not reproducible by a short example, but it can be seen in the following 
program.
The preservesPartitioning argument is required to see the failure.
I ran this  with cluster_url=local[4], but I think it will also show up with 
cluster_url=local.
=

# The RDD.groupByKey sometimes gives two results with the same key value.  This 
is incorrect: all results with a single key need to be grouped together.

# Report the spark version
from pyspark import SparkContext
import StringIO
import csv
sc = SparkContext()
print sc.version


def loadRecord(line):
input = StringIO.StringIO(line)
reader = csv.reader(input, delimiter='\t')
return reader.next()

# Read data from movielens dataset
# This can be obtained from 
http://files.grouplens.org/datasets/movielens/ml-100k.zip
inputFile = 'u.data'
input = sc.textFile(inputFile)
data = input.map(loadRecord)
# Trim off unneeded fields
data = data.map(lambda row: row[0:2])
print 'Data Sample'
print data.take(10)


# Use join to filter the data
#
# map builds left key
# map builds right key
# join
# map throws away the key and gets result

# pick a couple of users
j = sc.parallelize([789, 939])
# key left
# conversion to str is required to show the error
keyed_j = j.map(lambda row: (str(row), None))
# key right
keyed_rdd = data.map(lambda row: (str(row[0]), row))
# join
joined = keyed_rdd.join(keyed_j)
# throw away key
# preservesPartitioning is required to show the error
res = joined.map(lambda row: row[1][0], preservesPartitioning=True)
#res = joined.map(lambda row: row[1][0])  # no error

print 'Filtered Sample'
print res.take(10)
#print res.count()


# Do the groupby
# There should be fewer rows
keyed_rdd = res.map(lambda row: (row[1], row), preservesPartitioning=True)
print 'Input Count', keyed_rdd.count()
grouped_rdd = keyed_rdd.groupByKey()
print 'Grouped Count', grouped_rdd.count()


# There are two rows with the same key !
print 'Group Output Sample'
print grouped_rdd.filter(lambda row: row[0] == '508').take(10)



> pySpark groupByKey returns rows with the same key
> -
>
> Key: SPARK-6792
> URL: https://issues.apache.org/jira/browse/SPARK-6792
> Project: Spark
>  Issu

[jira] [Commented] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact

2015-04-08 Thread Misha Chernetsov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486331#comment-14486331
 ] 

Misha Chernetsov commented on SPARK-4925:
-

[~pwendell] Here you go! https://github.com/apache/spark/pull/5429

> Publish Spark SQL hive-thriftserver maven artifact 
> ---
>
> Key: SPARK-4925
> URL: https://issues.apache.org/jira/browse/SPARK-4925
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 1.2.1, 1.3.0
>Reporter: Alex Liu
>Assignee: Patrick Wendell
>Priority: Critical
>
> The hive-thriftserver maven artifact is needed for integrating Spark SQL with 
> Cassandra.
> Can we publish it to maven?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3987) NNLS generates incorrect result

2015-04-08 Thread Shuo Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486321#comment-14486321
 ] 

Shuo Xiang commented on SPARK-3987:
---

[~debasish83] could you point me to those test cases?

> NNLS generates incorrect result
> ---
>
> Key: SPARK-3987
> URL: https://issues.apache.org/jira/browse/SPARK-3987
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Debasish Das
>Assignee: Shuo Xiang
> Fix For: 1.1.1, 1.2.0
>
>
> Hi,
> Please see the example gram matrix and linear term:
> val P2 = new DoubleMatrix(20, 20, 333907.312770, -60814.043975, 
> 207935.829941, -162881.367739, -43730.396770, 17511.428983, -243340.496449, 
> -225245.957922, 104700.445881, 32430.845099, 336378.693135, -373497.970207, 
> -41147.159621, 53928.060360, -293517.883778, 53105.278068, 0.00, 
> -85257.781696, 84913.970469, -10584.080103, -60814.043975, 13826.806664, 
> -38032.612640, 33475.833875, 10791.916809, -1040.950810, 48106.552472, 
> 45390.073380, -16310.282190, -2861.455903, -60790.833191, 73109.516544, 
> 9826.614644, -8283.992464, 56991.742991, -6171.366034, 0.00, 
> 19152.382499, -13218.721710, 2793.734234, 207935.829941, -38032.612640, 
> 129661.677608, -101682.098412, -27401.299347, 10787.713362, -151803.006149, 
> -140563.601672, 65067.935324, 20031.263383, 209521.268600, -232958.054688, 
> -25764.179034, 33507.951918, -183046.845592, 32884.782835, 0.00, 
> -53315.811196, 52770.762546, -6642.187643, -162881.367739, 33475.833875, 
> -101682.098412, 85094.407608, 25422.850782, -5437.646141, 124197.166330, 
> 116206.265909, -47093.484134, -11420.168521, -163429.436848, 189574.783900, 
> 23447.172314, -24087.375367, 148311.355507, -20848.385466, 0.00, 
> 46835.814559, -38180.352878, 6415.873901, -43730.396770, 10791.916809, 
> -27401.299347, 25422.850782, 8882.869799, 15.638084, 35933.473986, 
> 34186.371325, -10745.330690, -974.314375, -43537.709621, 54371.010558, 
> 7894.453004, -5408.929644, 42231.381747, -3192.010574, 0.00, 
> 15058.753110, -8704.757256, 2316.581535, 17511.428983, -1040.950810, 
> 10787.713362, -5437.646141, 15.638084, 2794.949847, -9681.950987, 
> -8258.171646, 7754.358930, 4193.359412, 18052.143842, -15456.096769, 
> -253.356253, 4089.672804, -12524.380088, 5651.579348, 0.00, -1513.302547, 
> 6296.461898, 152.427321, -243340.496449, 48106.552472, -151803.006149, 
> 124197.166330, 35933.473986, -9681.950987, 182931.600236, 170454.352953, 
> -72361.174145, -19270.461728, -244518.179729, 279551.060579, 33340.452802, 
> -37103.267653, 219025.288975, -33687.141423, 0.00, 67347.950443, 
> -58673.009647, 8957.800259, -225245.957922, 45390.073380, -140563.601672, 
> 116206.265909, 34186.371325, -8258.171646, 170454.352953, 159322.942894, 
> -66074.960534, -16839.743193, -226173.967766, 260421.044094, 31624.194003, 
> -33839.612565, 203889.695169, -30034.828909, 0.00, 63525.040745, 
> -53572.741748, 8575.071847, 104700.445881, -16310.282190, 65067.935324, 
> -47093.484134, -10745.330690, 7754.358930, -72361.174145, -66074.960534, 
> 35869.598076, 13378.653317, 106033.647837, -111831.682883, -10455.465743, 
> 18537.392481, -88370.612394, 20344.288488, 0.00, -22935.482766, 
> 29004.543704, -2409.461759, 32430.845099, -2861.455903, 20031.263383, 
> -11420.168521, -974.314375, 4193.359412, -19270.461728, -16839.743193, 
> 13378.653317, 6802.081898, 33256.395091, -30421.985199, -1296.785870, 
> 7026.518692, -24443.378205, 9221.982599, 0.00, -4088.076871, 
> 10861.014242, -25.092938, 336378.693135, -60790.833191, 209521.268600, 
> -163429.436848, -43537.709621, 18052.143842, -244518.179729, -226173.967766, 
> 106033.647837, 33256.395091, 339200.268106, -375442.716811, -41027.594509, 
> 54636.778527, -295133.248586, 54177.278365, 0.00, -85237.666701, 
> 85996.957056, -10503.209968, -373497.970207, 73109.516544, -232958.054688, 
> 189574.783900, 54371.010558, -15456.096769, 279551.060579, 260421.044094, 
> -111831.682883, -30421.985199, -375442.716811, 427793.208465, 50528.074431, 
> -57375.986301, 335203.382015, -52676.385869, 0.00, 102368.307670, 
> -90679.792485, 13509.390393, -41147.159621, 9826.614644, -25764.179034, 
> 23447.172314, 7894.453004, -253.356253, 33340.452802, 31624.194003, 
> -10455.465743, -1296.785870, -41027.594509, 50528.074431, 7255.977434, 
> -5281.636812, 39298.355527, -3440.450858, 0.00, 13717.870243, 
> -8471.405582, 2071.812204, 53928.060360, -8283.992464, 33507.951918, 
> -24087.375367, -5408.929644, 4089.672804, -37103.267653, -33839.612565, 
> 18537.392481, 7026.518692, 54636.778527, -57375.986301, -5281.636812, 
> 9735.061160, -45360.674033, 10634.633559, 0.00, -11652.364691, 
> 15039.566630, -1202.539106, -293517.883778, 56991.742991, -183046.845592, 
> 

[jira] [Created] (SPARK-6792) pySpark groupByKey returns rows with the same key

2015-04-08 Thread Charles Hayden (JIRA)
Charles Hayden created SPARK-6792:
-

 Summary: pySpark groupByKey returns rows with the same key
 Key: SPARK-6792
 URL: https://issues.apache.org/jira/browse/SPARK-6792
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.3.0
Reporter: Charles Hayden


Under some circumstances, pySpark groupByKey returns two or more rows with the 
same groupby key.

It is not reproducible by a short example, but it can be seen in the following 
program.
The preservesPartitioning argument is required to see the failure.
I ran this  with cluster_url=local[4], but I think it will also show up with 
cluster_url=local.
=

# The RDD.groupByKey sometimes gives two results with the same key value.  This is incorrect: all results with a single key need to be grouped together.

# Report the spark version
from pyspark import SparkContext
import StringIO
import csv
sc = SparkContext()
print sc.version


def loadRecord(line):
input = StringIO.StringIO(line)
reader = csv.reader(input, delimiter='\t')
return reader.next()

# Read data from movielens dataset
# This can be obtained from http://files.grouplens.org/datasets/movielens/ml-100k.zip
inputFile = 'u.data'
input = sc.textFile(inputFile)
data = input.map(loadRecord)
# Trim off unneeded fields
data = data.map(lambda row: row[0:2])
print 'Data Sample'
print data.take(10)


# Use join to filter the data
#
# map builds left key
# map builds right key
# join
# map throws away the key and gets result

# pick a couple of users
j = sc.parallelize([789, 939])
# key left
# conversion to str is required to show the error
keyed_j = j.map(lambda row: (str(row), None))
# key right
keyed_rdd = data.map(lambda row: (str(row[0]), row))
# join
joined = keyed_rdd.join(keyed_j)
# throw away key
# preservesPartitioning is required to show the error
res = joined.map(lambda row: row[1][0], preservesPartitioning=True)
#res = joined.map(lambda row: row[1][0])  # no error

print 'Filtered Sample'
print res.take(10)
#print res.count()


# Do the groupby
# There should be fewer rows
keyed_rdd = res.map(lambda row: (row[1], row), preservesPartitioning=True)
print 'Input Count', keyed_rdd.count()
grouped_rdd = keyed_rdd.groupByKey()
print 'Grouped Count', grouped_rdd.count()


# There are two rows with the same key !
print 'Group Output Sample'
print grouped_rdd.filter(lambda row: row[0] == '508').take(10)
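One plausible reading of the failure, offered as a hedged aside rather than a conclusion from the ticket: preservesPartitioning=True asserts that the map leaves the key-to-partition mapping intact, which lets the later groupByKey reuse the existing partitioner and skip the shuffle; because the maps above replace the key, equal keys can sit in different partitions and surface as separate groups. A minimal sketch of alternatives that avoid making that assertion, reusing the joined RDD from the program above:
{code}
# Sketch only: two ways to avoid claiming an unchanged partitioner when the
# key has in fact changed.  `joined` is the RDD built in the program above.

# 1. Drop preservesPartitioning on any map that rewrites the key.
res = joined.map(lambda row: row[1][0])
keyed_rdd = res.map(lambda row: (row[1], row))

# 2. Keep partitioning only via operations that cannot change the key,
#    e.g. mapValues, which preserves the partitioner by construction.
values_only = joined.mapValues(lambda v: v[0])
{code}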




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6725) Model export/import for Pipeline API

2015-04-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486296#comment-14486296
 ] 

Joseph K. Bradley commented on SPARK-6725:
--

Linked [https://issues.apache.org/jira/browse/SPARK-5874] since it is related, 
but it is unclear to me whether it will affect implementing model import/export.

> Model export/import for Pipeline API
> 
>
> Key: SPARK-6725
> URL: https://issues.apache.org/jira/browse/SPARK-6725
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This is an umbrella JIRA for adding model export/import to the spark.ml API.  
> This JIRA is for adding the internal Saveable/Loadable API and Parquet-based 
> format, not for other formats like PMML.
> This will require the following steps:
> * Add export/import for all PipelineStages supported by spark.ml
> ** This will include some Transformers which are not Models.
> ** These can use almost the same format as the spark.mllib model save/load 
> functions, but the model metadata must store a different class name (marking 
> the class as a spark.ml class).
> * After all PipelineStages support save/load, add an interface which forces 
> future additions to support save/load.
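As a hedged illustration of the "different class name in the metadata" point above, a per-model metadata record might look roughly like the JSON built below; the field names and the example class are assumptions for illustration, not a committed spark.ml format.
{code}
# Illustrative metadata record; field names are assumptions, not the format.
import json

metadata = {
    "class": "org.apache.spark.ml.classification.LogisticRegressionModel",
    "version": "1.0",
    "paramMap": {"maxIter": 100, "regParam": 0.01},
}
print(json.dumps(metadata, indent=2))
{code}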



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6791) Model export/import for spark.ml: meta-algorithms

2015-04-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6791:
-
Description: 
Algorithms: Pipeline, CrossValidator (and associated models)

This task will block on all other subtasks for [SPARK-6725].  This task will 
also include adding export/import as a required part of the PipelineStage 
interface since meta-algorithms will depend on sub-algorithms supporting 
save/load.

> Model export/import for spark.ml: meta-algorithms
> -
>
> Key: SPARK-6791
> URL: https://issues.apache.org/jira/browse/SPARK-6791
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> Algorithms: Pipeline, CrossValidator (and associated models)
> This task will block on all other subtasks for [SPARK-6725].  This task will 
> also include adding export/import as a required part of the PipelineStage 
> interface since meta-algorithms will depend on sub-algorithms supporting 
> save/load.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6791) Model export/import for spark.ml: meta-algorithms

2015-04-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6791:
-
Summary: Model export/import for spark.ml: meta-algorithms  (was: Model 
export/import for spark.ml: CrossValidator)

> Model export/import for spark.ml: meta-algorithms
> -
>
> Key: SPARK-6791
> URL: https://issues.apache.org/jira/browse/SPARK-6791
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6791) Model export/import for spark.ml: CrossValidator

2015-04-08 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-6791:


 Summary: Model export/import for spark.ml: CrossValidator
 Key: SPARK-6791
 URL: https://issues.apache.org/jira/browse/SPARK-6791
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6789) Model export/import for spark.ml: ALS

2015-04-08 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-6789:


 Summary: Model export/import for spark.ml: ALS
 Key: SPARK-6789
 URL: https://issues.apache.org/jira/browse/SPARK-6789
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6790) Model export/import for spark.ml: LinearRegression

2015-04-08 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-6790:


 Summary: Model export/import for spark.ml: LinearRegression
 Key: SPARK-6790
 URL: https://issues.apache.org/jira/browse/SPARK-6790
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6788) Model export/import for spark.ml: Tokenizer

2015-04-08 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-6788:


 Summary: Model export/import for spark.ml: Tokenizer
 Key: SPARK-6788
 URL: https://issues.apache.org/jira/browse/SPARK-6788
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6787) Model export/import for spark.ml: StandardScaler

2015-04-08 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-6787:


 Summary: Model export/import for spark.ml: StandardScaler
 Key: SPARK-6787
 URL: https://issues.apache.org/jira/browse/SPARK-6787
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6786) Model export/import for spark.ml: Normalizer

2015-04-08 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-6786:


 Summary: Model export/import for spark.ml: Normalizer
 Key: SPARK-6786
 URL: https://issues.apache.org/jira/browse/SPARK-6786
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5114) Should Evaluator be a PipelineStage

2015-04-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5114:
-
Description: 
Pipelines can currently contain Estimators and Transformers.

Question for debate: Should Pipelines be able to contain Evaluators?

Pros:
* Schema check: Evaluators take input datasets with particular schema, which 
should perhaps be checked before running a Pipeline.
* Intermediate results:
** If a Transformer removes a column (which is not done by built-in 
Transformers currently but might be reasonable in the future), then the user 
can never evaluate that column.  (However, users could keep all columns around.)
** If users have to evaluate after running a Pipeline, then each evaluated 
column may have to be re-materialized.

Cons:
* Evaluators do not transform datasets.   They produce a scalar (or a few 
values), which makes it hard to say how they fit into a Pipeline or a 
PipelineModel.


  was:
Pipelines can currently contain Estimators and Transformers.

Question for debate: Should Pipelines be able to contain Evaluators?

Pros:
* Evaluators take input datasets with particular schema, which should perhaps 
be checked before running a Pipeline.

Cons:
* Evaluators do not transform datasets.   They produce a scalar (or a few 
values), which makes it hard to say how they fit into a Pipeline or a 
PipelineModel.


> Should Evaluator be a PipelineStage
> ---
>
> Key: SPARK-5114
> URL: https://issues.apache.org/jira/browse/SPARK-5114
> Project: Spark
>  Issue Type: Question
>  Components: ML
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> Pipelines can currently contain Estimators and Transformers.
> Question for debate: Should Pipelines be able to contain Evaluators?
> Pros:
> * Schema check: Evaluators take input datasets with particular schema, which 
> should perhaps be checked before running a Pipeline.
> * Intermediate results:
> ** If a Transformer removes a column (which is not done by built-in 
> Transformers currently but might be reasonable in the future), then the user 
> can never evaluate that column.  (However, users could keep all columns 
> around.)
> ** If users have to evaluate after running a Pipeline, then each evaluated 
> column may have to be re-materialized.
> Cons:
> * Evaluators do not transform datasets.   They produce a scalar (or a few 
> values), which makes it hard to say how they fit into a Pipeline or a 
> PipelineModel.
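The last "Cons" bullet is essentially a type mismatch, sketched below with illustrative signatures (these are not the actual spark.ml classes): a Transformer maps a dataset to a dataset, so stages compose, while an Evaluator maps a dataset to a single number, leaving nothing for a downstream stage to consume.
{code}
# Illustrative signatures only, not the actual spark.ml classes.
class Transformer(object):
    def transform(self, dataset):
        return dataset          # dataset in, dataset out: stages compose

class Evaluator(object):
    def evaluate(self, dataset):
        return 0.85             # dataset in, scalar out: nothing to chain
{code}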



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5114) Should Evaluator be a PipelineStage

2015-04-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5114:
-
Description: 
Pipelines can currently contain Estimators and Transformers.

Question for debate: Should Pipelines be able to contain Evaluators?

Pros:
* Schema check: Evaluators take input datasets with particular schema, which 
should perhaps be checked before running a Pipeline.
* Intermediate results:
** If a Transformer removes a column (which is not done by built-in 
Transformers currently but might be reasonable in the future), then the user 
can never evaluate that column.  (However, users could keep all columns around.)
** If users have to evaluate after running a Pipeline, then each evaluated 
column may have to be re-materialized.

Cons:
* API: Evaluators do not transform datasets.   They produce a scalar (or a few 
values), which makes it hard to say how they fit into a Pipeline or a 
PipelineModel.


  was:
Pipelines can currently contain Estimators and Transformers.

Question for debate: Should Pipelines be able to contain Evaluators?

Pros:
* Schema check: Evaluators take input datasets with particular schema, which 
should perhaps be checked before running a Pipeline.
* Intermediate results:
** If a Transformer removes a column (which is not done by built-in 
Transformers currently but might be reasonable in the future), then the user 
can never evaluate that column.  (However, users could keep all columns around.)
** If users have to evaluate after running a Pipeline, then each evaluated 
column may have to be re-materialized.

Cons:
* Evaluators do not transform datasets.   They produce a scalar (or a few 
values), which makes it hard to say how they fit into a Pipeline or a 
PipelineModel.



> Should Evaluator be a PipelineStage
> ---
>
> Key: SPARK-5114
> URL: https://issues.apache.org/jira/browse/SPARK-5114
> Project: Spark
>  Issue Type: Question
>  Components: ML
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> Pipelines can currently contain Estimators and Transformers.
> Question for debate: Should Pipelines be able to contain Evaluators?
> Pros:
> * Schema check: Evaluators take input datasets with particular schema, which 
> should perhaps be checked before running a Pipeline.
> * Intermediate results:
> ** If a Transformer removes a column (which is not done by built-in 
> Transformers currently but might be reasonable in the future), then the user 
> can never evaluate that column.  (However, users could keep all columns 
> around.)
> ** If users have to evaluate after running a Pipeline, then each evaluated 
> column may have to be re-materialized.
> Cons:
> * API: Evaluators do not transform datasets.   They produce a scalar (or a 
> few values), which makes it hard to say how they fit into a Pipeline or a 
> PipelineModel.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5256) Improving MLlib optimization APIs

2015-04-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486206#comment-14486206
 ] 

Joseph K. Bradley edited comment on SPARK-5256 at 4/8/15 10:41 PM:
---

*Q*: Should Optimizer store Gradient and Updater?

*Proposal*: No.  Gradient and Updater (regularization type) are model 
parameters, not Optimizer parameters.  The generalized linear algorithm should 
take Optimizer, Gradient, and regularization type as separate parameters.  
Internally, the GLM can pass the Gradient and reg type to the Optimizer, either 
as method parameters:
{code}
Optimizer.step(currentWeights, gradient, regType)
{code}
or by constructing a specific optimizer
{code}
val optimizer = new Optimizer(gradient, regType)
newWeights = optimizer.step(currentWeights)
{code}

*Another note*: [~avulanov] pointed out in the dev list that, in general, the 
Gradient and Updater do need to be tightly coupled so that both know which 
weight is the intercept/bias term (and not regularized).  If the GLM takes both 
as parameters as in this proposal, it could be responsible for informing the 
Gradient and Updater of which weight is the intercept.
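A hedged sketch of the "GLM informs both components" idea in the note above; the intercept_index parameter and these bare class shells are hypothetical, not MLlib's API.
{code}
# Hypothetical shapes: the GLM tells both the gradient and the updater
# which weight is the intercept so the regularizer can skip it.
class Gradient(object):
    def __init__(self, intercept_index=None):
        self.intercept_index = intercept_index

class Updater(object):
    def __init__(self, intercept_index=None):
        self.intercept_index = intercept_index  # this weight is not penalized

def build_components(num_features, fit_intercept):
    idx = num_features - 1 if fit_intercept else None
    return Gradient(intercept_index=idx), Updater(intercept_index=idx)
{code}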


was (Author: josephkb):
*Q*: Should Optimizer store Gradient and Updater?

*Proposal*: No.  Gradient and Updater (regularization type) are model 
parameters, not Optimizer parameters.  The generalized linear algorithm should 
take Optimizer, Gradient, and regularization type as separate parameters.  
Internally, the GLM can pass the Gradient and reg type to the Optimizer, either 
as method parameters:
{code}
Optimizer.step(currentWeights, gradient, regType)
{code}
or by constructing a specific optimizer
{code}
val optimizer = new Optimizer(gradient, regType)
newWeights = optimizer.step(currentWeights)
{code}

> Improving MLlib optimization APIs
> -
>
> Key: SPARK-5256
> URL: https://issues.apache.org/jira/browse/SPARK-5256
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> *Goal*: Improve APIs for optimization
> *Motivation*: There have been several disjoint mentions of improving the 
> optimization APIs to make them more pluggable, extensible, etc.  This JIRA is 
> a place to discuss what API changes are necessary for the long term, and to 
> provide links to other relevant JIRAs.
> Eventually, I hope this leads to a design doc outlining:
> * current issues
> * requirements such as supporting many types of objective functions, 
> optimization algorithms, and parameters to those algorithms
> * ideal API
> * breakdown of smaller JIRAs needed to achieve that API
> I will soon create an initial design doc, and I will try to watch this JIRA 
> and include ideas from JIRA comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6752) Allow StreamingContext to be recreated from checkpoint and existing SparkContext

2015-04-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6752:
---

Assignee: Tathagata Das  (was: Apache Spark)

> Allow StreamingContext to be recreated from checkpoint and existing 
> SparkContext
> 
>
> Key: SPARK-6752
> URL: https://issues.apache.org/jira/browse/SPARK-6752
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.1.1, 1.2.1, 1.3.1
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
>
> Currently if you want to create a StreamingContext from checkpoint 
> information, the system will create a new SparkContext. This prevents 
> StreamingContext from being recreated from checkpoints in managed environments 
> where SparkContext is precreated.
> Proposed solution: Introduce the following methods on StreamingContext
> 1. {{new StreamingContext(checkpointDirectory, sparkContext)}}
> - Recreate StreamingContext from checkpoint using the provided SparkContext
> 2. {{new StreamingContext(checkpointDirectory, hadoopConf, sparkContext)}}
> - Recreate StreamingContext from checkpoint using the provided SparkContext 
> and hadoop conf to read the checkpoint
> 3. {{StreamingContext.getOrCreate(checkpointDirectory, sparkContext, 
> createFunction: SparkContext => StreamingContext)}}
> - If checkpoint file exists, then recreate StreamingContext using the 
> provided SparkContext (that is, call 1.), else create StreamingContext using 
> the provided createFunction
> Also, the corresponding Java and Python APIs have to be added as well. 
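The snippet below sketches how proposed method 3 might be used; it is pseudocode for the proposal itself, not an existing PySpark call, and checkpoint_dir, batch_interval, spark_context, and the setup function are placeholder names.
{code}
# Pseudocode for the *proposed* getOrCreate overload (method 3 above);
# this three-argument signature does not exist yet.
def create_context(spark_context):
    ssc = StreamingContext(spark_context, batch_interval)
    # ... define DStreams here ...
    ssc.checkpoint(checkpoint_dir)
    return ssc

ssc = StreamingContext.getOrCreate(checkpoint_dir, spark_context, create_context)
{code}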



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6752) Allow StreamingContext to be recreated from checkpoint and existing SparkContext

2015-04-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6752:
---

Assignee: Apache Spark  (was: Tathagata Das)

> Allow StreamingContext to be recreated from checkpoint and existing 
> SparkContext
> 
>
> Key: SPARK-6752
> URL: https://issues.apache.org/jira/browse/SPARK-6752
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.1.1, 1.2.1, 1.3.1
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Critical
>
> Currently if you want to create a StreamingContext from checkpoint 
> information, the system will create a new SparkContext. This prevents 
> StreamingContext from being recreated from checkpoints in managed environments 
> where SparkContext is precreated.
> Proposed solution: Introduce the following methods on StreamingContext
> 1. {{new StreamingContext(checkpointDirectory, sparkContext)}}
> - Recreate StreamingContext from checkpoint using the provided SparkContext
> 2. {{new StreamingContext(checkpointDirectory, hadoopConf, sparkContext)}}
> - Recreate StreamingContext from checkpoint using the provided SparkContext 
> and hadoop conf to read the checkpoint
> 3. {{StreamingContext.getOrCreate(checkpointDirectory, sparkContext, 
> createFunction: SparkContext => StreamingContext)}}
> - If checkpoint file exists, then recreate StreamingContext using the 
> provided SparkContext (that is, call 1.), else create StreamingContext using 
> the provided createFunction
> Also, the corresponding Java and Python APIs have to be added as well. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6752) Allow StreamingContext to be recreated from checkpoint and existing SparkContext

2015-04-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486228#comment-14486228
 ] 

Apache Spark commented on SPARK-6752:
-

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/5428

> Allow StreamingContext to be recreated from checkpoint and existing 
> SparkContext
> 
>
> Key: SPARK-6752
> URL: https://issues.apache.org/jira/browse/SPARK-6752
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.1.1, 1.2.1, 1.3.1
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
>
> Currently, if you want to create a StreamingContext from checkpoint 
> information, the system will create a new SparkContext. This prevents the 
> StreamingContext from being recreated from checkpoints in managed 
> environments where the SparkContext is pre-created.
> Proposed solution: Introduce the following methods on StreamingContext
> 1. {{new StreamingContext(checkpointDirectory, sparkContext)}}
> - Recreate StreamingContext from checkpoint using the provided SparkContext
> 2. {{new StreamingContext(checkpointDirectory, hadoopConf, sparkContext)}}
> - Recreate StreamingContext from checkpoint using the provided SparkContext 
> and hadoop conf to read the checkpoint
> 3. {{StreamingContext.getOrCreate(checkpointDirectory, sparkContext, 
> createFunction: SparkContext => StreamingContext)}}
> - If checkpoint file exists, then recreate StreamingContext using the 
> provided SparkContext (that is, call 1.), else create StreamingContext using 
> the provided createFunction
> The corresponding Java and Python APIs have to be added as well. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6785) DateUtils can not handle date before 1970/01/01 correctly

2015-04-08 Thread Davies Liu (JIRA)
Davies Liu created SPARK-6785:
-

 Summary: DateUtils can not handle date before 1970/01/01 correctly
 Key: SPARK-6785
 URL: https://issues.apache.org/jira/browse/SPARK-6785
 Project: Spark
  Issue Type: Bug
Reporter: Davies Liu


{code}
scala> val d = new Date(100)
d: java.sql.Date = 1969-12-31

scala> DateUtils.toJavaDate(DateUtils.fromJavaDate(d))
res1: java.sql.Date = 1970-01-01

{code}
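
For context on why the round-trip lands on 1970-01-01, here is a minimal, self-contained sketch of the likely failure mode: converting a negative millisecond offset to days with division that truncates toward zero instead of flooring. The helper names below are illustrative, and the real DateUtils conversion also involves the local time zone offset.

{code}
object NegativeDateTruncation {
  val MillisPerDay: Long = 24L * 60 * 60 * 1000

  // Truncating division: an instant a fraction of a day before the epoch
  // (e.g. -0.33 days) collapses onto day 0, i.e. 1970-01-01.
  def daysTruncated(millis: Long): Int = (millis / MillisPerDay).toInt

  // Floor division keeps pre-epoch instants on the previous day:
  // -0.33 days becomes day -1, i.e. 1969-12-31.
  def daysFloored(millis: Long): Int =
    math.floor(millis.toDouble / MillisPerDay).toInt

  def main(args: Array[String]): Unit = {
    val millis = -8L * 60 * 60 * 1000   // 16:00 on 1969-12-31 (UTC)
    println(daysTruncated(millis))      // 0  -> round-trips to 1970-01-01
    println(daysFloored(millis))        // -1 -> round-trips to 1969-12-31
  }
}
{code}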



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5969) The pyspark.rdd.sortByKey always fills only two partitions when ascending=False.

2015-04-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5969:
---

Assignee: Apache Spark

> The pyspark.rdd.sortByKey always fills only two partitions when 
> ascending=False.
> 
>
> Key: SPARK-5969
> URL: https://issues.apache.org/jira/browse/SPARK-5969
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.2.1
> Environment: Linux, 64bit
>Reporter: Milan Straka
>Assignee: Apache Spark
>
> The pyspark.rdd.sortByKey always fills only two partitions when 
> ascending=False -- the first one and the last one.
> Simple example sorting numbers 0..999 into 10 partitions in descending order:
> {code}
> sc.parallelize(range(1000)).keyBy(lambda x:x).sortByKey(ascending=False, 
> numPartitions=10).glom().map(len).collect()
> {code}
> returns the following partition sizes:
> {code}
> [469, 0, 0, 0, 0, 0, 0, 0, 0, 531]
> {code}
> When ascending=True, all works as expected:
> {code}
> sc.parallelize(range(1000)).keyBy(lambda x:x).sortByKey(ascending=True, 
> numPartitions=10).glom().map(len).collect()
> Out: [116, 96, 100, 87, 132, 101, 101, 95, 87, 85]
> {code}
> The problem is caused by the following line 565 in rdd.py:
> {code}
> samples = sorted(samples, reverse=(not ascending), key=keyfunc)
> {code}
> That sorts the samples in descending order if ascending=False. Nevertheless, 
> the samples should always be in ascending order, because they are (after 
> being subsampled to form the variable bounds) used in a bisect_left call:
> {code}
> def rangePartitioner(k):
> p = bisect.bisect_left(bounds, keyfunc(k))
> if ascending:
> return p
> else:
> return numPartitions - 1 - p
> {code}
> As you can see, rangePartitioner already handles ascending=False by itself, 
> so the fix for the whole problem is really trivial: just change rdd.py line 
> 565 to
> {{samples = sorted(samples, key=keyfunc)}} (i.e., drop the 
> {{reverse=(not ascending)}} argument)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5969) The pyspark.rdd.sortByKey always fills only two partitions when ascending=False.

2015-04-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5969:
---

Assignee: (was: Apache Spark)

> The pyspark.rdd.sortByKey always fills only two partitions when 
> ascending=False.
> 
>
> Key: SPARK-5969
> URL: https://issues.apache.org/jira/browse/SPARK-5969
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.2.1
> Environment: Linux, 64bit
>Reporter: Milan Straka
>
> The pyspark.rdd.sortByKey always fills only two partitions when 
> ascending=False -- the first one and the last one.
> Simple example sorting numbers 0..999 into 10 partitions in descending order:
> {code}
> sc.parallelize(range(1000)).keyBy(lambda x:x).sortByKey(ascending=False, 
> numPartitions=10).glom().map(len).collect()
> {code}
> returns the following partition sizes:
> {code}
> [469, 0, 0, 0, 0, 0, 0, 0, 0, 531]
> {code}
> When ascending=True, all works as expected:
> {code}
> sc.parallelize(range(1000)).keyBy(lambda x:x).sortByKey(ascending=True, 
> numPartitions=10).glom().map(len).collect()
> Out: [116, 96, 100, 87, 132, 101, 101, 95, 87, 85]
> {code}
> The problem is caused by the following line 565 in rdd.py:
> {code}
> samples = sorted(samples, reverse=(not ascending), key=keyfunc)
> {code}
> That sorts the samples in descending order if ascending=False. Nevertheless, 
> the samples should always be in ascending order, because they are (after 
> being subsampled to form the variable bounds) used in a bisect_left call:
> {code}
> def rangePartitioner(k):
> p = bisect.bisect_left(bounds, keyfunc(k))
> if ascending:
> return p
> else:
> return numPartitions - 1 - p
> {code}
> As you can see, rangePartitioner already handles ascending=False by itself, 
> so the fix for the whole problem is really trivial: just change rdd.py line 
> 565 to
> {{samples = sorted(samples, key=keyfunc)}} (i.e., drop the 
> {{reverse=(not ascending)}} argument)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6479) Create off-heap block storage API (internal)

2015-04-08 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486218#comment-14486218
 ] 

Reynold Xin commented on SPARK-6479:


Looks good. Please submit a pull request. Thanks.

> Create off-heap block storage API (internal)
> 
>
> Key: SPARK-6479
> URL: https://issues.apache.org/jira/browse/SPARK-6479
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Reporter: Reynold Xin
> Attachments: SPARK-6479.pdf, SPARK-6479OffheapAPIdesign (1).pdf, 
> SPARK-6479OffheapAPIdesign.pdf, SparkOffheapsupportbyHDFS.pdf
>
>
> Would be great to create APIs for off-heap block stores, rather than doing a 
> bunch of if statements everywhere.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs

2015-04-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486206#comment-14486206
 ] 

Joseph K. Bradley commented on SPARK-5256:
--

*Q*: Should Optimizer store Gradient and Updater?

*Proposal*: No.  Gradient and Updater (regularization type) are model 
parameters, not Optimizer parameters.  The generalized linear algorithm should 
take Optimizer, Gradient, and regularization type as separate parameters.  
Internally, the GLM can pass the Gradient and reg type to the Optimizer, either 
as method parameters:
{code}
Optimizer.step(currentWeights, gradient, regType)
{code}
or by constructing a specific optimizer
{code}
val optimizer = new Optimizer(gradient, regType)
newWeights = optimizer.step(currentWeights)
{code}
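
To make the first option above concrete, here is a hedged Scala sketch of an Optimizer that receives the gradient and regularization type per step. All trait and class names are illustrative assumptions for this discussion, not existing MLlib types.

{code}
// Illustrative only: these names do not correspond to existing MLlib classes.
sealed trait RegType
case object NoReg extends RegType
case object L1Reg extends RegType
case object L2Reg extends RegType

trait Gradient {
  /** Gradient of the loss at the given weights. */
  def compute(weights: Array[Double]): Array[Double]
}

trait Optimizer {
  /** One step; the GLM passes the gradient and regularization type per call. */
  def step(weights: Array[Double], gradient: Gradient, regType: RegType): Array[Double]
}

class SimpleGradientDescent(stepSize: Double, regParam: Double) extends Optimizer {
  override def step(weights: Array[Double],
                    gradient: Gradient,
                    regType: RegType): Array[Double] = {
    val g = gradient.compute(weights)
    weights.zip(g).map { case (w, gi) =>
      val updated = w - stepSize * gi
      regType match {
        case L2Reg => updated - stepSize * regParam * w              // shrink toward zero
        case L1Reg =>                                                // soft-thresholding
          math.signum(updated) * math.max(math.abs(updated) - stepSize * regParam, 0.0)
        case NoReg => updated
      }
    }
  }
}
{code}

Under this shape, the second option (constructing the optimizer with the gradient and reg type) is just a matter of moving those two arguments from step() to the constructor.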

> Improving MLlib optimization APIs
> -
>
> Key: SPARK-5256
> URL: https://issues.apache.org/jira/browse/SPARK-5256
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> *Goal*: Improve APIs for optimization
> *Motivation*: There have been several disjoint mentions of improving the 
> optimization APIs to make them more pluggable, extensible, etc.  This JIRA is 
> a place to discuss what API changes are necessary for the long term, and to 
> provide links to other relevant JIRAs.
> Eventually, I hope this leads to a design doc outlining:
> * current issues
> * requirements such as supporting many types of objective functions, 
> optimization algorithms, and parameters to those algorithms
> * ideal API
> * breakdown of smaller JIRAs needed to achieve that API
> I will soon create an initial design doc, and I will try to watch this JIRA 
> and include ideas from JIRA comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs

2015-04-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486196#comment-14486196
 ] 

Joseph K. Bradley commented on SPARK-5256:
--

*Q*: Do we like the "Updater" concept?

*Proposal*: No.  It conflates the regularization type with the 
regularization-related update.  The regularization type should be a model 
parameter.  The update function should depend on the model's regularization 
type and the optimizer.  There are only two such update functions we need 
currently: (sub)gradient step (for L1 or L2) and projection (for L1).  We could 
add more later.

> Improving MLlib optimization APIs
> -
>
> Key: SPARK-5256
> URL: https://issues.apache.org/jira/browse/SPARK-5256
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> *Goal*: Improve APIs for optimization
> *Motivation*: There have been several disjoint mentions of improving the 
> optimization APIs to make them more pluggable, extensible, etc.  This JIRA is 
> a place to discuss what API changes are necessary for the long term, and to 
> provide links to other relevant JIRAs.
> Eventually, I hope this leads to a design doc outlining:
> * current issues
> * requirements such as supporting many types of objective functions, 
> optimization algorithms, and parameters to those algorithms
> * ideal API
> * breakdown of smaller JIRAs needed to achieve that API
> I will soon create an initial design doc, and I will try to watch this JIRA 
> and include ideas from JIRA comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs

2015-04-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486184#comment-14486184
 ] 

Joseph K. Bradley commented on SPARK-5256:
--

(Comment related to link to [SPARK-6682]) Builder methods for GLMs have issues 
because of the Optimizer API.  (See discussion above: The constructors for 
Optimizer require Gradient and Updater.)  Cleaning up the Optimizer API will 
facilitate moving to the builder API.

> Improving MLlib optimization APIs
> -
>
> Key: SPARK-5256
> URL: https://issues.apache.org/jira/browse/SPARK-5256
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> *Goal*: Improve APIs for optimization
> *Motivation*: There have been several disjoint mentions of improving the 
> optimization APIs to make them more pluggable, extensible, etc.  This JIRA is 
> a place to discuss what API changes are necessary for the long term, and to 
> provide links to other relevant JIRAs.
> Eventually, I hope this leads to a design doc outlining:
> * current issues
> * requirements such as supporting many types of objective functions, 
> optimization algorithms, and parameters to those algorithms
> * ideal API
> * breakdown of smaller JIRAs needed to achieve that API
> I will soon create an initial design doc, and I will try to watch this JIRA 
> and include ideas from JIRA comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java

2015-04-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486183#comment-14486183
 ] 

Joseph K. Bradley commented on SPARK-6682:
--

Builder methods for GLMs have issues because of the Optimizer API.  (See 
discussion above: The constructors for Optimizer require Gradient and Updater.) 
 Cleaning up the Optimizer API will facilitate moving to the builder API.

> Deprecate static train and use builder instead for Scala/Java
> -
>
> Key: SPARK-6682
> URL: https://issues.apache.org/jira/browse/SPARK-6682
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> In MLlib, we have for some time been unofficially moving away from the old 
> static train() methods and moving towards builder patterns.  This JIRA is to 
> discuss this move and (hopefully) make it official.
> "Old static train()" API:
> {code}
> val myModel = NaiveBayes.train(myData, ...)
> {code}
> "New builder pattern" API:
> {code}
> val nb = new NaiveBayes().setLambda(0.1)
> val myModel = nb.train(myData)
> {code}
> Pros of the builder pattern:
> * Much less code when algorithms have many parameters.  Since Java does not 
> support default arguments, we required *many* duplicated static train() 
> methods (for each prefix set of arguments).
> * Helps to enforce default parameters.  Users should ideally not have to even 
> think about setting parameters if they just want to try an algorithm quickly.
> * Matches spark.ml API
> Cons of the builder pattern:
> * In Python APIs, static train methods are more "Pythonic."
> Proposal:
> * Scala/Java: We should start deprecating the old static train() methods.  We 
> must keep them for API stability, but deprecating will help with API 
> consistency, making it clear that everyone should use the builder pattern.  
> As we deprecate them, we should make sure that the builder pattern supports 
> all parameters.
> * Python: Keep static train methods.
> CC: [~mengxr]
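
As a hedged illustration of the first "pro" above (the overload explosion that static train() forces when defaults cannot be expressed for Java callers), the hypothetical ExampleAlgo below contrasts the two styles; it is not an existing MLlib class and the training logic is elided.

{code}
class ExampleAlgoModel  // placeholder model, for illustration only

// Static-train style: every prefix of the parameter list needs its own overload.
object ExampleAlgo {
  def train(data: Seq[Double]): ExampleAlgoModel = train(data, 100)
  def train(data: Seq[Double], maxIter: Int): ExampleAlgoModel =
    train(data, maxIter, 0.0)
  def train(data: Seq[Double], maxIter: Int, regParam: Double): ExampleAlgoModel =
    new ExampleAlgoModel  // actual training elided
}

// Builder style: defaults live in one place; callers set only what they need.
class ExampleAlgo {
  private var maxIter: Int = 100
  private var regParam: Double = 0.0

  def setMaxIter(value: Int): this.type = { maxIter = value; this }
  def setRegParam(value: Double): this.type = { regParam = value; this }

  def run(data: Seq[Double]): ExampleAlgoModel =
    new ExampleAlgoModel  // actual training (using maxIter and regParam) elided
}
{code}

{{new ExampleAlgo().setMaxIter(50).run(data)}} then reads the same way in Scala and Java, while the static side keeps growing with every new parameter.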



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java

2015-04-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486183#comment-14486183
 ] 

Joseph K. Bradley edited comment on SPARK-6682 at 4/8/15 10:15 PM:
---

(Comment related to link to [SPARK-5256]) Builder methods for GLMs have issues 
because of the Optimizer API.  (See discussion above: The constructors for 
Optimizer require Gradient and Updater.)  Cleaning up the Optimizer API will 
facilitate moving to the builder API.


was (Author: josephkb):
Builder methods for GLMs have issues because of the Optimizer API.  (See 
discussion above: The constructors for Optimizer require Gradient and Updater.) 
 Cleaning up the Optimizer API will facilitate moving to the builder API.

> Deprecate static train and use builder instead for Scala/Java
> -
>
> Key: SPARK-6682
> URL: https://issues.apache.org/jira/browse/SPARK-6682
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> In MLlib, we have for some time been unofficially moving away from the old 
> static train() methods and moving towards builder patterns.  This JIRA is to 
> discuss this move and (hopefully) make it official.
> "Old static train()" API:
> {code}
> val myModel = NaiveBayes.train(myData, ...)
> {code}
> "New builder pattern" API:
> {code}
> val nb = new NaiveBayes().setLambda(0.1)
> val myModel = nb.train(myData)
> {code}
> Pros of the builder pattern:
> * Much less code when algorithms have many parameters.  Since Java does not 
> support default arguments, we required *many* duplicated static train() 
> methods (for each prefix set of arguments).
> * Helps to enforce default parameters.  Users should ideally not have to even 
> think about setting parameters if they just want to try an algorithm quickly.
> * Matches spark.ml API
> Cons of the builder pattern:
> * In Python APIs, static train methods are more "Pythonic."
> Proposal:
> * Scala/Java: We should start deprecating the old static train() methods.  We 
> must keep them for API stability, but deprecating will help with API 
> consistency, making it clear that everyone should use the builder pattern.  
> As we deprecate them, we should make sure that the builder pattern supports 
> all parameters.
> * Python: Keep static train methods.
> CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java

2015-04-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486178#comment-14486178
 ] 

Joseph K. Bradley edited comment on SPARK-6682 at 4/8/15 10:12 PM:
---

*Optimization*: I agree that this JIRA will be (or should be) blocked by 
updates to the optimization API.  It is getting to be high time to fix that.  
I'll link this to [SPARK-5256] and will add my thoughts there.

*Splitting into sub-tasks*: I think this should be split into subtasks per 
algorithm, rather than splitting up deprecation/example/documentation.  That 
way, each subtask makes a consistent change but should be sufficiently small.


was (Author: josephkb):
*Optimization*: I agree that this JIRA will be blocked by updates to the 
optimization API.  It is getting to be high time to fix that.  I'll link this 
to [SPARK-5256] and will add my thoughts there.

*Splitting into sub-tasks*: I think this should be split into subtasks per 
algorithm, rather than splitting up deprecation/example/documentation.  That 
way, each subtask makes a consistent change but should be sufficiently small.

> Deprecate static train and use builder instead for Scala/Java
> -
>
> Key: SPARK-6682
> URL: https://issues.apache.org/jira/browse/SPARK-6682
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> In MLlib, we have for some time been unofficially moving away from the old 
> static train() methods and moving towards builder patterns.  This JIRA is to 
> discuss this move and (hopefully) make it official.
> "Old static train()" API:
> {code}
> val myModel = NaiveBayes.train(myData, ...)
> {code}
> "New builder pattern" API:
> {code}
> val nb = new NaiveBayes().setLambda(0.1)
> val myModel = nb.train(myData)
> {code}
> Pros of the builder pattern:
> * Much less code when algorithms have many parameters.  Since Java does not 
> support default arguments, we required *many* duplicated static train() 
> methods (for each prefix set of arguments).
> * Helps to enforce default parameters.  Users should ideally not have to even 
> think about setting parameters if they just want to try an algorithm quickly.
> * Matches spark.ml API
> Cons of the builder pattern:
> * In Python APIs, static train methods are more "Pythonic."
> Proposal:
> * Scala/Java: We should start deprecating the old static train() methods.  We 
> must keep them for API stability, but deprecating will help with API 
> consistency, making it clear that everyone should use the builder pattern.  
> As we deprecate them, we should make sure that the builder pattern supports 
> all parameters.
> * Python: Keep static train methods.
> CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java

2015-04-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486178#comment-14486178
 ] 

Joseph K. Bradley commented on SPARK-6682:
--

*Optimization*: I agree that this JIRA will be blocked by updates to the 
optimization API.  It is getting to be high time to fix that.  I'll link this 
to [SPARK-5256] and will add my thoughts there.

*Splitting into sub-tasks*: I think this should be split into subtasks per 
algorithm, rather than splitting up deprecation/example/documentation.  That 
way, each subtask makes a consistent change but should be sufficiently small.

> Deprecate static train and use builder instead for Scala/Java
> -
>
> Key: SPARK-6682
> URL: https://issues.apache.org/jira/browse/SPARK-6682
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> In MLlib, we have for some time been unofficially moving away from the old 
> static train() methods and moving towards builder patterns.  This JIRA is to 
> discuss this move and (hopefully) make it official.
> "Old static train()" API:
> {code}
> val myModel = NaiveBayes.train(myData, ...)
> {code}
> "New builder pattern" API:
> {code}
> val nb = new NaiveBayes().setLambda(0.1)
> val myModel = nb.train(myData)
> {code}
> Pros of the builder pattern:
> * Much less code when algorithms have many parameters.  Since Java does not 
> support default arguments, we required *many* duplicated static train() 
> methods (for each prefix set of arguments).
> * Helps to enforce default parameters.  Users should ideally not have to even 
> think about setting parameters if they just want to try an algorithm quickly.
> * Matches spark.ml API
> Cons of the builder pattern:
> * In Python APIs, static train methods are more "Pythonic."
> Proposal:
> * Scala/Java: We should start deprecating the old static train() methods.  We 
> must keep them for API stability, but deprecating will help with API 
> consistency, making it clear that everyone should use the builder pattern.  
> As we deprecate them, we should make sure that the builder pattern supports 
> all parameters.
> * Python: Keep static train methods.
> CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6784) Clean up all the inbound/outbound conversions for DateType

2015-04-08 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-6784:
--
Target Version/s: 1.4.0

> Clean up all the inbound/outbound conversions for DateType
> --
>
> Key: SPARK-6784
> URL: https://issues.apache.org/jira/browse/SPARK-6784
> Project: Spark
>  Issue Type: Bug
>Reporter: Davies Liu
>Priority: Blocker
>
> We changed the JvmType of DateType to Int, but there are still some places 
> putting java.sql.Date into Row.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6784) Clean up all the inbound/outbound conversions for DateType

2015-04-08 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-6784:
--
Priority: Blocker  (was: Major)

> Clean up all the inbound/outbound conversions for DateType
> --
>
> Key: SPARK-6784
> URL: https://issues.apache.org/jira/browse/SPARK-6784
> Project: Spark
>  Issue Type: Bug
>Reporter: Davies Liu
>Priority: Blocker
>
> We changed the JvmType of DateType to Int, but there are still some places 
> putting java.sql.Date into Row.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6479) Create off-heap block storage API (internal)

2015-04-08 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486143#comment-14486143
 ] 

Zhan Zhang commented on SPARK-6479:
---

[~rxin] I updated the doc. If you think the overall design is OK, I will upload 
the initial patch for this JIRA soon. The patch will be just the stub class; the 
real migration for Tachyon and the new implementation for HDFS will be submitted 
for review in SPARK-6112. Please let me know if you have any concerns.

> Create off-heap block storage API (internal)
> 
>
> Key: SPARK-6479
> URL: https://issues.apache.org/jira/browse/SPARK-6479
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Reporter: Reynold Xin
> Attachments: SPARK-6479.pdf, SPARK-6479OffheapAPIdesign (1).pdf, 
> SPARK-6479OffheapAPIdesign.pdf, SparkOffheapsupportbyHDFS.pdf
>
>
> Would be great to create APIs for off-heap block stores, rather than doing a 
> bunch of if statements everywhere.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6784) Clean up all the inbound/outbound conversions for DateType

2015-04-08 Thread Davies Liu (JIRA)
Davies Liu created SPARK-6784:
-

 Summary: Clean up all the inbound/outbound conversions for DateType
 Key: SPARK-6784
 URL: https://issues.apache.org/jira/browse/SPARK-6784
 Project: Spark
  Issue Type: Bug
Reporter: Davies Liu


We changed the JvmType of DateType to Int, but there are still some places 
putting java.sql.Date into Row.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6479) Create off-heap block storage API (internal)

2015-04-08 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated SPARK-6479:
--
Attachment: SPARK-6479OffheapAPIdesign (1).pdf

Add exception from implementation

> Create off-heap block storage API (internal)
> 
>
> Key: SPARK-6479
> URL: https://issues.apache.org/jira/browse/SPARK-6479
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Reporter: Reynold Xin
> Attachments: SPARK-6479.pdf, SPARK-6479OffheapAPIdesign (1).pdf, 
> SPARK-6479OffheapAPIdesign.pdf, SparkOffheapsupportbyHDFS.pdf
>
>
> Would be great to create APIs for off-heap block stores, rather than doing a 
> bunch of if statements everywhere.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6677) pyspark.sql nondeterministic issue with row fields

2015-04-08 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486097#comment-14486097
 ] 

Davies Liu commented on SPARK-6677:
---

Spark 1.3.0 also works fine here; I tried different values of N.

> pyspark.sql nondeterministic issue with row fields
> --
>
> Key: SPARK-6677
> URL: https://issues.apache.org/jira/browse/SPARK-6677
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.3.0
> Environment: spark version: spark-1.3.0-bin-hadoop2.4
> python version: Python 2.7.6
> operating system: MacOS, x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Stefano Parmesan
>Assignee: Davies Liu
>  Labels: pyspark, row, sql
>
> The following issue happens only when running pyspark in the Python 
> interpreter; it works correctly with spark-submit.
> Reading two JSON files containing objects with different structures 
> sometimes leads to the definition of wrong Rows, where the fields of one 
> file are used for the other.
> I was able to write sample code that reproduces this issue one out of three 
> times; the code snippet is available at the following link, together with 
> some (very simple) data samples:
> https://gist.github.com/armisael/e08bb4567d0a11efe2db



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6137) G-Means clustering algorithm implementation

2015-04-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486085#comment-14486085
 ] 

Joseph K. Bradley commented on SPARK-6137:
--

[~rumshenoy] Thanks for your interest!  If you're new to Spark contributions, 
I'd strongly recommend starting with smaller patches before working on a new 
algorithm.  This helps you to get used to Spark's review process, coding style, 
etc., and it helps reviewers and committers get to know you (since they need to 
allocate their time for reviewing carefully).  I'd recommend finding a JIRA to 
work on by browsing topics of interest to you and finding ones which sound 
smaller.  Here's some more info on contributing:
[https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark]

Once you're acclimated, you can request to work on a task like this again.  A 
task like this will require:
* Understanding the literature: Is this the best algorithm, and is it commonly 
used enough to be in Spark (as opposed to a package)?  (Others can help out 
here.)
* API design, implementation design, testing, documentation
* Scalability testing: Make sure the distributed implementation is efficient

> G-Means clustering algorithm implementation
> ---
>
> Key: SPARK-6137
> URL: https://issues.apache.org/jira/browse/SPARK-6137
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Denis Dus
>Priority: Minor
>  Labels: clustering
>
> Would it be useful to implement the G-means clustering algorithm based on 
> k-means?
> G-means is a powerful extension of k-means which uses a test of cluster data 
> normality to decide whether it is necessary to split the current cluster 
> into two new ones. 
> Its relative complexity (compared to k-means) is O(K), where K is the 
> maximum number of clusters. 
> The original paper is by Greg Hamerly and Charles Elkan from the University 
> of California:
> [http://papers.nips.cc/paper/2526-learning-the-k-in-k-means.pdf]
> I also have a small prototype of this algorithm written in R (if anyone is 
> interested in it).
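
For readers unfamiliar with the method, here is a hedged, self-contained Scala sketch of the per-cluster split decision at the heart of G-means. For brevity it uses a Jarque-Bera normality statistic on 1-D projections as a stand-in for the Anderson-Darling test used in the paper, and all names are illustrative.

{code}
object GMeansSplitSketch {

  /** Jarque-Bera statistic: large values suggest the sample is not normal. */
  def jarqueBera(xs: Array[Double]): Double = {
    val n = xs.length.toDouble
    val mean = xs.sum / n
    val m2 = xs.map(x => math.pow(x - mean, 2)).sum / n
    val m3 = xs.map(x => math.pow(x - mean, 3)).sum / n
    val m4 = xs.map(x => math.pow(x - mean, 4)).sum / n
    val skewness = m3 / math.pow(m2, 1.5)
    val kurtosis = m4 / (m2 * m2)
    n / 6.0 * (skewness * skewness + math.pow(kurtosis - 3.0, 2) / 4.0)
  }

  /**
   * Split decision for one cluster: project its points onto the line joining
   * the two candidate child centers and test the projections for normality.
   * criticalValue ~ 5.99 corresponds to a 5% chi-squared(2 df) threshold.
   */
  def shouldSplit(points: Array[Array[Double]],
                  childA: Array[Double],
                  childB: Array[Double],
                  criticalValue: Double = 5.99): Boolean = {
    val v = childA.zip(childB).map { case (a, b) => a - b }
    val normSq = v.map(x => x * x).sum
    val projections = points.map(p => p.zip(v).map { case (x, w) => x * w }.sum / normSq)
    jarqueBera(projections) > criticalValue  // non-normal => keep the split
  }
}
{code}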



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6677) pyspark.sql nondeterministic issue with row fields

2015-04-08 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486081#comment-14486081
 ] 

Davies Liu commented on SPARK-6677:
---

I still cannot reproduce it on master; will retry with 1.3.0.

> pyspark.sql nondeterministic issue with row fields
> --
>
> Key: SPARK-6677
> URL: https://issues.apache.org/jira/browse/SPARK-6677
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.3.0
> Environment: spark version: spark-1.3.0-bin-hadoop2.4
> python version: Python 2.7.6
> operating system: MacOS, x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Stefano Parmesan
>Assignee: Davies Liu
>  Labels: pyspark, row, sql
>
> The following issue happens only when running pyspark in the Python 
> interpreter; it works correctly with spark-submit.
> Reading two JSON files containing objects with different structures 
> sometimes leads to the definition of wrong Rows, where the fields of one 
> file are used for the other.
> I was able to write sample code that reproduces this issue one out of three 
> times; the code snippet is available at the following link, together with 
> some (very simple) data samples:
> https://gist.github.com/armisael/e08bb4567d0a11efe2db



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5594) SparkException: Failed to get broadcast (TorrentBroadcast)

2015-04-08 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486034#comment-14486034
 ] 

Josh Rosen commented on SPARK-5594:
---

With the introduction of ContextCleaner, I think there's no longer any reason 
for most users to enable the MetadataCleaner / {{spark.cleaner.ttl}} (except 
perhaps for super-long-lived Spark REPLs where you're worried about orphaning 
RDDs or broadcast variables in your REPL history and having them never get 
cleaned up, although I think this is an uncommon use-case).  I think that this 
property used to be relevant for Spark Streaming jobs, but I think that's no 
longer the case since the latest Streaming docs have removed all mentions of 
{{spark.cleaner.ttl}} (see 
https://github.com/apache/spark/pull/4956/files#diff-dbee746abf610b52d8a7cb65bf9ea765L1817,
 for example).

See 
http://apache-spark-user-list.1001560.n3.nabble.com/is-spark-cleaner-ttl-safe-td2557.html
 for an old, related discussion.  Also, see 
https://github.com/apache/spark/pull/126, the PR that introduced the new 
ContextCleaner mechanism.

We should probably add a deprecation warning to {{spark.cleaner.ttl}} that 
advises users against using it, since it's an unsafe configuration option that 
can lead to confusing behavior if it's misused.
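
A minimal sketch of what such a warning could look like, assuming the check runs wherever the SparkConf is validated at startup; the helper name and message wording below are illustrative, not an existing Spark function.

{code}
import org.apache.spark.SparkConf

object CleanerTtlCheck {
  /** Hypothetical helper: warn when the legacy TTL-based cleaner is configured. */
  def warnIfTtlConfigured(conf: SparkConf): Unit = {
    if (conf.contains("spark.cleaner.ttl")) {
      Console.err.println(
        "WARNING: spark.cleaner.ttl is set. The ContextCleaner already removes " +
          "unreferenced RDDs, shuffles and broadcasts, and TTL-based cleanup can " +
          "drop metadata that is still in use, leading to confusing failures.")
    }
  }
}
{code}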

> SparkException: Failed to get broadcast (TorrentBroadcast)
> --
>
> Key: SPARK-5594
> URL: https://issues.apache.org/jira/browse/SPARK-5594
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0, 1.3.0
>Reporter: John Sandiford
>Priority: Critical
>
> I am uncertain whether this is a bug; however, I am getting the error below 
> when running on a cluster (it works locally), and I have no idea what is 
> causing it or where to look for more information.
> Any help is appreciated.  Others appear to experience the same issue, but I 
> have not found any solutions online.
> Please note that this only happens with certain code and is repeatable, all 
> my other spark jobs work fine.
> {noformat}
> ERROR TaskSetManager: Task 3 in stage 6.0 failed 4 times; aborting job
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 3 in stage 6.0 failed 4 times, most recent failure: 
> Lost task 3.3 in stage 6.0 (TID 24, ): java.io.IOException: 
> org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of 
> broadcast_6
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> Caused by: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 
> of broadcast_6
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
> at scala.Option.getOrElse(Option.scala:120)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174)
> at org.a

[jira] [Closed] (SPARK-4346) YarnClientSchedulerBack.asyncMonitorApplication should be common with Client.monitorApplication

2015-04-08 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-4346.

  Resolution: Fixed
   Fix Version/s: 1.4.0
Assignee: Weizhong
Target Version/s: 1.4.0

> YarnClientSchedulerBack.asyncMonitorApplication should be common with 
> Client.monitorApplication
> ---
>
> Key: SPARK-4346
> URL: https://issues.apache.org/jira/browse/SPARK-4346
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, YARN
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>Assignee: Weizhong
> Fix For: 1.4.0
>
>
> The YarnClientSchedulerBackend.asyncMonitorApplication routine should move 
> into ClientBase and be made common with monitorApplication.  Make sure stop 
> is handled properly.
> See discussion on https://github.com/apache/spark/pull/3143



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


