[jira] [Created] (SPARK-13054) Always post TaskEnd event for tasks in cancelled stages

2016-01-27 Thread Andrew Or (JIRA)
Andrew Or created SPARK-13054:
-

 Summary: Always post TaskEnd event for tasks in cancelled stages
 Key: SPARK-13054
 URL: https://issues.apache.org/jira/browse/SPARK-13054
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 1.0.0
Reporter: Andrew Or
Assignee: Andrew Or


{code}
// The success case is dealt with separately below.
// TODO: Why post it only for failed tasks in cancelled stages? Clarify semantics here.
if (event.reason != Success) {
  val attemptId = task.stageAttemptId
  listenerBus.post(SparkListenerTaskEnd(
    stageId, attemptId, taskType, event.reason, event.taskInfo, taskMetrics))
}
{code}

Today we only post task end events for canceled stages if the task failed. 
There is no reason why we shouldn't just post it for all the tasks, including 
the ones that succeeded. If we do that we will be able to simplify another 
branch in the DAGScheduler, which needs a lot of simplification.






[jira] [Created] (SPARK-13053) Rectify ignored tests in InternalAccumulatorSuite

2016-01-27 Thread Andrew Or (JIRA)
Andrew Or created SPARK-13053:
-

 Summary: Rectify ignored tests in InternalAccumulatorSuite
 Key: SPARK-13053
 URL: https://issues.apache.org/jira/browse/SPARK-13053
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Tests
Reporter: Andrew Or
Assignee: Andrew Or









[jira] [Commented] (SPARK-13047) Pyspark Params.hasParam should not throw an error

2016-01-27 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120278#comment-15120278
 ] 

Seth Hendrickson commented on SPARK-13047:
--

cc [~mengxr] who wrote the initial implementation.

> Pyspark Params.hasParam should not throw an error
> -
>
> Key: SPARK-13047
> URL: https://issues.apache.org/jira/browse/SPARK-13047
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Seth Hendrickson
>Priority: Minor
>
> Pyspark {{Params}} class has a method {{hasParam(paramName)}} which returns 
> True if the class has a parameter by that name, but throws an 
> {{AttributeError}} otherwise. There is not currently a way of getting a 
> Boolean to indicate if a class has a parameter. With Spark 2.0 we could 
> modify the existing behavior of {{hasParam}} or add an additional method with 
> this functionality.
> In Python:
> {code}
> from pyspark.ml.classification import NaiveBayes
> nb = NaiveBayes(smoothing=0.5)
> print nb.hasParam("smoothing")
> print nb.hasParam("notAParam")
> {code}
> produces:
> > True
> > AttributeError: 'NaiveBayes' object has no attribute 'notAParam'
> However, in Scala:
> {code}
> import org.apache.spark.ml.classification.NaiveBayes
> val nb  = new NaiveBayes()
> nb.hasParam("smoothing")
> nb.hasParam("notAParam")
> {code}
> produces:
> > true
> > false
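
A minimal sketch of one possible direction, in plain Python (the standalone helper {{has_param}} below is hypothetical, not the actual change proposed in the PR): treat a missing or non-Param attribute as "no such param" and return False instead of letting the AttributeError escape.

{code}
from pyspark.ml.param import Param

def has_param(instance, param_name):
    """Return True if `instance` exposes a Param named `param_name`, else False."""
    try:
        attr = getattr(instance, param_name)
    except AttributeError:
        return False
    # pyspark.ml Params subclasses expose their parameters as Param-typed attributes.
    return isinstance(attr, Param)
{code}

With something along these lines, {{has_param(nb, "notAParam")}} would return False rather than raising, matching the Scala behaviour shown above.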






[jira] [Assigned] (SPARK-13052) waitingApps metric doesn't show the number of apps currently in the WAITING state

2016-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13052:


Assignee: (was: Apache Spark)

> waitingApps metric doesn't show the number of apps currently in the WAITING 
> state
> -
>
> Key: SPARK-13052
> URL: https://issues.apache.org/jira/browse/SPARK-13052
> Project: Spark
>  Issue Type: Bug
>Reporter: Raafat Akkad
>Priority: Minor
> Attachments: Correct waitingApps.png, Incorrect waitingApps.png
>
>
> The master.waitingApps metric appears to not show the number of apps in the 
> WAITING state.






[jira] [Commented] (SPARK-13047) Pyspark Params.hasParam should not throw an error

2016-01-27 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120469#comment-15120469
 ] 

Seth Hendrickson commented on SPARK-13047:
--

Since the change is small, I have submitted a PR. If we decide this is not the 
best route, we can close it.

> Pyspark Params.hasParam should not throw an error
> -
>
> Key: SPARK-13047
> URL: https://issues.apache.org/jira/browse/SPARK-13047
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Seth Hendrickson
>Priority: Minor
>
> Pyspark {{Params}} class has a method {{hasParam(paramName)}} which returns 
> True if the class has a parameter by that name, but throws an 
> {{AttributeError}} otherwise. There is not currently a way of getting a 
> Boolean to indicate if a class has a parameter. With Spark 2.0 we could 
> modify the existing behavior of {{hasParam}} or add an additional method with 
> this functionality.
> In Python:
> {code}
> from pyspark.ml.classification import NaiveBayes
> nb = NaiveBayes(smoothing=0.5)
> print nb.hasParam("smoothing")
> print nb.hasParam("notAParam")
> {code}
> produces:
> > True
> > AttributeError: 'NaiveBayes' object has no attribute 'notAParam'
> However, in Scala:
> {code}
> import org.apache.spark.ml.classification.NaiveBayes
> val nb  = new NaiveBayes()
> nb.hasParam("smoothing")
> nb.hasParam("notAParam")
> {code}
> produces:
> > true
> > false






[jira] [Updated] (SPARK-13052) waitingApps metric doesn't show the number of apps currently in the WAITING state

2016-01-27 Thread Raafat Akkad (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raafat Akkad updated SPARK-13052:
-
Attachment: Incorrect waitingApps.png

> waitingApps metric doesn't show the number of apps currently in the WAITING 
> state
> -
>
> Key: SPARK-13052
> URL: https://issues.apache.org/jira/browse/SPARK-13052
> Project: Spark
>  Issue Type: Bug
>Reporter: Raafat Akkad
>Priority: Minor
> Attachments: Correct waitingApps.png, Incorrect waitingApps.png
>
>
> The master.waitingApps metric appears to not show the number of apps in the 
> WAITING state.






[jira] [Updated] (SPARK-13053) Rectify ignored tests in InternalAccumulatorSuite

2016-01-27 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-13053:
--
Target Version/s: 2.0.0

> Rectify ignored tests in InternalAccumulatorSuite
> -
>
> Key: SPARK-13053
> URL: https://issues.apache.org/jira/browse/SPARK-13053
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> {code}
> // TODO: these two tests are incorrect; they don't actually trigger stage retries.
> ignore("internal accumulators in fully resubmitted stages") {
>   testInternalAccumulatorsWithFailedTasks((i: Int) => true) // fail all tasks
> }
> ignore("internal accumulators in partially resubmitted stages") {
>   testInternalAccumulatorsWithFailedTasks((i: Int) => i % 2 == 0) // fail a subset
> }
> {code}






[jira] [Commented] (SPARK-12783) Dataset map serialization error

2016-01-27 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120421#comment-15120421
 ] 

Wenchen Fan commented on SPARK-12783:
-

ah... I finally reproduced it in 1.6.0! Luckily this bug has already been fixed 
in the 1.6 branch; could you give it a try?

> Dataset map serialization error
> ---
>
> Key: SPARK-12783
> URL: https://issues.apache.org/jira/browse/SPARK-12783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Muthu Jayakumar
>Assignee: Wenchen Fan
>Priority: Critical
> Attachments: MyMap.scala
>
>
> When Dataset API is used to map to another case class, an error is thrown.
> {code}
> case class MyMap(map: Map[String, String])
> case class TestCaseClass(a: String, b: String){
>   def toMyMap: MyMap = {
> MyMap(Map(a->b))
>   }
>   def toStr: String = {
> a
>   }
> }
> //Main method section below
> import sqlContext.implicits._
> val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", "data1"), 
> TestCaseClass("2015-05-01", "data2"))).toDF()
> df1.as[TestCaseClass].map(_.toStr).show() //works fine
> df1.as[TestCaseClass].map(_.toMyMap).show() //fails
> {code}
> Error message:
> {quote}
> Caused by: java.io.NotSerializableException: 
> scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1
> Serialization stack:
>   - object not serializable (class: 
> scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1, value: 
> package lang)
>   - field (class: scala.reflect.internal.Types$ThisType, name: sym, type: 
> class scala.reflect.internal.Symbols$Symbol)
>   - object (class scala.reflect.internal.Types$UniqueThisType, 
> java.lang.type)
>   - field (class: scala.reflect.internal.Types$TypeRef, name: pre, type: 
> class scala.reflect.internal.Types$Type)
>   - object (class scala.reflect.internal.Types$ClassNoArgsTypeRef, String)
>   - field (class: scala.reflect.internal.Types$TypeRef, name: normalized, 
> type: class scala.reflect.internal.Types$Type)
>   - object (class scala.reflect.internal.Types$AliasNoArgsTypeRef, String)
>   - field (class: 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, name: keyType$1, 
> type: class scala.reflect.api.Types$TypeApi)
>   - object (class 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, )
>   - field (class: org.apache.spark.sql.catalyst.expressions.MapObjects, 
> name: function, type: interface scala.Function1)
>   - object (class org.apache.spark.sql.catalyst.expressions.MapObjects, 
> mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType))
>   - field (class: org.apache.spark.sql.catalyst.expressions.Invoke, name: 
> targetObject, type: class 
> org.apache.spark.sql.catalyst.expressions.Expression)
>   - object (class org.apache.spark.sql.catalyst.expressions.Invoke, 
> invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
>  [Ljava.lang.Object;)))
>   - writeObject data (class: 
> scala.collection.immutable.List$SerializationProxy)
>   - object (class scala.collection.immutable.List$SerializationProxy, 
> scala.collection.immutable.List$SerializationProxy@4c7e3aab)
>   - writeReplace data (class: 
> scala.collection.immutable.List$SerializationProxy)
>   - object (class scala.collection.immutable.$colon$colon, 
> List(invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
>  [Ljava.lang.Object;)), 
> invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),valueArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
>  [Ljava.lang.Object;
>   - field (class: org.apache.spark.sql.catalyst.expressions.StaticInvoke, 
> name: arguments, type: interface scala.collection.Seq)
>   - object (class org.apache.spark.sql.catalyst.expressions.StaticInvoke, 
> staticinvoke(class 
> org.apache.spark.sql.catalyst.util.ArrayBasedMapData$,ObjectType(interface 
> scala.collection.Map),toScalaMap,invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> 

[jira] [Assigned] (SPARK-13047) Pyspark Params.hasParam should not throw an error

2016-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13047:


Assignee: Apache Spark

> Pyspark Params.hasParam should not throw an error
> -
>
> Key: SPARK-13047
> URL: https://issues.apache.org/jira/browse/SPARK-13047
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Seth Hendrickson
>Assignee: Apache Spark
>Priority: Minor
>
> Pyspark {{Params}} class has a method {{hasParam(paramName)}} which returns 
> True if the class has a parameter by that name, but throws an 
> {{AttributeError}} otherwise. There is not currently a way of getting a 
> Boolean to indicate if a class has a parameter. With Spark 2.0 we could 
> modify the existing behavior of {{hasParam}} or add an additional method with 
> this functionality.
> In Python:
> {code}
> from pyspark.ml.classification import NaiveBayes
> nb = NaiveBayes(smoothing=0.5)
> print nb.hasParam("smoothing")
> print nb.hasParam("notAParam")
> {code}
> produces:
> > True
> > AttributeError: 'NaiveBayes' object has no attribute 'notAParam'
> However, in Scala:
> {code}
> import org.apache.spark.ml.classification.NaiveBayes
> val nb  = new NaiveBayes()
> nb.hasParam("smoothing")
> nb.hasParam("notAParam")
> {code}
> produces:
> > true
> > false






[jira] [Assigned] (SPARK-13047) Pyspark Params.hasParam should not throw an error

2016-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13047:


Assignee: (was: Apache Spark)

> Pyspark Params.hasParam should not throw an error
> -
>
> Key: SPARK-13047
> URL: https://issues.apache.org/jira/browse/SPARK-13047
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Seth Hendrickson
>Priority: Minor
>
> Pyspark {{Params}} class has a method {{hasParam(paramName)}} which returns 
> True if the class has a parameter by that name, but throws an 
> {{AttributeError}} otherwise. There is not currently a way of getting a 
> Boolean to indicate if a class has a parameter. With Spark 2.0 we could 
> modify the existing behavior of {{hasParam}} or add an additional method with 
> this functionality.
> In Python:
> {code}
> from pyspark.ml.classification import NaiveBayes
> nb = NaiveBayes(smoothing=0.5)
> print nb.hasParam("smoothing")
> print nb.hasParam("notAParam")
> {code}
> produces:
> > True
> > AttributeError: 'NaiveBayes' object has no attribute 'notAParam'
> However, in Scala:
> {code}
> import org.apache.spark.ml.classification.NaiveBayes
> val nb  = new NaiveBayes()
> nb.hasParam("smoothing")
> nb.hasParam("notAParam")
> {code}
> produces:
> > true
> > false






[jira] [Commented] (SPARK-13047) Pyspark Params.hasParam should not throw an error

2016-01-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120463#comment-15120463
 ] 

Apache Spark commented on SPARK-13047:
--

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/10962

> Pyspark Params.hasParam should not throw an error
> -
>
> Key: SPARK-13047
> URL: https://issues.apache.org/jira/browse/SPARK-13047
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Seth Hendrickson
>Priority: Minor
>
> Pyspark {{Params}} class has a method {{hasParam(paramName)}} which returns 
> True if the class has a parameter by that name, but throws an 
> {{AttributeError}} otherwise. There is not currently a way of getting a 
> Boolean to indicate if a class has a parameter. With Spark 2.0 we could 
> modify the existing behavior of {{hasParam}} or add an additional method with 
> this functionality.
> In Python:
> {code}
> from pyspark.ml.classification import NaiveBayes
> nb = NaiveBayes(smoothing=0.5)
> print nb.hasParam("smoothing")
> print nb.hasParam("notAParam")
> {code}
> produces:
> > True
> > AttributeError: 'NaiveBayes' object has no attribute 'notAParam'
> However, in Scala:
> {code}
> import org.apache.spark.ml.classification.NaiveBayes
> val nb  = new NaiveBayes()
> nb.hasParam("smoothing")
> nb.hasParam("notAParam")
> {code}
> produces:
> > true
> > false






[jira] [Commented] (SPARK-12683) SQL timestamp is wrong when accessed as Python datetime

2016-01-27 Thread Daniel Jalova (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120471#comment-15120471
 ] 

Daniel Jalova commented on SPARK-12683:
---

This seems to be related to the year 2038 problem, where some systems aren't able 
to encode times after 03:14:07 UTC on 19 January 2038. Timestamps for dates 
before this point work fine on my machines, but for later dates the time seems 
to be off by an hour.
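
For reference, a quick standalone Python check of that boundary (independent of Spark): 2**31 - 1 seconds after the Unix epoch lands exactly on the cutoff mentioned above.

{code}
from datetime import datetime, timezone

# The signed 32-bit time_t rollover point: 2**31 - 1 seconds after 1970-01-01 UTC.
limit = datetime.fromtimestamp(2**31 - 1, tz=timezone.utc)
print(limit)  # 2038-01-19 03:14:07+00:00
{code}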

> SQL timestamp is wrong when accessed as Python datetime
> ---
>
> Key: SPARK-12683
> URL: https://issues.apache.org/jira/browse/SPARK-12683
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.1, 1.5.2, 1.6.0
> Environment: Windows 7 Pro x64
> Python 3.4.3
> py4j 0.9
>Reporter: Gerhard Fiedler
> Attachments: spark_bug_date.py
>
>
> When accessing SQL timestamp data through {{.show()}}, it looks correct, but 
> when accessing it (as Python {{datetime}}) through {{.collect()}}, it is 
> wrong.
> {code}
> from datetime import datetime
> from pyspark import SparkContext
> from pyspark.sql import SQLContext
> if __name__ == "__main__":
>     spark_context = SparkContext(appName='SparkBugTimestampHour')
>     sql_context = SQLContext(spark_context)
>     sql_text = """select cast('2100-09-09 12:11:10.09' as timestamp) as ts"""
>     data_frame = sql_context.sql(sql_text)
>     data_frame.show(truncate=False)
>     # Result from .show() (as expected, looks correct):
>     # +----------------------+
>     # |ts                    |
>     # +----------------------+
>     # |2100-09-09 12:11:10.09|
>     # +----------------------+
>     rows = data_frame.collect()
>     row = rows[0]
>     ts = row[0]
>     print('ts={ts}'.format(ts=ts))
>     # Expected result from this print statement:
>     # ts=2100-09-09 12:11:10.09
>     #
>     # Actual, wrong result (note the hours being 18 instead of 12):
>     # ts=2100-09-09 18:11:10.09
>     #
>     # This error seems to be dependent on some characteristic of the system.
>     # We couldn't reproduce this on all of our systems, but it is not clear
>     # what the differences are. One difference is the processor: it failed
>     # on Intel Xeon E5-2687W v2.
>     assert isinstance(ts, datetime)
>     assert ts.year == 2100 and ts.month == 9 and ts.day == 9
>     assert ts.minute == 11 and ts.second == 10 and ts.microsecond == 9
>     if ts.hour != 12:
>         print('hour is not correct; should be 12, is actually {hour}'.format(hour=ts.hour))
>     spark_context.stop()
> {code}






[jira] [Updated] (SPARK-13047) Pyspark Params.hasParam should not throw an error

2016-01-27 Thread Seth Hendrickson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seth Hendrickson updated SPARK-13047:
-
Description: 
Pyspark {{Params}} class has a method {{hasParam(paramName)}} which returns 
True if the class has a parameter by that name, but throws an 
{{AttributeError}} otherwise. There is not currently a way of getting a Boolean 
to indicate if a class has a parameter. With Spark 2.0 we could modify the 
existing behavior of {{hasParam}} or add an additional method with this 
functionality.

In Python:
{code}
from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes(smoothing=0.5)
print nb.hasParam("smoothing")
print nb.hasParam("notAParam")
{code}

produces:
> True
> AttributeError: 'NaiveBayes' object has no attribute 'notAParam'

However, in Scala:
{code}
import org.apache.spark.ml.classification.NaiveBayes
val nb  = new NaiveBayes()
nb.hasParam("smoothing")
nb.hasParam("notAParam")
{code}

produces:
> true
> false

  was:Pyspark {{Params}} class has a method {{hasParam(paramName)}} which 
returns True if the class has a parameter by that name, but throws an 
{{AttributeError}} otherwise. There is not currently a way of getting a Boolean 
to indicate if a class has a parameter. With Spark 2.0 we could modify the 
existing behavior of {{hasParam}} or add an additional method with this 
functionality.


> Pyspark Params.hasParam should not throw an error
> -
>
> Key: SPARK-13047
> URL: https://issues.apache.org/jira/browse/SPARK-13047
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Seth Hendrickson
>Priority: Minor
>
> Pyspark {{Params}} class has a method {{hasParam(paramName)}} which returns 
> True if the class has a parameter by that name, but throws an 
> {{AttributeError}} otherwise. There is not currently a way of getting a 
> Boolean to indicate if a class has a parameter. With Spark 2.0 we could 
> modify the existing behavior of {{hasParam}} or add an additional method with 
> this functionality.
> In Python:
> {code}
> from pyspark.ml.classification import NaiveBayes
> nb = NaiveBayes(smoothing=0.5)
> print nb.hasParam("smoothing")
> print nb.hasParam("notAParam")
> {code}
> produces:
> > True
> > AttributeError: 'NaiveBayes' object has no attribute 'notAParam'
> However, in Scala:
> {code}
> import org.apache.spark.ml.classification.NaiveBayes
> val nb  = new NaiveBayes()
> nb.hasParam("smoothing")
> nb.hasParam("notAParam")
> {code}
> produces:
> > true
> > false






[jira] [Created] (SPARK-13052) waitingApps metric doesn't show the number of apps currently in the WAITING state

2016-01-27 Thread Raafat Akkad (JIRA)
Raafat Akkad created SPARK-13052:


 Summary: waitingApps metric doesn't show the number of apps 
currently in the WAITING state
 Key: SPARK-13052
 URL: https://issues.apache.org/jira/browse/SPARK-13052
 Project: Spark
  Issue Type: Bug
Reporter: Raafat Akkad
Priority: Minor


The master.waitingApps metric appears to not show the number of apps in the 
WAITING state.






[jira] [Updated] (SPARK-12913) Reimplement stddev/variance as declarative function

2016-01-27 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-12913:
---
Summary: Reimplement stddev/variance as declarative function  (was: 
Reimplement all builtin aggregate functions as declarative function)

> Reimplement stddev/variance as declarative function
> ---
>
> Key: SPARK-12913
> URL: https://issues.apache.org/jira/browse/SPARK-12913
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>
> As benchmarked and discussed here: 
> https://github.com/apache/spark/pull/10786/files#r50038294.
> Benefiting from codegen, a declarative aggregate function can be much faster 
> than an imperative one, so we should re-implement all the builtin aggregate 
> functions as declarative ones.
> For skewness and kurtosis, we need to benchmark them to make sure that the 
> declarative versions are actually faster than the imperative ones.
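
As a rough illustration of the shape a declarative implementation has to take (a conceptual Python sketch, not Catalyst code; all names here are made up): variance reduces to a fixed buffer of sufficient statistics with pure update/merge/evaluate steps, which is exactly what codegen can inline.

{code}
# Conceptual sketch only: variance kept as a (count, mean, M2) buffer.
def update(buf, x):
    n, mean, m2 = buf
    n += 1
    delta = x - mean
    mean += delta / n
    return n, mean, m2 + delta * (x - mean)

def merge(a, b):
    (na, ma, m2a), (nb, mb, m2b) = a, b
    if na == 0:
        return b
    if nb == 0:
        return a
    n = na + nb
    delta = mb - ma
    return n, ma + delta * nb / n, m2a + m2b + delta * delta * na * nb / n

def evaluate(buf):
    n, _, m2 = buf
    return m2 / (n - 1) if n > 1 else float("nan")

buf = (0, 0.0, 0.0)
for x in [1.0, 2.0, 4.0]:
    buf = update(buf, x)
print(evaluate(buf))  # sample variance of [1, 2, 4] ~= 2.333
{code}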






[jira] [Updated] (SPARK-13053) Rectify ignored tests in InternalAccumulatorSuite

2016-01-27 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-13053:
--
Description: 
{code}
// TODO: these two tests are incorrect; they don't actually trigger stage retries.
ignore("internal accumulators in fully resubmitted stages") {
  testInternalAccumulatorsWithFailedTasks((i: Int) => true) // fail all tasks
}

ignore("internal accumulators in partially resubmitted stages") {
  testInternalAccumulatorsWithFailedTasks((i: Int) => i % 2 == 0) // fail a subset
}
{code}

> Rectify ignored tests in InternalAccumulatorSuite
> -
>
> Key: SPARK-13053
> URL: https://issues.apache.org/jira/browse/SPARK-13053
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> {code}
> // TODO: these two tests are incorrect; they don't actually trigger stage retries.
> ignore("internal accumulators in fully resubmitted stages") {
>   testInternalAccumulatorsWithFailedTasks((i: Int) => true) // fail all tasks
> }
> ignore("internal accumulators in partially resubmitted stages") {
>   testInternalAccumulatorsWithFailedTasks((i: Int) => i % 2 == 0) // fail a subset
> }
> {code}






[jira] [Assigned] (SPARK-12963) In cluster mode,spark_local_ip will cause driver exception:Service 'Driver' failed after 16 retries!

2016-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12963:


Assignee: (was: Apache Spark)

> In cluster mode,spark_local_ip will cause driver exception:Service 'Driver' 
> failed  after 16 retries!
> -
>
> Key: SPARK-12963
> URL: https://issues.apache.org/jira/browse/SPARK-12963
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.6.0
>Reporter: lichenglin
>Priority: Critical
>
> I have a 3-node cluster: namenode, second, and data1.
> I use this shell command to submit a job on namenode:
> bin/spark-submit --deploy-mode cluster --class com.bjdv.spark.job.Abc 
> --total-executor-cores 5 --master spark://namenode:6066 
> hdfs://namenode:9000/sparkjars/spark.jar
> The driver may be started on another node, such as data1.
> The problem is:
> when I set SPARK_LOCAL_IP in conf/spark-env.sh on namenode,
> the driver is started with that setting, e.g.
> SPARK_LOCAL_IP=namenode,
> but if the driver starts on data1,
> it will try to bind the address 'namenode' on data1,
> so the driver will throw an exception like this:
>  Service 'Driver' failed after 16 retries!






[jira] [Commented] (SPARK-12963) In cluster mode,spark_local_ip will cause driver exception:Service 'Driver' failed after 16 retries!

2016-01-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120414#comment-15120414
 ] 

Apache Spark commented on SPARK-12963:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/10960

> In cluster mode,spark_local_ip will cause driver exception:Service 'Driver' 
> failed  after 16 retries!
> -
>
> Key: SPARK-12963
> URL: https://issues.apache.org/jira/browse/SPARK-12963
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.6.0
>Reporter: lichenglin
>Priority: Critical
>
> I have a 3-node cluster: namenode, second, and data1.
> I use this shell command to submit a job on namenode:
> bin/spark-submit --deploy-mode cluster --class com.bjdv.spark.job.Abc 
> --total-executor-cores 5 --master spark://namenode:6066 
> hdfs://namenode:9000/sparkjars/spark.jar
> The driver may be started on another node, such as data1.
> The problem is:
> when I set SPARK_LOCAL_IP in conf/spark-env.sh on namenode,
> the driver is started with that setting, e.g.
> SPARK_LOCAL_IP=namenode,
> but if the driver starts on data1,
> it will try to bind the address 'namenode' on data1,
> so the driver will throw an exception like this:
>  Service 'Driver' failed after 16 retries!






[jira] [Assigned] (SPARK-12963) In cluster mode,spark_local_ip will cause driver exception:Service 'Driver' failed after 16 retries!

2016-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12963:


Assignee: Apache Spark

> In cluster mode,spark_local_ip will cause driver exception:Service 'Driver' 
> failed  after 16 retries!
> -
>
> Key: SPARK-12963
> URL: https://issues.apache.org/jira/browse/SPARK-12963
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.6.0
>Reporter: lichenglin
>Assignee: Apache Spark
>Priority: Critical
>
> I have a 3-node cluster: namenode, second, and data1.
> I use this shell command to submit a job on namenode:
> bin/spark-submit --deploy-mode cluster --class com.bjdv.spark.job.Abc 
> --total-executor-cores 5 --master spark://namenode:6066 
> hdfs://namenode:9000/sparkjars/spark.jar
> The driver may be started on another node, such as data1.
> The problem is:
> when I set SPARK_LOCAL_IP in conf/spark-env.sh on namenode,
> the driver is started with that setting, e.g.
> SPARK_LOCAL_IP=namenode,
> but if the driver starts on data1,
> it will try to bind the address 'namenode' on data1,
> so the driver will throw an exception like this:
>  Service 'Driver' failed after 16 retries!






[jira] [Assigned] (SPARK-13043) Support remaining types in ColumnarBatch

2016-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13043:


Assignee: Apache Spark

> Support remaining types in ColumnarBatch
> 
>
> Key: SPARK-13043
> URL: https://issues.apache.org/jira/browse/SPARK-13043
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Nong Li
>Assignee: Apache Spark
>
> The current implementation throws "Not yet implemented" for some types. 
> Implementing the remaining types should be straightforward.






[jira] [Assigned] (SPARK-13043) Support remaining types in ColumnarBatch

2016-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13043:


Assignee: (was: Apache Spark)

> Support remaining types in ColumnarBatch
> 
>
> Key: SPARK-13043
> URL: https://issues.apache.org/jira/browse/SPARK-13043
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Nong Li
>
> The current implementation throws "Not yet implemented" for some types. 
> Implementing the remaining types should be straightforward.






[jira] [Commented] (SPARK-13043) Support remaining types in ColumnarBatch

2016-01-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120430#comment-15120430
 ] 

Apache Spark commented on SPARK-13043:
--

User 'nongli' has created a pull request for this issue:
https://github.com/apache/spark/pull/10961

> Support remaining types in ColumnarBatch
> 
>
> Key: SPARK-13043
> URL: https://issues.apache.org/jira/browse/SPARK-13043
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Nong Li
>
> The current implementation throws "Not yet implemented" for some types. 
> Implementing the remaining types should be straightforward.






[jira] [Commented] (SPARK-10620) Look into whether accumulator mechanism can replace TaskMetrics

2016-01-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120311#comment-15120311
 ] 

Apache Spark commented on SPARK-10620:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/10958

> Look into whether accumulator mechanism can replace TaskMetrics
> ---
>
> Key: SPARK-10620
> URL: https://issues.apache.org/jira/browse/SPARK-10620
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Andrew Or
> Attachments: accums-and-task-metrics.pdf
>
>
> This task is simply to explore whether the internal representation used by 
> TaskMetrics could be implemented using accumulators rather than having two 
> separate mechanisms. Note that we need to continue to preserve the existing 
> "Task Metric" data structures that are exposed to users through event logs 
> etc. The question is can we use a single internal codepath and perhaps make 
> this easier to extend in the future.
> I think a full exploration would answer the following questions:
> - How do the semantics of accumulators on stage retries differ from aggregate 
> TaskMetrics for a stage? Could we implement clearer retry semantics for 
> internal accumulators to allow them to be the same - for instance, zeroing 
> accumulator values if a stage is retried (see discussion here: SPARK-10042).
> - Are there metrics that do not fit well into the accumulator model, or would 
> be difficult to update as an accumulator.
> - If we expose metrics through accumulators in the future rather than 
> continuing to add fields to TaskMetrics, what is the best way to coerce 
> compatibility?
> - Are there any other considerations?
> - Is it worth it to do this, or is the consolidation too complicated to 
> justify?
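
For context, a minimal PySpark example of the accumulator mechanism the ticket proposes to build on (the counter here is just illustrative; it is not an actual TaskMetrics field being replaced):

{code}
from pyspark import SparkContext

sc = SparkContext(appName="AccumulatorSketch")
# One accumulator standing in for a per-task counter (e.g. "records processed").
records = sc.accumulator(0)

def process(x):
    records.add(1)  # updated on executors, merged back on the driver

sc.parallelize(range(100), 4).foreach(process)
print(records.value)  # 100, aggregated across all tasks
sc.stop()
{code}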






[jira] [Created] (SPARK-13051) Do not maintain global singleton map for accumulators

2016-01-27 Thread Andrew Or (JIRA)
Andrew Or created SPARK-13051:
-

 Summary: Do not maintain global singleton map for accumulators
 Key: SPARK-13051
 URL: https://issues.apache.org/jira/browse/SPARK-13051
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Andrew Or


Right now we register accumulators through `Accumulators.register` and then read 
them in the DAGScheduler through `Accumulators.get`. This design is very awkward 
and makes it hard to associate a list of accumulators with a particular 
SparkContext. A global singleton is not a good pattern in general.
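
A tiny sketch of the contrast being argued for (plain Python with hypothetical names, not Spark code): a module-level map serves every context in the process, whereas a per-context registry keeps each SparkContext's accumulators together and lets them be dropped with it.

{code}
# Hypothetical illustration only (not Spark internals).
_GLOBAL_ACCUMULATORS = {}  # today's shape: one process-wide map shared by all contexts

class PerContextRegistry:
    """The proposed shape: each SparkContext owns its own accumulator map."""
    def __init__(self):
        self._accumulators = {}

    def register(self, acc_id, acc):
        self._accumulators[acc_id] = acc

    def get(self, acc_id):
        return self._accumulators.get(acc_id)

    def clear(self):
        # Cleared together with the owning context instead of leaking globally.
        self._accumulators.clear()
{code}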






[jira] [Commented] (SPARK-13054) Always post TaskEnd event for tasks in cancelled stages

2016-01-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120362#comment-15120362
 ] 

Apache Spark commented on SPARK-13054:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/10958

> Always post TaskEnd event for tasks in cancelled stages
> ---
>
> Key: SPARK-13054
> URL: https://issues.apache.org/jira/browse/SPARK-13054
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> {code}
> // The success case is dealt with separately below.
> // TODO: Why post it only for failed tasks in cancelled stages? Clarify semantics here.
> if (event.reason != Success) {
>   val attemptId = task.stageAttemptId
>   listenerBus.post(SparkListenerTaskEnd(
>     stageId, attemptId, taskType, event.reason, event.taskInfo, taskMetrics))
> }
> {code}
> Today we only post task end events for canceled stages if the task failed. 
> There is no reason why we shouldn't just post it for all the tasks, including 
> the ones that succeeded. If we do that we will be able to simplify another 
> branch in the DAGScheduler, which needs a lot of simplification.






[jira] [Assigned] (SPARK-13054) Always post TaskEnd event for tasks in cancelled stages

2016-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13054:


Assignee: Apache Spark  (was: Andrew Or)

> Always post TaskEnd event for tasks in cancelled stages
> ---
>
> Key: SPARK-13054
> URL: https://issues.apache.org/jira/browse/SPARK-13054
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>
> {code}
> // The success case is dealt with separately below.
> // TODO: Why post it only for failed tasks in cancelled stages? Clarify semantics here.
> if (event.reason != Success) {
>   val attemptId = task.stageAttemptId
>   listenerBus.post(SparkListenerTaskEnd(
>     stageId, attemptId, taskType, event.reason, event.taskInfo, taskMetrics))
> }
> {code}
> Today we only post task end events for canceled stages if the task failed. 
> There is no reason why we shouldn't just post it for all the tasks, including 
> the ones that succeeded. If we do that we will be able to simplify another 
> branch in the DAGScheduler, which needs a lot of simplification.






[jira] [Assigned] (SPARK-13054) Always post TaskEnd event for tasks in cancelled stages

2016-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13054:


Assignee: Andrew Or  (was: Apache Spark)

> Always post TaskEnd event for tasks in cancelled stages
> ---
>
> Key: SPARK-13054
> URL: https://issues.apache.org/jira/browse/SPARK-13054
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> {code}
> // The success case is dealt with separately below.
> // TODO: Why post it only for failed tasks in cancelled stages? Clarify semantics here.
> if (event.reason != Success) {
>   val attemptId = task.stageAttemptId
>   listenerBus.post(SparkListenerTaskEnd(
>     stageId, attemptId, taskType, event.reason, event.taskInfo, taskMetrics))
> }
> {code}
> Today we only post task end events for canceled stages if the task failed. 
> There is no reason why we shouldn't just post it for all the tasks, including 
> the ones that succeeded. If we do that we will be able to simplify another 
> branch in the DAGScheduler, which needs a lot of simplification.






[jira] [Updated] (SPARK-13053) Rectify ignored tests in InternalAccumulatorSuite

2016-01-27 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-13053:
--
Affects Version/s: 2.0.0

> Rectify ignored tests in InternalAccumulatorSuite
> -
>
> Key: SPARK-13053
> URL: https://issues.apache.org/jira/browse/SPARK-13053
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> {code}
> // TODO: these two tests are incorrect; they don't actually trigger stage retries.
> ignore("internal accumulators in fully resubmitted stages") {
>   testInternalAccumulatorsWithFailedTasks((i: Int) => true) // fail all tasks
> }
> ignore("internal accumulators in partially resubmitted stages") {
>   testInternalAccumulatorsWithFailedTasks((i: Int) => i % 2 == 0) // fail a subset
> }
> {code}






[jira] [Commented] (SPARK-4144) Support incremental model training of Naive Bayes classifier

2016-01-27 Thread Imran Younus (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120306#comment-15120306
 ] 

Imran Younus commented on SPARK-4144:
-

[~cfregly] [~josephkb] If there is still interest in this jira, I would really 
like to work on this!

> Support incremental model training of Naive Bayes classifier
> 
>
> Key: SPARK-4144
> URL: https://issues.apache.org/jira/browse/SPARK-4144
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Streaming
>Reporter: Chris Fregly
>Assignee: Chris Fregly
>
> Per Xiangrui Meng from the following user list discussion:  
> http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3CCAJgQjQ_QjMGO=jmm8weq1v8yqfov8du03abzy7eeavgjrou...@mail.gmail.com%3E
>
> "For Naive Bayes, we need to update the priors and conditional
> probabilities, which means we should also remember the number of
> observations for the updates."
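
A conceptual sketch of that update rule (plain Python/NumPy, not the MLlib API; the class and method names are made up): keep raw class and feature counts so the priors and conditional probabilities can be recomputed after every mini-batch.

{code}
import numpy as np

class IncrementalMultinomialNB:
    """Toy incremental Naive Bayes: fold mini-batches into running counts."""
    def __init__(self, num_classes, num_features, smoothing=1.0):
        self.smoothing = smoothing
        self.class_counts = np.zeros(num_classes)                    # observations per class
        self.feature_counts = np.zeros((num_classes, num_features))  # feature counts per class

    def update(self, labels, features):
        """labels: iterable of class indices; features: matching count vectors."""
        for y, x in zip(labels, features):
            self.class_counts[y] += 1
            self.feature_counts[y] += x

    def log_priors(self):
        total = self.class_counts.sum()
        k = len(self.class_counts)
        return np.log((self.class_counts + self.smoothing) / (total + self.smoothing * k))

    def log_conditionals(self):
        totals = self.feature_counts.sum(axis=1, keepdims=True)
        v = self.feature_counts.shape[1]
        return np.log((self.feature_counts + self.smoothing) / (totals + self.smoothing * v))
{code}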






[jira] [Commented] (SPARK-13037) PySpark ml.recommendation support export/import

2016-01-27 Thread Evan Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120341#comment-15120341
 ] 

Evan Chen commented on SPARK-13037:
---

Hey Yanbo,

I can take this one.

Thank you

> PySpark ml.recommendation support export/import
> ---
>
> Key: SPARK-13037
> URL: https://issues.apache.org/jira/browse/SPARK-13037
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>
> Add export/import for all estimators and transformers (which have a Scala 
> implementation) under pyspark/ml/recommendation.py. Please refer to the 
> implementation in SPARK-13032.






[jira] [Updated] (SPARK-13052) waitingApps metric doesn't show the number of apps currently in the WAITING state

2016-01-27 Thread Raafat Akkad (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raafat Akkad updated SPARK-13052:
-
Attachment: Correct waitingApps.png

> waitingApps metric doesn't show the number of apps currently in the WAITING 
> state
> -
>
> Key: SPARK-13052
> URL: https://issues.apache.org/jira/browse/SPARK-13052
> Project: Spark
>  Issue Type: Bug
>Reporter: Raafat Akkad
>Priority: Minor
> Attachments: Correct waitingApps.png
>
>
> The master.waitingApps metric appears to not show the number of apps in the 
> WAITING state.






[jira] [Resolved] (SPARK-13045) Clean up ColumnarBatch.Row and Column.Struct

2016-01-27 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13045.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10952
[https://github.com/apache/spark/pull/10952]

> Clean up ColumnarBatch.Row and Column.Struct
> 
>
> Key: SPARK-13045
> URL: https://issues.apache.org/jira/browse/SPARK-13045
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Nong Li
>Priority: Minor
> Fix For: 2.0.0
>
>
> These two started out more different but are now basically the same. Remove 
> one of them.






[jira] [Commented] (SPARK-13057) Add benchmark codes and the performance results for implemented compression schemes for InMemoryRelation

2016-01-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120822#comment-15120822
 ] 

Apache Spark commented on SPARK-13057:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/10965

> Add benchmark codes and the performance results for implemented compression 
> schemes for InMemoryRelation
> 
>
> Key: SPARK-13057
> URL: https://issues.apache.org/jira/browse/SPARK-13057
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Takeshi Yamamuro
>
> This ticket adds benchmark code for in-memory cache compression to make 
> future development and discussion smoother.






[jira] [Assigned] (SPARK-13057) Add benchmark codes and the performance results for implemented compression schemes for InMemoryRelation

2016-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13057:


Assignee: Apache Spark

> Add benchmark codes and the performance results for implemented compression 
> schemes for InMemoryRelation
> 
>
> Key: SPARK-13057
> URL: https://issues.apache.org/jira/browse/SPARK-13057
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>
> This ticket adds benchmark code for in-memory cache compression to make 
> future development and discussion smoother.






[jira] [Assigned] (SPARK-13057) Add benchmark codes and the performance results for implemented compression schemes for InMemoryRelation

2016-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13057:


Assignee: (was: Apache Spark)

> Add benchmark codes and the performance results for implemented compression 
> schemes for InMemoryRelation
> 
>
> Key: SPARK-13057
> URL: https://issues.apache.org/jira/browse/SPARK-13057
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Takeshi Yamamuro
>
> This ticket adds benchmark code for in-memory cache compression to make 
> future development and discussion smoother.






[jira] [Assigned] (SPARK-12989) Bad interaction between StarExpansion and ExtractWindowExpressions

2016-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12989:


Assignee: Apache Spark

> Bad interaction between StarExpansion and ExtractWindowExpressions
> --
>
> Key: SPARK-12989
> URL: https://issues.apache.org/jira/browse/SPARK-12989
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>
> Reported initially here: 
> http://stackoverflow.com/questions/34995376/apache-spark-window-function-with-nested-column
> {code}
> import sqlContext.implicits._
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.expressions.Window
> sql("SET spark.sql.eagerAnalysis=false") // Let us see the error even though 
> we are constructing an invalid tree
> val data = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B", "C", 
> "num")
>   .withColumn("Data", struct("A", "B", "C"))
>   .drop("A")
>   .drop("B")
>   .drop("C")
> val winSpec = Window.partitionBy("Data.A", "Data.B").orderBy($"num".desc)
> data.select($"*", max("num").over(winSpec) as "max").explain(true)
> {code}
> When you run this, the analyzer inserts invalid columns into a projection, as 
> seen below:
> {code}
> == Parsed Logical Plan ==
> 'Project [*,'max('num) windowspecdefinition('Data.A,'Data.B,'num 
> DESC,UnspecifiedFrame) AS max#64928]
> +- Project [num#64926,Data#64927]
>+- Project [C#64925,num#64926,Data#64927]
>   +- Project [B#64924,C#64925,num#64926,Data#64927]
>  +- Project 
> [A#64923,B#64924,C#64925,num#64926,struct(A#64923,B#64924,C#64925) AS 
> Data#64927]
> +- Project [_1#64919 AS A#64923,_2#64920 AS B#64924,_3#64921 AS 
> C#64925,_4#64922 AS num#64926]
>+- LocalRelation [_1#64919,_2#64920,_3#64921,_4#64922], 
> [[a,b,c,3],[c,b,a,3]]
> == Analyzed Logical Plan ==
> num: int, Data: struct, max: int
> Project [num#64926,Data#64927,max#64928]
> +- Project [num#64926,Data#64927,A#64932,B#64933,max#64928,max#64928]
>+- Window [num#64926,Data#64927,A#64932,B#64933], 
> [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFMax(num#64926)
>  windowspecdefinition(A#64932,B#64933,num#64926 DESC,RANGE BETWEEN UNBOUNDED 
> PRECEDING AND CURRENT ROW) AS max#64928], [A#64932,B#64933], [num#64926 DESC]
>   +- !Project [num#64926,Data#64927,A#64932,B#64933]
>  +- Project [num#64926,Data#64927]
> +- Project [C#64925,num#64926,Data#64927]
>+- Project [B#64924,C#64925,num#64926,Data#64927]
>   +- Project 
> [A#64923,B#64924,C#64925,num#64926,struct(A#64923,B#64924,C#64925) AS 
> Data#64927]
>  +- Project [_1#64919 AS A#64923,_2#64920 AS 
> B#64924,_3#64921 AS C#64925,_4#64922 AS num#64926]
> +- LocalRelation 
> [_1#64919,_2#64920,_3#64921,_4#64922], [[a,b,c,3],[c,b,a,3]]
> {code}






[jira] [Commented] (SPARK-12989) Bad interaction between StarExpansion and ExtractWindowExpressions

2016-01-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120677#comment-15120677
 ] 

Apache Spark commented on SPARK-12989:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/10963

> Bad interaction between StarExpansion and ExtractWindowExpressions
> --
>
> Key: SPARK-12989
> URL: https://issues.apache.org/jira/browse/SPARK-12989
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>
> Reported initially here: 
> http://stackoverflow.com/questions/34995376/apache-spark-window-function-with-nested-column
> {code}
> import sqlContext.implicits._
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.expressions.Window
> sql("SET spark.sql.eagerAnalysis=false") // Let us see the error even though 
> we are constructing an invalid tree
> val data = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B", "C", 
> "num")
>   .withColumn("Data", struct("A", "B", "C"))
>   .drop("A")
>   .drop("B")
>   .drop("C")
> val winSpec = Window.partitionBy("Data.A", "Data.B").orderBy($"num".desc)
> data.select($"*", max("num").over(winSpec) as "max").explain(true)
> {code}
> When you run this, the analyzer inserts invalid columns into a projection, as 
> seen below:
> {code}
> == Parsed Logical Plan ==
> 'Project [*,'max('num) windowspecdefinition('Data.A,'Data.B,'num 
> DESC,UnspecifiedFrame) AS max#64928]
> +- Project [num#64926,Data#64927]
>+- Project [C#64925,num#64926,Data#64927]
>   +- Project [B#64924,C#64925,num#64926,Data#64927]
>  +- Project 
> [A#64923,B#64924,C#64925,num#64926,struct(A#64923,B#64924,C#64925) AS 
> Data#64927]
> +- Project [_1#64919 AS A#64923,_2#64920 AS B#64924,_3#64921 AS 
> C#64925,_4#64922 AS num#64926]
>+- LocalRelation [_1#64919,_2#64920,_3#64921,_4#64922], 
> [[a,b,c,3],[c,b,a,3]]
> == Analyzed Logical Plan ==
> num: int, Data: struct, max: int
> Project [num#64926,Data#64927,max#64928]
> +- Project [num#64926,Data#64927,A#64932,B#64933,max#64928,max#64928]
>+- Window [num#64926,Data#64927,A#64932,B#64933], 
> [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFMax(num#64926)
>  windowspecdefinition(A#64932,B#64933,num#64926 DESC,RANGE BETWEEN UNBOUNDED 
> PRECEDING AND CURRENT ROW) AS max#64928], [A#64932,B#64933], [num#64926 DESC]
>   +- !Project [num#64926,Data#64927,A#64932,B#64933]
>  +- Project [num#64926,Data#64927]
> +- Project [C#64925,num#64926,Data#64927]
>+- Project [B#64924,C#64925,num#64926,Data#64927]
>   +- Project 
> [A#64923,B#64924,C#64925,num#64926,struct(A#64923,B#64924,C#64925) AS 
> Data#64927]
>  +- Project [_1#64919 AS A#64923,_2#64920 AS 
> B#64924,_3#64921 AS C#64925,_4#64922 AS num#64926]
> +- LocalRelation 
> [_1#64919,_2#64920,_3#64921,_4#64922], [[a,b,c,3],[c,b,a,3]]
> {code}






[jira] [Assigned] (SPARK-12989) Bad interaction between StarExpansion and ExtractWindowExpressions

2016-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12989:


Assignee: (was: Apache Spark)

> Bad interaction between StarExpansion and ExtractWindowExpressions
> --
>
> Key: SPARK-12989
> URL: https://issues.apache.org/jira/browse/SPARK-12989
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>
> Reported initially here: 
> http://stackoverflow.com/questions/34995376/apache-spark-window-function-with-nested-column
> {code}
> import sqlContext.implicits._
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.expressions.Window
> sql("SET spark.sql.eagerAnalysis=false") // Let us see the error even though 
> we are constructing an invalid tree
> val data = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B", "C", 
> "num")
>   .withColumn("Data", struct("A", "B", "C"))
>   .drop("A")
>   .drop("B")
>   .drop("C")
> val winSpec = Window.partitionBy("Data.A", "Data.B").orderBy($"num".desc)
> data.select($"*", max("num").over(winSpec) as "max").explain(true)
> {code}
> When you run this, the analyzer inserts invalid columns into a projection, as 
> seen below:
> {code}
> == Parsed Logical Plan ==
> 'Project [*,'max('num) windowspecdefinition('Data.A,'Data.B,'num 
> DESC,UnspecifiedFrame) AS max#64928]
> +- Project [num#64926,Data#64927]
>+- Project [C#64925,num#64926,Data#64927]
>   +- Project [B#64924,C#64925,num#64926,Data#64927]
>  +- Project 
> [A#64923,B#64924,C#64925,num#64926,struct(A#64923,B#64924,C#64925) AS 
> Data#64927]
> +- Project [_1#64919 AS A#64923,_2#64920 AS B#64924,_3#64921 AS 
> C#64925,_4#64922 AS num#64926]
>+- LocalRelation [_1#64919,_2#64920,_3#64921,_4#64922], 
> [[a,b,c,3],[c,b,a,3]]
> == Analyzed Logical Plan ==
> num: int, Data: struct, max: int
> Project [num#64926,Data#64927,max#64928]
> +- Project [num#64926,Data#64927,A#64932,B#64933,max#64928,max#64928]
>+- Window [num#64926,Data#64927,A#64932,B#64933], 
> [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFMax(num#64926)
>  windowspecdefinition(A#64932,B#64933,num#64926 DESC,RANGE BETWEEN UNBOUNDED 
> PRECEDING AND CURRENT ROW) AS max#64928], [A#64932,B#64933], [num#64926 DESC]
>   +- !Project [num#64926,Data#64927,A#64932,B#64933]
>  +- Project [num#64926,Data#64927]
> +- Project [C#64925,num#64926,Data#64927]
>+- Project [B#64924,C#64925,num#64926,Data#64927]
>   +- Project 
> [A#64923,B#64924,C#64925,num#64926,struct(A#64923,B#64924,C#64925) AS 
> Data#64927]
>  +- Project [_1#64919 AS A#64923,_2#64920 AS 
> B#64924,_3#64921 AS C#64925,_4#64922 AS num#64926]
> +- LocalRelation 
> [_1#64919,_2#64920,_3#64921,_4#64922], [[a,b,c,3],[c,b,a,3]]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13057) Add benchmark codes and the performance results for implemented compression schemes for InMemoryRelation

2016-01-27 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-13057:


 Summary: Add benchmark codes and the performance results for 
implemented compression schemes for InMemoryRelation
 Key: SPARK-13057
 URL: https://issues.apache.org/jira/browse/SPARK-13057
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.6.0
Reporter: Takeshi Yamamuro


This ticket adds benchmark code for in-memory cache compression to make future 
development and discussion smoother.
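As a rough illustration of the kind of measurement involved, a minimal sketch is below; `spark.sql.inMemoryColumnarStorage.compressed` is an existing SQL option, while everything else (data shape, timing harness) is illustrative and not the benchmark code this ticket proposes.

{code}
// Illustrative sketch only: time how long it takes to build the in-memory
// columnar cache with compression enabled vs. disabled (assumes a sqlContext).
def timeCacheBuild(compressed: Boolean): Long = {
  sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", compressed.toString)
  val df = sqlContext.range(0L, 10000000L).selectExpr("id", "id % 1000 AS key")
  val start = System.nanoTime()
  df.cache().count()               // forces the columnar cache to be built
  val elapsedMs = (System.nanoTime() - start) / 1000000
  df.unpersist()
  elapsedMs
}

Seq(true, false).foreach(c => println(s"compressed=$c: ${timeCacheBuild(c)} ms"))
{code}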



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13058) spark-shell start failure

2016-01-27 Thread geting (JIRA)
geting created SPARK-13058:
--

 Summary: spark-shell start failure
 Key: SPARK-13058
 URL: https://issues.apache.org/jira/browse/SPARK-13058
 Project: Spark
  Issue Type: Bug
 Environment: sh -version
GNU bash(bdsh), version 3.00.22(2)-release (x86_64-redhat-linux-gnu)
Linux version 2.6.32_1-17-0-0
(gcc version 4.4.4 20100726 (Red Hat 4.4.4-13) (GCC) )
Reporter: geting
Priority: Blocker


The following errors are shown after running sh ./bin/spark-shell:
/bin/spark-class: line 85: syntax error near unexpected token `"$ARG"'
/bin/spark-class: line 85: `  CMD+=("$ARG")'



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12977) Factoring out StreamingListener and UI to support history UI

2016-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12977:


Assignee: (was: Apache Spark)

> Factoring out StreamingListener and UI to support history UI
> 
>
> Key: SPARK-12977
> URL: https://issues.apache.org/jira/browse/SPARK-12977
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Saisai Shao
> Attachments: screenshot-1.png
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12977) Factoring out StreamingListener and UI to support history UI

2016-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12977:


Assignee: Apache Spark

> Factoring out StreamingListener and UI to support history UI
> 
>
> Key: SPARK-12977
> URL: https://issues.apache.org/jira/browse/SPARK-12977
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Saisai Shao
>Assignee: Apache Spark
> Attachments: screenshot-1.png
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12977) Factoring out StreamingListener and UI to support history UI

2016-01-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120890#comment-15120890
 ] 

Apache Spark commented on SPARK-12977:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/10966

> Factoring out StreamingListener and UI to support history UI
> 
>
> Key: SPARK-12977
> URL: https://issues.apache.org/jira/browse/SPARK-12977
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Saisai Shao
> Attachments: screenshot-1.png
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13055) SQLHistoryListener throws ClassCastException

2016-01-27 Thread Andrew Or (JIRA)
Andrew Or created SPARK-13055:
-

 Summary: SQLHistoryListener throws ClassCastException
 Key: SPARK-13055
 URL: https://issues.apache.org/jira/browse/SPARK-13055
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.0
Reporter: Andrew Or
Assignee: Andrew Or


{code}
16/01/27 18:46:28 ERROR ReplayListenerBus: Listener SQLHistoryListener threw an 
exception
java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long
at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:110)
at 
org.apache.spark.sql.execution.ui.SQLHistoryListener$$anonfun$onTaskEnd$1$$anonfun$5.apply(SQLListener.scala:334)
at 
org.apache.spark.sql.execution.ui.SQLHistoryListener$$anonfun$onTaskEnd$1$$anonfun$5.apply(SQLListener.scala:334)
at scala.Option.map(Option.scala:145)
at 
org.apache.spark.sql.execution.ui.SQLHistoryListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:334)
at 
org.apache.spark.sql.execution.ui.SQLHistoryListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:332)
{code}

SQLHistoryListener listens on SparkListenerTaskEnd events, which contain 
non-SQL accumulators as well. We try to cast all accumulators we encounter to 
Long, resulting in an error like this one.

Note: this was a problem even before internal accumulators were introduced. If  
the task used a user accumulator of a type other than Long, we would still see 
this.
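A minimal sketch of the kind of defensive conversion needed (illustrative only; the actual fix belongs in SQLListener.scala): treat an accumulator update as a SQL metric only when it is actually numeric, instead of unconditionally unboxing it to Long.

{code}
// Illustrative sketch: convert an accumulator update to Long only when it is
// numeric, and skip anything else, instead of the blind unboxToLong that
// produces the ClassCastException above.
def toLongOption(update: Any): Option[Long] = update match {
  case l: java.lang.Long    => Some(l.longValue())
  case i: java.lang.Integer => Some(i.longValue())  // the case hitting the cast error
  case _                    => None                 // non-SQL / non-numeric accumulators are ignored
}
{code}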



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13056) Map column would throw NPE if value is null

2016-01-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120816#comment-15120816
 ] 

Apache Spark commented on SPARK-13056:
--

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/10964

> Map column would throw NPE if value is null
> ---
>
> Key: SPARK-13056
> URL: https://issues.apache.org/jira/browse/SPARK-13056
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Adrian Wang
>
> Create a map like
> { "a": "somestring",
>   "b": null}
> Query like
> SELECT col["b"] FROM t1;
> NPE would be thrown.
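A self-contained reproduction sketch (assuming a SQLContext named sqlContext; the table and column names are illustrative):

{code}
// Reproduction sketch: a map column whose value for key "b" is null.
// Expected: SELECT col["b"] yields NULL; reported: a NullPointerException.
import sqlContext.implicits._

val df = Seq(Tuple1(Map("a" -> "somestring", "b" -> null))).toDF("col")
df.registerTempTable("t1")
sqlContext.sql("""SELECT col["b"] FROM t1""").show()
{code}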



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13056) Map column would throw NPE if value is null

2016-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13056:


Assignee: (was: Apache Spark)

> Map column would throw NPE if value is null
> ---
>
> Key: SPARK-13056
> URL: https://issues.apache.org/jira/browse/SPARK-13056
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Adrian Wang
>
> Create a map like
> { "a": "somestring",
>   "b": null}
> Query like
> SELECT col["b"] FROM t1;
> NPE would be thrown.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13056) Map column would throw NPE if value is null

2016-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13056:


Assignee: Apache Spark

> Map column would throw NPE if value is null
> ---
>
> Key: SPARK-13056
> URL: https://issues.apache.org/jira/browse/SPARK-13056
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Adrian Wang
>Assignee: Apache Spark
>
> Create a map like
> { "a": "somestring",
>   "b": null}
> Query like
> SELECT col["b"] FROM t1;
> NPE would be thrown.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13052) waitingApps metric doesn't show the number of apps currently in the WAITING state

2016-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13052:


Assignee: Apache Spark

> waitingApps metric doesn't show the number of apps currently in the WAITING 
> state
> -
>
> Key: SPARK-13052
> URL: https://issues.apache.org/jira/browse/SPARK-13052
> Project: Spark
>  Issue Type: Bug
>Reporter: Raafat Akkad
>Assignee: Apache Spark
>Priority: Minor
> Attachments: Correct waitingApps.png, Incorrect waitingApps.png
>
>
> The master.waitingApps metric appears to not show the number of apps in the 
> WAITING state.
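To make the intended semantics concrete, here is a toy sketch; the names below are illustrative, not the Master's actual internals.

{code}
// Toy sketch: the metric should count only apps whose state is WAITING,
// not the size of a buffer that may also contain already-running apps.
object AppState extends Enumeration { val WAITING, RUNNING, FINISHED = Value }
case class AppInfo(id: String, state: AppState.Value)

val apps = Seq(AppInfo("app-1", AppState.RUNNING), AppInfo("app-2", AppState.WAITING))
val waitingApps = apps.count(_.state == AppState.WAITING)   // 1
{code}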



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13052) waitingApps metric doesn't show the number of apps currently in the WAITING state

2016-01-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120357#comment-15120357
 ] 

Apache Spark commented on SPARK-13052:
--

User 'RaafatAkkad' has created a pull request for this issue:
https://github.com/apache/spark/pull/10959

> waitingApps metric doesn't show the number of apps currently in the WAITING 
> state
> -
>
> Key: SPARK-13052
> URL: https://issues.apache.org/jira/browse/SPARK-13052
> Project: Spark
>  Issue Type: Bug
>Reporter: Raafat Akkad
>Priority: Minor
> Attachments: Correct waitingApps.png, Incorrect waitingApps.png
>
>
> The master.waitingApps metric appears to not show the number of apps in the 
> WAITING state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13032) Basic ML Pipeline export/import functions for PySpark

2016-01-27 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-13032:
---

 Summary: Basic ML Pipeline export/import functions for PySpark
 Key: SPARK-13032
 URL: https://issues.apache.org/jira/browse/SPARK-13032
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Reporter: Yanbo Liang


* Implement Basic ML Pipeline export/import 
functions (MLWriter/MLWritable/MLReader/MLReadable) for PySpark.
* Make LinearRegression support save/load as an example. After this is merged, 
the work for the other transformers/estimators will be easy; then we can list and 
distribute the tasks to the community.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13031) Improve test coverage for whole stage codegen

2016-01-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15118838#comment-15118838
 ] 

Apache Spark commented on SPARK-13031:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/10944

> Improve test coverage for whole stage codegen
> -
>
> Key: SPARK-13031
> URL: https://issues.apache.org/jira/browse/SPARK-13031
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2016-01-27 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15118840#comment-15118840
 ] 

Hyukjin Kwon commented on SPARK-12890:
--

[~rxin] Could you confirm if this is an issue?

> Spark SQL query related to only partition fields should not scan the whole 
> data.
> 
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prakash Chockalingam
>
> I have a SQL query which references only partition fields. The query ends up 
> scanning all the data, which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13032) Basic ML Pipeline export/import functions for PySpark

2016-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13032:


Assignee: Apache Spark

> Basic ML Pipeline export/import functions for PySpark
> -
>
> Key: SPARK-13032
> URL: https://issues.apache.org/jira/browse/SPARK-13032
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>
> * Implement Basic ML Pipeline export/import 
> functions (MLWriter/MLWritable/MLReader/MLReadable) for PySpark.
> * Make LinearRegression support save/load as an example. After this is merged, 
> the work for the other transformers/estimators will be easy; then we can list and 
> distribute the tasks to the community.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13031) Improve test coverage for whole stage codegen

2016-01-27 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13031:
--

 Summary: Improve test coverage for whole stage codegen
 Key: SPARK-13031
 URL: https://issues.apache.org/jira/browse/SPARK-13031
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13032) Basic ML Pipeline export/import functions for PySpark

2016-01-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15118839#comment-15118839
 ] 

Apache Spark commented on SPARK-13032:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/10469

> Basic ML Pipeline export/import functions for PySpark
> -
>
> Key: SPARK-13032
> URL: https://issues.apache.org/jira/browse/SPARK-13032
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>
> * Implement Basic ML Pipeline export/import 
> functions (MLWriter/MLWritable/MLReader/MLReadable) for PySpark.
> * Make LinearRegression support save/load as an example. After this is merged, 
> the work for the other transformers/estimators will be easy; then we can list and 
> distribute the tasks to the community.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13031) Improve test coverage for whole stage codegen

2016-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13031:


Assignee: Davies Liu  (was: Apache Spark)

> Improve test coverage for whole stage codegen
> -
>
> Key: SPARK-13031
> URL: https://issues.apache.org/jira/browse/SPARK-13031
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13032) Basic ML Pipeline export/import functions for PySpark

2016-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13032:


Assignee: (was: Apache Spark)

> Basic ML Pipeline export/import functions for PySpark
> -
>
> Key: SPARK-13032
> URL: https://issues.apache.org/jira/browse/SPARK-13032
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>
> * Implement Basic ML Pipeline export/import 
> functions (MLWriter/MLWritable/MLReader/MLReadable) for PySpark.
> * Make LinearRegression support save/load as an example. After this is merged, 
> the work for the other transformers/estimators will be easy; then we can list and 
> distribute the tasks to the community.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13031) Improve test coverage for whole stage codegen

2016-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13031:


Assignee: Apache Spark  (was: Davies Liu)

> Improve test coverage for whole stage codegen
> -
>
> Key: SPARK-13031
> URL: https://issues.apache.org/jira/browse/SPARK-13031
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13033) PySpark regression support export/import

2016-01-27 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-13033:
---

 Summary: PySpark regression support export/import
 Key: SPARK-13033
 URL: https://issues.apache.org/jira/browse/SPARK-13033
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Reporter: Yanbo Liang
Priority: Minor


Add export/import for all estimators and transformers (which have a Scala 
implementation) under pyspark/ml/regression.py. Please refer to the implementation 
in SPARK-13032. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13042) Adding Specificity metric (= true negative rate) to MulticlassMetrics.scala

2016-01-27 Thread Daniel Marcous (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15119706#comment-15119706
 ] 

Daniel Marcous commented on SPARK-13042:


[~srowen] Forgive me for my ignorance; I read the guidelines and tried to make it 
right this time.
I also added an update to the test file to include this metric.
It seemed necessary and useful to me; I believe it's an underused but 
important metric to take into account in some cases (including some of my own).
It also seemed more logical to include it as a metric here, alongside the already 
existing metrics, instead of computing it manually in user code.

Then again, you are the expert, so I'll happily accept your decision once you 
get to one.


> Adding Specificity metric (= true negative rate) to MulticlassMetrics.scala
> ---
>
> Key: SPARK-13042
> URL: https://issues.apache.org/jira/browse/SPARK-13042
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Daniel Marcous
>Priority: Minor
>  Labels: metrics, mllib
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Adding Specificity metric (= true negative rate) to MulticlassMetrics.scala
> 1. Add specificity calculations:
> Specificity (SPC), or true negative rate:
> SPC = TN / N = TN / (TN + FP)
> False Positive Rate (FPR):
> FPR = FP / N = FP / (FP + TN) = 1 - SPC
> SPC = 1 - FPR
> 2. Add specificity tests to:
> mllib/src/test/scala/org/apache/spark/mllib/evaluation/MulticlassMetricsSuite.scala
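A minimal sketch of the calculation itself (the names below are illustrative, not the final MulticlassMetrics API):

{code}
// Specificity (true negative rate) and false positive rate from raw confusion
// counts; the real change would live in MulticlassMetrics.scala and its suite.
def specificity(tn: Double, fp: Double): Double =
  if (tn + fp == 0.0) 0.0 else tn / (tn + fp)            // SPC = TN / (TN + FP)

def falsePositiveRate(tn: Double, fp: Double): Double =
  1.0 - specificity(tn, fp)                               // FPR = 1 - SPC
{code}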



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9740) first/last aggregate NULL behavior

2016-01-27 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15119673#comment-15119673
 ] 

Yin Huai commented on SPARK-9740:
-

Thank you!

> first/last aggregate NULL behavior
> --
>
> Key: SPARK-9740
> URL: https://issues.apache.org/jira/browse/SPARK-9740
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Yin Huai
>  Labels: releasenotes
> Fix For: 1.6.0
>
>
> The FIRST/LAST aggregates implemented as part of the new UDAF interface, 
> return the first or last non-null value (if any) found. This is a departure 
> from the behavior of the old FIRST/LAST aggregates and from the 
> FIRST_VALUE/LAST_VALUE aggregates in Hive. These would return a null value, 
> if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' 
> this behavior for the old UDAF interface.
> Hive makes this behavior configurable, by adding a skipNulls flag. I would 
> suggest to do the same, and make the default behavior compatible with Hive.
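For illustration, a sketch of the behavioral difference on a column whose first value (in the intended order) is null; the two-argument `first(col, ignoreNulls)` spelling is how the DataFrame API exposes such a flag in later releases and is an assumption here, not a statement about 1.6.

{code}
// Sketch, assuming a sqlContext with implicits imported.
import org.apache.spark.sql.functions.first
import sqlContext.implicits._

val df = Seq[(Int, java.lang.Integer)]((1, null), (2, 10), (3, 20)).toDF("id", "x")

df.agg(first($"x"))                        // new UDAF behavior: first non-null value (10)
df.agg(first($"x", ignoreNulls = false))   // Hive-style FIRST_VALUE: null, since null comes first
{code}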



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13023) Check for presence of 'root' module after computing test_modules, not changed_modules

2016-01-27 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-13023.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10933
[https://github.com/apache/spark/pull/10933]

> Check for presence of 'root' module after computing test_modules, not 
> changed_modules
> -
>
> Key: SPARK-13023
> URL: https://issues.apache.org/jira/browse/SPARK-13023
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Josh Rosen
> Fix For: 2.0.0
>
>
> There's a minor bug in how we handle the `root` module in the 
> `modules_to_test()` function in `dev/run-tests.py`: since `root` now depends 
> on `build` (because every test needs to run on any build test), we now need to 
> check for the presence of root in `modules_to_test` instead of 
> `changed_modules`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13023) Check for presence of 'root' module after computing test_modules, not changed_modules

2016-01-27 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-13023:
--

Assignee: Josh Rosen

> Check for presence of 'root' module after computing test_modules, not 
> changed_modules
> -
>
> Key: SPARK-13023
> URL: https://issues.apache.org/jira/browse/SPARK-13023
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> There's a minor bug in how we handle the `root` module in the 
> `modules_to_test()` function in `dev/run-tests.py`: since `root` now depends 
> on `build` (because every test needs to run on any build test), we now need to 
> check for the presence of root in `modules_to_test` instead of 
> `changed_modules`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13058) spark-shell start fail

2016-01-27 Thread geting (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

geting updated SPARK-13058:
---
Summary: spark-shell start fail  (was: spark-shell start failure)

> spark-shell start fail
> --
>
> Key: SPARK-13058
> URL: https://issues.apache.org/jira/browse/SPARK-13058
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.0
> Environment: sh -version
> GNU bash(bdsh), version 3.00.22(2)-release (x86_64-redhat-linux-gnu)
> Linux version 2.6.32_1-17-0-0
> (gcc version 4.4.4 20100726 (Red Hat 4.4.4-13) (GCC) )
>Reporter: geting
>Priority: Blocker
>
> The following errors are shown after running sh ./bin/spark-shell:
> /bin/spark-class: line 85: syntax error near unexpected token `"$ARG"'
> /bin/spark-class: line 85: `  CMD+=("$ARG")'



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13058) spark-shell start failure

2016-01-27 Thread geting (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

geting updated SPARK-13058:
---
Affects Version/s: 1.6.0

> spark-shell start failure
> -
>
> Key: SPARK-13058
> URL: https://issues.apache.org/jira/browse/SPARK-13058
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.0
> Environment: sh -version
> GNU bash(bdsh), version 3.00.22(2)-release (x86_64-redhat-linux-gnu)
> Linux version 2.6.32_1-17-0-0
> (gcc version 4.4.4 20100726 (Red Hat 4.4.4-13) (GCC) )
>Reporter: geting
>Priority: Blocker
>
> The following errors are shown after running sh ./bin/spark-shell:
> /bin/spark-class: line 85: syntax error near unexpected token `"$ARG"'
> /bin/spark-class: line 85: `  CMD+=("$ARG")'



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11012) Canonicalize view definitions

2016-01-27 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-11012:

Issue Type: New Feature  (was: Bug)

> Canonicalize view definitions
> -
>
> Key: SPARK-11012
> URL: https://issues.apache.org/jira/browse/SPARK-11012
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>
> In SPARK-10337, we added the first step of supporting views natively. Building 
> on top of that work, we need to canonicalize the view definition. So, for a 
> SQL string SELECT a, b FROM table, we will save this text to the Hive metastore 
> as SELECT `table`.`a`, `table`.`b` FROM `currentDB`.`table`. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12818) Implement Bloom filter and count-min sketch in DataFrames

2016-01-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120975#comment-15120975
 ] 

Apache Spark commented on SPARK-12818:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/10968

> Implement Bloom filter and count-min sketch in DataFrames
> -
>
> Key: SPARK-12818
> URL: https://issues.apache.org/jira/browse/SPARK-12818
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
> Fix For: 2.0.0
>
> Attachments: BloomFilterandCount-MinSketchinSpark2.0.pdf
>
>
> This ticket tracks implementing Bloom filter and count-min sketch support in 
> DataFrames. Please see the attached design doc for more information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12987) Drop fails when columns contain dots

2016-01-27 Thread Jayadevan M (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120922#comment-15120922
 ] 

Jayadevan M commented on SPARK-12987:
-

I think the exception occurs during the select method invocation with the 
remaining columns from drop(). So we need to use 
'queryExecution.analyzed.output' instead of 'queryExecution.analyzed.schema' in 
the drop method to resolve the remaining columns (rough sketch below).
What do you think?
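A rough user-side sketch of that idea, assuming the `df` from the issue's reproduction below (this is an approximation, not the actual patch):

{code}
// Resolve the remaining columns from the analyzed plan's output and quote them
// with backticks, so a literal column name "a.c" is not re-parsed as a nested
// field reference.
val colToDrop = "a_b"
val remaining = df.queryExecution.analyzed.output
  .filterNot(_.name == colToDrop)
  .map(attr => df.col(s"`${attr.name}`"))
df.select(remaining: _*)
{code}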

> Drop fails when columns contain dots
> 
>
> Key: SPARK-12987
> URL: https://issues.apache.org/jira/browse/SPARK-12987
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>Priority: Critical
>
> {code}
> val df = Seq((1, 1)).toDF("a_b", "a.c")
> df.drop("a_b").collect()
> {code}
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve 'a.c' given input 
> columns a_b, a.c;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:121)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:125)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:125)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:105)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:133)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$withPlan(DataFrame.scala:2165)
>   at org.apache.spark.sql.DataFrame.select(DataFrame.scala:751)
>   at org.apache.spark.sql.DataFrame.drop(DataFrame.scala:1286)
> {code}



--
This message was sent by 

[jira] [Commented] (SPARK-10777) order by fails when column is aliased and projection includes windowed aggregate

2016-01-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120961#comment-15120961
 ] 

Apache Spark commented on SPARK-10777:
--

User 'kevinyu98' has created a pull request for this issue:
https://github.com/apache/spark/pull/10967

> order by fails when column is aliased and projection includes windowed 
> aggregate
> 
>
> Key: SPARK-10777
> URL: https://issues.apache.org/jira/browse/SPARK-10777
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: N Campbell
>
> This statement fails in SPARK (works fine in ORACLE, DB2 )
> select r as c1, min ( s ) over ()  as c2 from
>   ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t
> order by r
> Error: org.apache.spark.sql.AnalysisException: cannot resolve 'r' given input 
> columns c1, c2; line 3 pos 9
> SQLState:  null
> ErrorCode: 0
> Forcing the aliased column name works around the defect
> select r as c1, min ( s ) over ()  as c2 from
>   ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t
> order by c1
> These work fine
> select r as c1, min ( s ) over ()  as c2 from
>   ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t
> order by c1
> select r as c1, s  as c2 from
>   ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t
> order by r
> create table  if not exists TINT ( RNUM int , CINT int   )
>  ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' 
>  STORED AS ORC  ;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13034) PySpark ml.classification support export/import

2016-01-27 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120925#comment-15120925
 ] 

Miao Wang commented on SPARK-13034:
---

Yanbo,

Thanks for your reply! I will learn the background of this issue and follow your 
example. I will wait for SPARK-13032 to be merged.

Thanks!

Miao

> PySpark ml.classification support export/import
> ---
>
> Key: SPARK-13034
> URL: https://issues.apache.org/jira/browse/SPARK-13034
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> Add export/import for all estimators and transformers (which have a Scala 
> implementation) under pyspark/ml/classification.py. Please refer to the 
> implementation in SPARK-13032.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-13004) Support Non-Volatile Data and Operations

2016-01-27 Thread Wang, Gang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wang, Gang closed SPARK-13004.
--
Resolution: Later

Preparing actionable items, so closing it temporarily as Sean Owen suggested.

> Support Non-Volatile Data and Operations
> 
>
> Key: SPARK-13004
> URL: https://issues.apache.org/jira/browse/SPARK-13004
> Project: Spark
>  Issue Type: Epic
>  Components: Input/Output, Spark Core
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Wang, Gang
>  Labels: Non-VolatileRDD, Non-volatileComputing, RDD, performance
>
> Based on our experiments, SerDe-like operations have significant negative 
> performance impacts on the majority of industrial Spark workloads, especially 
> when the volume of the datasets is much larger than the system memory of the 
> Spark cluster available for caching, checkpointing, shuffling/dispatching, 
> data loading and storing. JVM on-heap management also degrades performance 
> under the pressure of large memory demand and frequent memory allocation/free 
> operations.
> With the trend of adopting advanced server platform technologies, e.g. large 
> memory servers, non-volatile memory and NVMe/fast SSD array storage, this 
> project focuses on adopting new features provided by the server platform for 
> Spark applications and retrofitting the use of hybrid addressable memory 
> resources onto Spark whenever possible.
> *Data Object Management*
>   * Using our non-volatile generic object programming model (NVGOP) to avoid 
> SerDe as well as reduce GC overhead.
>   * Minimizing the memory footprint by loading data lazily.
>   * Fitting naturally with RDD schemas in non-volatile and off-heap RDDs.
>   * Using non-volatile/off-heap RDDs to transform Spark datasets.
>   * Avoiding the memory caching step by means of in-place non-volatile RDD 
> operations.
>   * Avoiding checkpoints for Spark computations.
> *Data Memory Management*
>   * Managing heterogeneous memory devices as a unified hybrid memory cache 
> pool for Spark.
>   * Using non-volatile memory-like devices for Spark checkpoint and shuffle.
>   * Supporting automatic reclamation of allocated memory blocks.
>   * Providing unified memory block APIs for general-purpose memory usage.
> *Computing device management*
>   * AVX instructions, programmable FPGAs and GPUs.
> Our customized Spark prototype has shown some potential improvements.
> [https://github.com/NonVolatileComputing/spark/tree/NonVolatileRDD]
> !http://bigdata-memory.github.io/images/Spark_mlib_kmeans.png|width=300!
> !http://bigdata-memory.github.io/images/total_GC_STW_pausetime.png|width=300!
> This epic tries to further improve Spark performance with our non-volatile 
> solutions. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-13004) Support Non-Volatile Data and Operations

2016-01-27 Thread Yanping Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanping Wang updated SPARK-13004:
-
Comment: was deleted

(was: Hi, Sean, in order for this model to work, we also developed a memory 
library to support this computation model. We are planning to donate this 
library to the Apache incubator. If you are interested, I can send you a proposal 
draft. )

> Support Non-Volatile Data and Operations
> 
>
> Key: SPARK-13004
> URL: https://issues.apache.org/jira/browse/SPARK-13004
> Project: Spark
>  Issue Type: Epic
>  Components: Input/Output, Spark Core
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Wang, Gang
>  Labels: Non-VolatileRDD, Non-volatileComputing, RDD, performance
>
> Based on our experiments, SerDe-like operations have significant negative 
> performance impacts on the majority of industrial Spark workloads, especially 
> when the volume of the datasets is much larger than the system memory of the 
> Spark cluster available for caching, checkpointing, shuffling/dispatching, 
> data loading and storing. JVM on-heap management also degrades performance 
> under the pressure of large memory demand and frequent memory allocation/free 
> operations.
> With the trend of adopting advanced server platform technologies, e.g. large 
> memory servers, non-volatile memory and NVMe/fast SSD array storage, this 
> project focuses on adopting new features provided by the server platform for 
> Spark applications and retrofitting the use of hybrid addressable memory 
> resources onto Spark whenever possible.
> *Data Object Management*
>   * Using our non-volatile generic object programming model (NVGOP) to avoid 
> SerDe as well as reduce GC overhead.
>   * Minimizing the memory footprint by loading data lazily.
>   * Fitting naturally with RDD schemas in non-volatile and off-heap RDDs.
>   * Using non-volatile/off-heap RDDs to transform Spark datasets.
>   * Avoiding the memory caching step by means of in-place non-volatile RDD 
> operations.
>   * Avoiding checkpoints for Spark computations.
> *Data Memory Management*
>   * Managing heterogeneous memory devices as a unified hybrid memory cache 
> pool for Spark.
>   * Using non-volatile memory-like devices for Spark checkpoint and shuffle.
>   * Supporting automatic reclamation of allocated memory blocks.
>   * Providing unified memory block APIs for general-purpose memory usage.
> *Computing device management*
>   * AVX instructions, programmable FPGAs and GPUs.
> Our customized Spark prototype has shown some potential improvements.
> [https://github.com/NonVolatileComputing/spark/tree/NonVolatileRDD]
> !http://bigdata-memory.github.io/images/Spark_mlib_kmeans.png|width=300!
> !http://bigdata-memory.github.io/images/total_GC_STW_pausetime.png|width=300!
> This epic tries to further improve Spark performance with our non-volatile 
> solutions. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13043) Support remaining types in ColumnarBatch

2016-01-27 Thread Nong Li (JIRA)
Nong Li created SPARK-13043:
---

 Summary: Support remaining types in ColumnarBatch
 Key: SPARK-13043
 URL: https://issues.apache.org/jira/browse/SPARK-13043
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Nong Li


The current implementation throws "Not yet implemented" for some types. 
Implementing the remaining types should be straightforward.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12780) Inconsistency returning value of ML python models' properties

2016-01-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15119912#comment-15119912
 ] 

Apache Spark commented on SPARK-12780:
--

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/10950

> Inconsistency returning value of ML python models' properties
> -
>
> Key: SPARK-12780
> URL: https://issues.apache.org/jira/browse/SPARK-12780
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Xusen Yin
>Assignee: Xusen Yin
>Priority: Minor
> Fix For: 2.0.0
>
>
> In spark/python/pyspark/ml/feature.py, StringIndexerModel has a property 
> method named labels, which is different with other properties in other models.
> In StringIndexerModel:
> {code:title=StringIndexerModel|theme=FadeToGrey|linenumbers=true|language=python|firstline=0001|collapse=false}
> @property
> @since("1.5.0")
> def labels(self):
> """
> Ordered list of labels, corresponding to indices to be assigned.
> """
> return self._java_obj.labels
> {code}
> In CounterVectorizerModel (as an example):
> {code:title=CounterVectorizerModel|theme=FadeToGrey|linenumbers=true|language=python|firstline=0001|collapse=false}
> @property
> @since("1.6.0")
> def vocabulary(self):
> """
> An array of terms in the vocabulary.
> """
> return self._call_java("vocabulary")
> {code}
> In StringIndexerModel, the returned value of labels is not an array of labels 
> as expected; instead it is a py4j JavaMember.
> What's more, Pickle on the Python side cannot deserialize a Scala Array 
> normally. According to my experiments, it translates Array[String] into a 
> tuple and Array[Int] into array.array. This may cause errors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10847) Pyspark - DataFrame - Optional Metadata with `None` triggers cryptic failure

2016-01-27 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-10847.
--
   Resolution: Fixed
Fix Version/s: 1.6.1
   1.5.3
   2.0.0

Issue resolved by pull request 8969
[https://github.com/apache/spark/pull/8969]

> Pyspark - DataFrame - Optional Metadata with `None` triggers cryptic failure
> 
>
> Key: SPARK-10847
> URL: https://issues.apache.org/jira/browse/SPARK-10847
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
> Environment: Windows 7
> java version "1.8.0_60" (64bit)
> Python 3.4.x
> Standalone cluster mode (not local[n]; a full local cluster)
>Reporter: Shea Parkes
>Priority: Minor
> Fix For: 2.0.0, 1.5.3, 1.6.1
>
>
> If the optional metadata passed to `pyspark.sql.types.StructField` includes a 
> pythonic `None`, the `pyspark.SparkContext.createDataFrame` will fail with a 
> very cryptic/unhelpful error.
> Here is a minimal reproducible example:
> {code:none}
> # Assumes sc exists
> import pyspark.sql.types as types
> sqlContext = SQLContext(sc)
> literal_metadata = types.StructType([
> types.StructField(
> 'name',
> types.StringType(),
> nullable=True,
> metadata={'comment': 'From accounting system.'}
> ),
> types.StructField(
> 'age',
> types.IntegerType(),
> nullable=True,
> metadata={'comment': None}
> ),
> ])
> literal_rdd = sc.parallelize([
> ['Bob', 34],
> ['Dan', 42],
> ])
> print(literal_rdd.take(2))
> failed_dataframe = sqlContext.createDataFrame(
> literal_rdd,
> literal_metadata,
> )
> {code}
> This produces the following ~stacktrace:
> {noformat}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "", line 28, in 
>   File 
> "S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\pyspark\sql\context.py",
>  line 408, in createDataFrame
> jdf = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
>   File 
> "S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py",
>  line 538, in __call__
>   File 
> "S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\pyspark\sql\utils.py",
>  line 36, in deco
> return f(*a, **kw)
>   File 
> "S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py",
>  line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o757.applySchemaToPythonRDD.
> : java.lang.RuntimeException: Do not support type class scala.Tuple2.
>   at 
> org.apache.spark.sql.types.Metadata$$anonfun$fromJObject$1.apply(Metadata.scala:160)
>   at 
> org.apache.spark.sql.types.Metadata$$anonfun$fromJObject$1.apply(Metadata.scala:127)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.apache.spark.sql.types.Metadata$.fromJObject(Metadata.scala:127)
>   at 
> org.apache.spark.sql.types.DataType$.org$apache$spark$sql$types$DataType$$parseStructField(DataType.scala:173)
>   at 
> org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:148)
>   at 
> org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:148)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:148)
>   at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:96)
>   at org.apache.spark.sql.SQLContext.parseDataType(SQLContext.scala:961)
>   at 
> org.apache.spark.sql.SQLContext.applySchemaToPythonRDD(SQLContext.scala:970)
>   at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>   at java.lang.reflect.Method.invoke(Unknown Source)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Unknown Source)
> {noformat}
> I believe the most 

[jira] [Issue Comment Deleted] (SPARK-12591) NullPointerException using checkpointed mapWithState with KryoSerializer

2016-01-27 Thread Lin Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lin Zhao updated SPARK-12591:
-
Comment: was deleted

(was: Failed batches with mapWithState)

> NullPointerException using checkpointed mapWithState with KryoSerializer
> 
>
> Key: SPARK-12591
> URL: https://issues.apache.org/jira/browse/SPARK-12591
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: MacOSX
> Java(TM) SE Runtime Environment (build 1.8.0_20-ea-b17)
>Reporter: Jan Uyttenhove
>Assignee: Shixiong Zhu
> Fix For: 1.6.1, 2.0.0
>
> Attachments: Screen Shot 2016-01-27 at 10.09.18 AM.png
>
>
> The issue occurred after upgrading to the RC4 of Spark (streaming) 1.6.0 to 
> (re)test the new mapWithState API, after previously reporting issue 
> SPARK-11932 (https://issues.apache.org/jira/browse/SPARK-11932). 
> For initial report, see 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-streaming-1-6-0-RC4-NullPointerException-using-mapWithState-tt15830.html
> I narrowed it down to an issue unrelated to the Kafka direct stream, but, after 
> observing very unpredictable behavior as a result of changes to the Kafka 
> message format, it seems to be related specifically to Kryo serialization.
> For test case, see my modified version of the StatefulNetworkWordCount 
> example: https://gist.github.com/juyttenh/9b4a4103699a7d5f698f 
> To reproduce, use RC4 of Spark-1.6.0 and 
> - start nc:
> {code}nc -lk {code}
> - execute the supplied test case: 
> {code}bin/spark-submit --class 
> org.apache.spark.examples.streaming.StatefulNetworkWordCount --master 
> local[2] file:///some-assembly-jar localhost {code}
> Error scenario:
> - put some text in the nc console with the job running, and observe correct 
> functioning of the word count
> - kill the spark job
> - add some more text in the nc console (with the job not running)
> - restart the spark job and observe the NPE
> (you might need to repeat this a couple of times to trigger the exception)
> Here's the stacktrace: 
> {code}
> 15/12/31 11:43:47 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 5)
> java.lang.NullPointerException
>   at 
> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:103)
>   at 
> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
>   at 
> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:56)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:154)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:148)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 15/12/31 11:43:47 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 6, 
> localhost, partition 1,NODE_LOCAL, 2239 bytes)
> 15/12/31 11:43:47 INFO Executor: Running task 1.0 in stage 4.0 (TID 6)
> 15/12/31 11:43:47 WARN 

[jira] [Updated] (SPARK-12591) NullPointerException using checkpointed mapWithState with KryoSerializer

2016-01-27 Thread Lin Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lin Zhao updated SPARK-12591:
-
Attachment: Screen Shot 2016-01-27 at 10.09.18 AM.png

Failed batches with mapWithState

> NullPointerException using checkpointed mapWithState with KryoSerializer
> 
>
> Key: SPARK-12591
> URL: https://issues.apache.org/jira/browse/SPARK-12591
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: MacOSX
> Java(TM) SE Runtime Environment (build 1.8.0_20-ea-b17)
>Reporter: Jan Uyttenhove
>Assignee: Shixiong Zhu
> Fix For: 1.6.1, 2.0.0
>
> Attachments: Screen Shot 2016-01-27 at 10.09.18 AM.png
>
>
> Issue occured after upgrading to the RC4 of Spark (streaming) 1.6.0 to 
> (re)test the new mapWithState API, after previously reporting issue 
> SPARK-11932 (https://issues.apache.org/jira/browse/SPARK-11932). 
> For initial report, see 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-streaming-1-6-0-RC4-NullPointerException-using-mapWithState-tt15830.html
> Narrowed it down to an issue unrelated to Kafka directstream, but, after 
> observing very unpredictable behavior as a result of changes to the Kafka 
> messages format, it seems to be related to KryoSerialization in specific.
> For test case, see my modified version of the StatefulNetworkWordCount 
> example: https://gist.github.com/juyttenh/9b4a4103699a7d5f698f 
> To reproduce, use RC4 of Spark-1.6.0 and 
> - start nc:
> {code}nc -lk {code}
> - execute the supplied test case: 
> {code}bin/spark-submit --class 
> org.apache.spark.examples.streaming.StatefulNetworkWordCount --master 
> local[2] file:///some-assembly-jar localhost {code}
> Error scenario:
> - put some text in the nc console with the job running, and observe correct 
> functioning of the word count
> - kill the spark job
> - add some more text in the nc console (with the job not running)
> - restart the spark job and observe the NPE
> (you might need to repeat this a couple of times to trigger the exception)
> Here's the stacktrace: 
> {code}
> 15/12/31 11:43:47 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 5)
> java.lang.NullPointerException
>   at 
> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:103)
>   at 
> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
>   at 
> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:56)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:154)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:148)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 15/12/31 11:43:47 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 6, 
> localhost, partition 1,NODE_LOCAL, 2239 bytes)
> 15/12/31 11:43:47 INFO Executor: Running task 1.0 in stage 4.0 (TID 

[jira] [Commented] (SPARK-12591) NullPointerException using checkpointed mapWithState with KryoSerializer

2016-01-27 Thread Lin Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15119907#comment-15119907
 ] 

Lin Zhao commented on SPARK-12591:
--

I'm not sure of the timing of the checkpointing. The error seems to affect 
several early batches then go away. Not sure what to make of it.

> NullPointerException using checkpointed mapWithState with KryoSerializer
> 
>
> Key: SPARK-12591
> URL: https://issues.apache.org/jira/browse/SPARK-12591
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: MacOSX
> Java(TM) SE Runtime Environment (build 1.8.0_20-ea-b17)
>Reporter: Jan Uyttenhove
>Assignee: Shixiong Zhu
> Fix For: 1.6.1, 2.0.0
>
>
> Issue occured after upgrading to the RC4 of Spark (streaming) 1.6.0 to 
> (re)test the new mapWithState API, after previously reporting issue 
> SPARK-11932 (https://issues.apache.org/jira/browse/SPARK-11932). 
> For initial report, see 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-streaming-1-6-0-RC4-NullPointerException-using-mapWithState-tt15830.html
> Narrowed it down to an issue unrelated to Kafka directstream, but, after 
> observing very unpredictable behavior as a result of changes to the Kafka 
> messages format, it seems to be related to KryoSerialization in specific.
> For test case, see my modified version of the StatefulNetworkWordCount 
> example: https://gist.github.com/juyttenh/9b4a4103699a7d5f698f 
> To reproduce, use RC4 of Spark-1.6.0 and 
> - start nc:
> {code}nc -lk {code}
> - execute the supplied test case: 
> {code}bin/spark-submit --class 
> org.apache.spark.examples.streaming.StatefulNetworkWordCount --master 
> local[2] file:///some-assembly-jar localhost {code}
> Error scenario:
> - put some text in the nc console with the job running, and observe correct 
> functioning of the word count
> - kill the spark job
> - add some more text in the nc console (with the job not running)
> - restart the spark job and observe the NPE
> (you might need to repeat this a couple of times to trigger the exception)
> Here's the stacktrace: 
> {code}
> 15/12/31 11:43:47 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 5)
> java.lang.NullPointerException
>   at 
> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:103)
>   at 
> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
>   at 
> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:56)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:154)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:148)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 15/12/31 11:43:47 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 6, 
> localhost, partition 1,NODE_LOCAL, 2239 bytes)
> 15/12/31 11:43:47 INFO Executor: Running task 

[jira] [Comment Edited] (SPARK-12591) NullPointerException using checkpointed mapWithState with KryoSerializer

2016-01-27 Thread Lin Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15119907#comment-15119907
 ] 

Lin Zhao edited comment on SPARK-12591 at 1/27/16 6:11 PM:
---

I'm not sure of the timing of the checkpointing. The error seems to affect 
several early batches then go away. Not sure what to make of it. Attached 
screenshot of the streaming page.


was (Author: l...@exabeam.com):
I'm not sure of the timing of the checkpointing. The error seems to affect 
several early batches then go away. Not sure what to make of it.

> NullPointerException using checkpointed mapWithState with KryoSerializer
> 
>
> Key: SPARK-12591
> URL: https://issues.apache.org/jira/browse/SPARK-12591
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: MacOSX
> Java(TM) SE Runtime Environment (build 1.8.0_20-ea-b17)
>Reporter: Jan Uyttenhove
>Assignee: Shixiong Zhu
> Fix For: 1.6.1, 2.0.0
>
> Attachments: Screen Shot 2016-01-27 at 10.09.18 AM.png
>
>
> Issue occured after upgrading to the RC4 of Spark (streaming) 1.6.0 to 
> (re)test the new mapWithState API, after previously reporting issue 
> SPARK-11932 (https://issues.apache.org/jira/browse/SPARK-11932). 
> For initial report, see 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-streaming-1-6-0-RC4-NullPointerException-using-mapWithState-tt15830.html
> Narrowed it down to an issue unrelated to Kafka directstream, but, after 
> observing very unpredictable behavior as a result of changes to the Kafka 
> messages format, it seems to be related to KryoSerialization in specific.
> For test case, see my modified version of the StatefulNetworkWordCount 
> example: https://gist.github.com/juyttenh/9b4a4103699a7d5f698f 
> To reproduce, use RC4 of Spark-1.6.0 and 
> - start nc:
> {code}nc -lk {code}
> - execute the supplied test case: 
> {code}bin/spark-submit --class 
> org.apache.spark.examples.streaming.StatefulNetworkWordCount --master 
> local[2] file:///some-assembly-jar localhost {code}
> Error scenario:
> - put some text in the nc console with the job running, and observe correct 
> functioning of the word count
> - kill the spark job
> - add some more text in the nc console (with the job not running)
> - restart the spark job and observe the NPE
> (you might need to repeat this a couple of times to trigger the exception)
> Here's the stacktrace: 
> {code}
> 15/12/31 11:43:47 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 5)
> java.lang.NullPointerException
>   at 
> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:103)
>   at 
> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
>   at 
> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:56)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:154)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:148)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> 

[jira] [Commented] (SPARK-12591) NullPointerException using checkpointed mapWithState with KryoSerializer

2016-01-27 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15119888#comment-15119888
 ] 

Shixiong Zhu commented on SPARK-12591:
--

By the way, just want to confirm that: the error doesn't always happen for the 
batches that need to be checkpointed, and the checkpoint just sometimes fails. 
Right?

> NullPointerException using checkpointed mapWithState with KryoSerializer
> 
>
> Key: SPARK-12591
> URL: https://issues.apache.org/jira/browse/SPARK-12591
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: MacOSX
> Java(TM) SE Runtime Environment (build 1.8.0_20-ea-b17)
>Reporter: Jan Uyttenhove
>Assignee: Shixiong Zhu
> Fix For: 1.6.1, 2.0.0
>
>
> Issue occured after upgrading to the RC4 of Spark (streaming) 1.6.0 to 
> (re)test the new mapWithState API, after previously reporting issue 
> SPARK-11932 (https://issues.apache.org/jira/browse/SPARK-11932). 
> For initial report, see 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-streaming-1-6-0-RC4-NullPointerException-using-mapWithState-tt15830.html
> Narrowed it down to an issue unrelated to Kafka directstream, but, after 
> observing very unpredictable behavior as a result of changes to the Kafka 
> messages format, it seems to be related to KryoSerialization in specific.
> For test case, see my modified version of the StatefulNetworkWordCount 
> example: https://gist.github.com/juyttenh/9b4a4103699a7d5f698f 
> To reproduce, use RC4 of Spark-1.6.0 and 
> - start nc:
> {code}nc -lk {code}
> - execute the supplied test case: 
> {code}bin/spark-submit --class 
> org.apache.spark.examples.streaming.StatefulNetworkWordCount --master 
> local[2] file:///some-assembly-jar localhost {code}
> Error scenario:
> - put some text in the nc console with the job running, and observe correct 
> functioning of the word count
> - kill the spark job
> - add some more text in the nc console (with the job not running)
> - restart the spark job and observe the NPE
> (you might need to repeat this a couple of times to trigger the exception)
> Here's the stacktrace: 
> {code}
> 15/12/31 11:43:47 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 5)
> java.lang.NullPointerException
>   at 
> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:103)
>   at 
> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
>   at 
> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:56)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:154)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:148)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 15/12/31 11:43:47 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 6, 
> localhost, partition 1,NODE_LOCAL, 2239 bytes)
> 15/12/31 

[jira] [Updated] (SPARK-10847) Pyspark - DataFrame - Optional Metadata with `None` triggers cryptic failure

2016-01-27 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10847:
-
Assignee: Jason C Lee

> Pyspark - DataFrame - Optional Metadata with `None` triggers cryptic failure
> 
>
> Key: SPARK-10847
> URL: https://issues.apache.org/jira/browse/SPARK-10847
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
> Environment: Windows 7
> java version "1.8.0_60" (64bit)
> Python 3.4.x
> Standalone cluster mode (not local[n]; a full local cluster)
>Reporter: Shea Parkes
>Assignee: Jason C Lee
>Priority: Minor
> Fix For: 1.5.3, 1.6.1, 2.0.0
>
>
> If the optional metadata passed to `pyspark.sql.types.StructField` includes a 
> pythonic `None`, the `pyspark.SparkContext.createDataFrame` will fail with a 
> very cryptic/unhelpful error.
> Here is a minimal reproducible example:
> {code:none}
> # Assumes sc exists
> import pyspark.sql.types as types
> sqlContext = SQLContext(sc)
> literal_metadata = types.StructType([
> types.StructField(
> 'name',
> types.StringType(),
> nullable=True,
> metadata={'comment': 'From accounting system.'}
> ),
> types.StructField(
> 'age',
> types.IntegerType(),
> nullable=True,
> metadata={'comment': None}
> ),
> ])
> literal_rdd = sc.parallelize([
> ['Bob', 34],
> ['Dan', 42],
> ])
> print(literal_rdd.take(2))
> failed_dataframe = sqlContext.createDataFrame(
> literal_rdd,
> literal_metadata,
> )
> {code}
> This produces the following ~stacktrace:
> {noformat}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "", line 28, in 
>   File 
> "S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\pyspark\sql\context.py",
>  line 408, in createDataFrame
> jdf = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
>   File 
> "S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py",
>  line 538, in __call__
>   File 
> "S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\pyspark\sql\utils.py",
>  line 36, in deco
> return f(*a, **kw)
>   File 
> "S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py",
>  line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o757.applySchemaToPythonRDD.
> : java.lang.RuntimeException: Do not support type class scala.Tuple2.
>   at 
> org.apache.spark.sql.types.Metadata$$anonfun$fromJObject$1.apply(Metadata.scala:160)
>   at 
> org.apache.spark.sql.types.Metadata$$anonfun$fromJObject$1.apply(Metadata.scala:127)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.apache.spark.sql.types.Metadata$.fromJObject(Metadata.scala:127)
>   at 
> org.apache.spark.sql.types.DataType$.org$apache$spark$sql$types$DataType$$parseStructField(DataType.scala:173)
>   at 
> org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:148)
>   at 
> org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:148)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:148)
>   at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:96)
>   at org.apache.spark.sql.SQLContext.parseDataType(SQLContext.scala:961)
>   at 
> org.apache.spark.sql.SQLContext.applySchemaToPythonRDD(SQLContext.scala:970)
>   at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>   at java.lang.reflect.Method.invoke(Unknown Source)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Unknown Source)
> {noformat}
> I believe the most important line of the traceback is this one:
> {noformat}
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> 

[jira] [Created] (SPARK-13044) saveAsTextFile() failed to save file to AWS S3 EU-Frankfort

2016-01-27 Thread Xin Ren (JIRA)
Xin Ren created SPARK-13044:
---

 Summary: saveAsTextFile() failed to save file to AWS S3 
EU-Frankfort
 Key: SPARK-13044
 URL: https://issues.apache.org/jira/browse/SPARK-13044
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.4.0
 Environment: CentOS
Reporter: Xin Ren


I have two clusters deployed, US and EU-Frankfort, with the same configs on AWS.

The application in EU-Frankfort cannot save data to EU-s3, but the US one can 
save to US-s3.

We checked and found that EU-Frankfort supports Signature Version 4 only: 
http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region

Code I'm using:
{code:java}
val s3WriteEndpoint = "s3n://access_key:secret_key@bucket_name/data/12345"
rdd.saveAsTextFile(s3WriteEndpoint)
{code}

From this I guess saveAsTextFile() is using Signature Version 2? How can 
Version 4 be supported?

I tried to dig into the code:
https://github.com/apache/spark/blob/f14922cff84b1e0984ba4597d764615184126bdc/core/src/main/scala/org/apache/spark/rdd/RDD.scala
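
For what it's worth, a rough and untested sketch of one direction to try, assuming 
the cluster's Hadoop build ships the s3a connector: point s3a at the eu-central-1 
endpoint, which accepts Signature Version 4 only. Whether this is sufficient depends 
on the bundled Hadoop and AWS SDK versions, so treat the property names below as 
something to verify, not a confirmed fix.

{code}
# Hypothetical workaround sketch (PySpark), not a confirmed fix:
# switch from s3n:// to s3a:// and set the region-specific endpoint.
# Assumes an existing SparkContext `sc` and RDD `rdd`.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "access_key")
hadoop_conf.set("fs.s3a.secret.key", "secret_key")
hadoop_conf.set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")

rdd.saveAsTextFile("s3a://bucket_name/data/12345")
{code}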



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13004) Support Non-Volatile Data and Operations

2016-01-27 Thread Wang, Gang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15119970#comment-15119970
 ] 

Wang, Gang commented on SPARK-13004:


I have closed it according to your advice. Thanks.

> Support Non-Volatile Data and Operations
> 
>
> Key: SPARK-13004
> URL: https://issues.apache.org/jira/browse/SPARK-13004
> Project: Spark
>  Issue Type: Epic
>  Components: Input/Output, Spark Core
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Wang, Gang
>  Labels: Non-VolatileRDD, Non-volatileComputing, RDD, performance
>
> Based on our experiments, SerDe-like operations have a significant 
> negative performance impact on the majority of industrial Spark workloads, 
> especially when the volume of the datasets is much larger than the Spark 
> cluster's system memory available for caching, checkpointing, 
> shuffling/dispatching, data loading and storing. The JVM on-heap management 
> also degrades performance when under pressure from large memory demand and 
> frequent memory allocation/free operations.
> With the trend of adopting advanced server platform technologies, e.g. large 
> memory servers, non-volatile memory and NVMe/fast SSD array storage, this 
> project focuses on adopting new features provided by the server platform for 
> Spark applications and retrofitting the utilization of hybrid addressable 
> memory resources onto Spark whenever possible.
> *Data Object Management*
>   * Using our non-volatile generic object programming model (NVGOP) to avoid 
> SerDe as well as reduce GC overhead.
>   * Minimizing memory footprint to load data lazily.
>   * Being naturally fit for RDD schemas in non-volatile RDD and off-heap RDD.
>   * Using non-volatile/off-heap RDDs to transform Spark datasets.
>   * Avoiding the memory caching part by the way of in-place non-volatile RDD 
> operations.
>   * Avoiding the checkpoints for Spark computing.
> *Data Memory Management*
>   
>   * Managing heterogeneous memory devices as a unified hybrid memory cache 
> pool for Spark.
>   * Using non-volatile memory-like devices for Spark checkpoint and shuffle.
>   * Supporting automatic reclamation of allocated memory blocks.
>   * Providing unified memory block APIs for general-purpose memory usage.
>   
> *Computing device management*
>   * AVX instructions, programmable FPGA and GPU.
>   
> Our customized Spark prototype has shown some potential improvements.
> [https://github.com/NonVolatileComputing/spark/tree/NonVolatileRDD]
> !http://bigdata-memory.github.io/images/Spark_mlib_kmeans.png|width=300!
> !http://bigdata-memory.github.io/images/total_GC_STW_pausetime.png|width=300!
>   
> This epic tries to further improve the Spark performance with our 
> non-volatile solutions. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13022) Shade jackson core

2016-01-27 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu resolved SPARK-13022.

Resolution: Won't Fix

> Shade jackson core
> --
>
> Key: SPARK-13022
> URL: https://issues.apache.org/jira/browse/SPARK-13022
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: Ted Yu
>Priority: Minor
>
> See the thread for background information:
> http://search-hadoop.com/m/q3RTtYuufRO7LLG
> This issue proposes to shade com.fasterxml.jackson.core as 
> org.spark-project.jackson



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13046) Partitioning looks broken in 1.6

2016-01-27 Thread Julien Baley (JIRA)
Julien Baley created SPARK-13046:


 Summary: Partitioning looks broken in 1.6
 Key: SPARK-13046
 URL: https://issues.apache.org/jira/browse/SPARK-13046
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: Julien Baley


Hello,

I have a list of files in s3:

s3://bucket/some_path/date_received=2016-01-13/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some
 parquet files}
s3://bucket/some_path/date_received=2016-01-14/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some
 parquet files}
s3://bucket/some_path/date_received=2016-01-15/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some
 parquet files}

Until 1.5.2, it all worked well and passing s3://bucket/some_path/ (the same 
for the three lines) would correctly identify 2 pairs of key/value, one 
`date_received` and one `fingerprint`.

From 1.6.0, I get the following exception:
assertion failed: Conflicting directory structures detected. Suspicious paths
s3://bucket/some_path/date_received=2016-01-13
s3://bucket/some_path/date_received=2016-01-14
s3://bucket/some_path/date_received=2016-01-15

That is to say, the partitioning code now fails to identify 
date_received=2016-01-13 as a key/value pair.

I can see that there has been some activity on 
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala
 recently, so that seems related (especially the commits 
https://github.com/apache/spark/commit/7b5d9051cf91c099458d092a6705545899134b3b 
 and 
https://github.com/apache/spark/commit/de289bf279e14e47859b5fbcd70e97b9d0759f14 
).
If I read the tests added in those commits correctly:
- they don't seem to actually test the return value, only that it doesn't crash
- they only test cases where the s3 path contains 1 key/value pair.

This is problematic for us as we're trying to migrate all of our spark services 
to 1.6.0 and this bug is a real blocker. I know it's possible to force a 
'union', but I'd rather not do that if the bug can be fixed.

Any question, please shoot.
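
For reference, here is a minimal PySpark sketch of the 1.5.2 behaviour being 
described (paths as above; running it for real obviously needs reachable S3 
credentials and data):

{code}
# Expected on 1.5.2: both directory levels under some_path/ are discovered
# as partition columns when reading the parent path.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="partition-discovery-check")
sqlContext = SQLContext(sc)

df = sqlContext.read.parquet("s3://bucket/some_path/")
df.printSchema()
# Expected (1.5.2): the schema ends with the two partition columns inferred
# from the directory names, e.g.
#  |-- date_received: ...
#  |-- fingerprint: ...
#
# On 1.6.0 the same read instead fails with the
# "Conflicting directory structures detected" assertion.
{code}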



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12591) NullPointerException using checkpointed mapWithState with KryoSerializer

2016-01-27 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120023#comment-15120023
 ] 

Shixiong Zhu commented on SPARK-12591:
--

Where is LiveSequenceMergeTracker, in a separate jar or the application jar? I 
tried both and didn't fail in yarn-client or yarn-cluster mode. It's great to 
hear that it worked with branch-1.6 :)

> NullPointerException using checkpointed mapWithState with KryoSerializer
> 
>
> Key: SPARK-12591
> URL: https://issues.apache.org/jira/browse/SPARK-12591
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: MacOSX
> Java(TM) SE Runtime Environment (build 1.8.0_20-ea-b17)
>Reporter: Jan Uyttenhove
>Assignee: Shixiong Zhu
> Fix For: 1.6.1, 2.0.0
>
> Attachments: Screen Shot 2016-01-27 at 10.09.18 AM.png
>
>
> Issue occured after upgrading to the RC4 of Spark (streaming) 1.6.0 to 
> (re)test the new mapWithState API, after previously reporting issue 
> SPARK-11932 (https://issues.apache.org/jira/browse/SPARK-11932). 
> For initial report, see 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-streaming-1-6-0-RC4-NullPointerException-using-mapWithState-tt15830.html
> Narrowed it down to an issue unrelated to Kafka directstream, but, after 
> observing very unpredictable behavior as a result of changes to the Kafka 
> messages format, it seems to be related to KryoSerialization in specific.
> For test case, see my modified version of the StatefulNetworkWordCount 
> example: https://gist.github.com/juyttenh/9b4a4103699a7d5f698f 
> To reproduce, use RC4 of Spark-1.6.0 and 
> - start nc:
> {code}nc -lk {code}
> - execute the supplied test case: 
> {code}bin/spark-submit --class 
> org.apache.spark.examples.streaming.StatefulNetworkWordCount --master 
> local[2] file:///some-assembly-jar localhost {code}
> Error scenario:
> - put some text in the nc console with the job running, and observe correct 
> functioning of the word count
> - kill the spark job
> - add some more text in the nc console (with the job not running)
> - restart the spark job and observe the NPE
> (you might need to repeat this a couple of times to trigger the exception)
> Here's the stacktrace: 
> {code}
> 15/12/31 11:43:47 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 5)
> java.lang.NullPointerException
>   at 
> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:103)
>   at 
> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
>   at 
> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:56)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:154)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:148)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 15/12/31 11:43:47 INFO TaskSetManager: 

[jira] [Updated] (SPARK-13046) Partitioning looks broken in 1.6

2016-01-27 Thread Julien Baley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Baley updated SPARK-13046:
-
Description: 
Hello,

I have a list of files in s3:

s3://bucket/some_path/date_received=2016-01-13/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some
 parquet files}
s3://bucket/some_path/date_received=2016-01-14/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some
 parquet files}
s3://bucket/some_path/date_received=2016-01-15/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some
 parquet files}

Until 1.5.2, it all worked well and passing s3://bucket/some_path/ (the same 
for the three lines) would correctly identify 2 pairs of key/value, one 
`date_received` and one `fingerprint`.

From 1.6.0, I get the following exception:
assertion failed: Conflicting directory structures detected. Suspicious paths
s3://bucket/some_path/date_received=2016-01-13
s3://bucket/some_path/date_received=2016-01-14
s3://bucket/some_path/date_received=2016-01-15

That is to say, the partitioning code now fails to identify 
date_received=2016-01-13 as a key/value pair.

I can see that there has been some activity on 
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala
 recently, so that seems related (especially the commits 
https://github.com/apache/spark/commit/7b5d9051cf91c099458d092a6705545899134b3b 
 and 
https://github.com/apache/spark/commit/de289bf279e14e47859b5fbcd70e97b9d0759f14 
).
If I read correctly the tests added in those commits:
-they don't seem to actually test the return value, only that it doesn't crash
-they only test cases where the s3 path contain 1 key/value pair (which 
otherwise would catch the bug)

This is problematic for us as we're trying to migrate all of our spark services 
to 1.6.0 and this bug is a real blocker. I know it's possible to force a 
'union', but I'd rather not do that if the bug can be fixed.

Any question, please shoot.

  was:
Hello,

I have a list of files in s3:

s3://bucket/some_path/date_received=2016-01-13/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some
 parquet files}
s3://bucket/some_path/date_received=2016-01-14/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some
 parquet files}
s3://bucket/some_path/date_received=2016-01-15/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some
 parquet files}

Until 1.5.2, it all worked well and passing s3://bucket/some_path/ (the same 
for the three lines) would correctly identify 2 pairs of key/value, one 
`date_received` and one `fingerprint`.

From 1.6.0, I get the following exception:
assertion failed: Conflicting directory structures detected. Suspicious paths
s3://bucket/some_path/date_received=2016-01-13
s3://bucket/some_path/date_received=2016-01-14
s3://bucket/some_path/date_received=2016-01-15

That is to say, the partitioning code now fails to identify 
date_received=2016-01-13 as a key/value pair.

I can see that there has been some activity on 
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala
 recently, so that seems related (especially the commits 
https://github.com/apache/spark/commit/7b5d9051cf91c099458d092a6705545899134b3b 
 and 
https://github.com/apache/spark/commit/de289bf279e14e47859b5fbcd70e97b9d0759f14 
).
If I read correctly the tests added in those commits:
-they don't seem to actually test the return value, only that it doesn't crash
-they only test cases where the s3 path contain 1 key/value pair.

This is problematic for us as we're trying to migrate all of our spark services 
to 1.6.0 and this bug is a real blocker. I know it's possible to force a 
'union', but I'd rather not do that if the bug can be fixed.

Any question, please shoot.


> Partitioning looks broken in 1.6
> 
>
> Key: SPARK-13046
> URL: https://issues.apache.org/jira/browse/SPARK-13046
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Julien Baley
>
> Hello,
> I have a list of files in s3:
> s3://bucket/some_path/date_received=2016-01-13/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some
>  parquet files}
> s3://bucket/some_path/date_received=2016-01-14/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some
>  parquet files}
> s3://bucket/some_path/date_received=2016-01-15/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some
>  parquet files}
> Until 1.5.2, it all worked well and passing s3://bucket/some_path/ (the same 
> for the three lines) would correctly identify 2 pairs of key/value, one 
> `date_received` and one `fingerprint`.
> From 1.6.0, I get the following exception:
> assertion failed: Conflicting directory structures detected. Suspicious paths
> s3://bucket/some_path/date_received=2016-01-13
> s3://bucket/some_path/date_received=2016-01-14
> 

[jira] [Updated] (SPARK-13044) saveAsTextFile() failed to save file to AWS S3 EU-Frankfort

2016-01-27 Thread Xin Ren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Ren updated SPARK-13044:

Description: 
I have two clusters deployed: US and EU-Frankfort with the same configs on AWS. 

And the application in EU-Frankfort cannot save data to EU-Frankfort-s3, but US 
one can save to US-s3.

And I checked and found that EU-Frankfort supports Signature Version 4 only: 
http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region

Code I'm using:
{code:java}
val s3WriteEndpoint = "s3n://access_key:secret_key@bucket_name/data/12345"
rdd.saveAsTextFile(s3WriteEndpoint)
{code}

So from my issue I guess saveAsTextFile() is using Signature Version 2? How to 
support Version 4?

I tried to dig into code
https://github.com/apache/spark/blob/f14922cff84b1e0984ba4597d764615184126bdc/core/src/main/scala/org/apache/spark/rdd/RDD.scala

  was:
I have two clusters deployed: US and EU-Frankfort with the same configs on AWS. 

And the application in EU-Frankfort cannot save data to EU-s3, but US one can 
save to US-s3.

And we checked and found that EU-Frankfort supports Signature Version 4 only: 
http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region

Code I'm using:
{code:java}
val s3WriteEndpoint = "s3n://access_key:secret_key@bucket_name/data/12345"
rdd.saveAsTextFile(s3WriteEndpoint)
{code}

So from my issue I guess saveAsTextFile() is using Signature Version 2? How to 
support Version 4?

I tried to dig into code
https://github.com/apache/spark/blob/f14922cff84b1e0984ba4597d764615184126bdc/core/src/main/scala/org/apache/spark/rdd/RDD.scala


> saveAsTextFile() failed to save file to AWS S3 EU-Frankfort
> ---
>
> Key: SPARK-13044
> URL: https://issues.apache.org/jira/browse/SPARK-13044
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.4.0
> Environment: CentOS
>Reporter: Xin Ren
>  Labels: aws-s3
>
> I have two clusters deployed: US and EU-Frankfort with the same configs on 
> AWS. 
> And the application in EU-Frankfort cannot save data to EU-Frankfort-s3, but 
> US one can save to US-s3.
> And I checked and found that EU-Frankfort supports Signature Version 4 only: 
> http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region
> Code I'm using:
> {code:java}
> val s3WriteEndpoint = "s3n://access_key:secret_key@bucket_name/data/12345"
> rdd.saveAsTextFile(s3WriteEndpoint)
> {code}
> So from my issue I guess saveAsTextFile() is using Signature Version 2? How 
> to support Version 4?
> I tried to dig into code
> https://github.com/apache/spark/blob/f14922cff84b1e0984ba4597d764615184126bdc/core/src/main/scala/org/apache/spark/rdd/RDD.scala



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12944) CrossValidator doesn't accept a Pipeline as an estimator

2016-01-27 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15119976#comment-15119976
 ] 

Seth Hendrickson commented on SPARK-12944:
--

Based on what I have seen, I think the fix is more involved... could you 
elaborate on your proposed solution? 

We could "fix" it in addGrid by checking that all elements of {{values}} are 
instances of {{param.expectedType}}, but I don't think this really gets at the 
root of the problem. First, this would still throw an error with ints that are 
expected to be floats, which is unsatisfactory since [PR 
9581|https://github.com/apache/spark/pull/9581] adds a type conversion 
mechanism specifically for this purpose. Second, the real problem is that 
params passed through the {{Estimator.fit}} and {{Transformer.transform}} 
methods use the {{Params.copy}} method to circumvent the set methods and type 
checking entirely! By modifying the {{_paramMap}} dictionary directly, the copy 
method allows ml pipeline elements to contain params that have no context (like 
a HashingTF containing a Naive Bayes smoothing parameter in its param map as in 
the example above).

I think the correct way to do this is to change the {{copy}} method to call the 
{{_set}} function instead of directly modifying {{_paramMap}}. That way we 
ensure that the {{_paramMap}} only contains parameters that belong to that 
class and that type checking is performed. I am working on a PR with the method 
I described, but I am open to feedback. Appreciate any thoughts on this.
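
To make that direction concrete, here is a toy sketch (illustration only, not the 
actual pyspark code; the class, its _params dict and the validation shown are 
simplified stand-ins) of a copy() that routes extra params through _set() so that 
ownership and type checks still run:

{code}
import copy as _copy

class ToyParams(object):
    """Simplified stand-in for pyspark.ml.param.Params (illustration only)."""

    def __init__(self):
        self._paramMap = {}
        # param name -> type converter that the param expects
        self._params = {"smoothing": float}

    def _set(self, **kwargs):
        # Ownership check and type conversion happen here, in one place.
        for name, value in kwargs.items():
            if name not in self._params:
                raise ValueError("Param %r does not belong to %s"
                                 % (name, type(self).__name__))
            self._paramMap[name] = self._params[name](value)
        return self

    def copy(self, extra=None):
        # Route `extra` through _set() instead of writing _paramMap directly,
        # so the copy can never hold params it does not own.
        that = _copy.copy(self)
        that._paramMap = dict(self._paramMap)
        if extra:
            that._set(**extra)
        return that

nb = ToyParams()
nb2 = nb.copy({"smoothing": 1})          # the int 1 is converted to 1.0
assert nb2._paramMap == {"smoothing": 1.0}
# nb.copy({"vocabSize": 10})             # would raise: not a ToyParams param
{code}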

> CrossValidator doesn't accept a Pipeline as an estimator
> 
>
> Key: SPARK-12944
> URL: https://issues.apache.org/jira/browse/SPARK-12944
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 1.6.0
> Environment: spark-1.6.0-bin-hadoop2.6
> Python 3.4.4 :: Anaconda 2.4.1
>Reporter: John Hogue
>Priority: Minor
>
> Pipeline is supposed to act as an estimator which CrossValidator currently 
> throws error.
> {code}
> from pyspark.ml.evaluation import MulticlassClassificationEvaluator
> from pyspark.ml.tuning import ParamGridBuilder
> from pyspark.ml.tuning import CrossValidator
> # Configure an ML pipeline, which consists of tree stages: tokenizer, 
> hashingTF, and nb.
> tokenizer = Tokenizer(inputCol="text", outputCol="words")
> hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
> nb = NaiveBayes()
> pipeline = Pipeline(stages=[tokenizer, hashingTF, nb])
> paramGrid = ParamGridBuilder().addGrid(nb.smoothing, [0, 1]).build()
> cv = CrossValidator(estimator=pipeline, 
> estimatorParamMaps=paramGrid, 
> evaluator=MulticlassClassificationEvaluator(), 
> numFolds=4)
> cvModel = cv.fit(training_df)
> {code}
> Sample dataset can be found here:
> https://github.com/dreyco676/nlp_spark/blob/master/data.zip
> The file can be converted to a DataFrame with:
> {code}
> # Load precleaned training set
> training_rdd = sc.textFile("data/clean_training.txt")
> parts_rdd = training_rdd.map(lambda l: l.split("\t"))
> # Filter bad rows out
> garantee_col_rdd = parts_rdd.filter(lambda l: len(l) == 3)
> typed_rdd = garantee_col_rdd.map(lambda p: (p[0], p[1], float(p[2])))
> # Create DataFrame
> training_df = sqlContext.createDataFrame(typed_rdd, ["id", "text", "label"])
> {code}
> Running the pipeline throws the following stack trace:
> {code}
> ---Py4JJavaError
>  Traceback (most recent call 
> last) in ()
>  17 numFolds=4)
>  18 
> ---> 19 cvModel = cv.fit(training_df)
> /Users/dreyco676/spark-1.6.0-bin-hadoop2.6/python/pyspark/ml/pipeline.py in 
> fit(self, dataset, params)
>  67 return self.copy(params)._fit(dataset)
>  68 else:
> ---> 69 return self._fit(dataset)
>  70 else:
>  71 raise ValueError("Params must be either a param map or a 
> list/tuple of param maps, "
> /Users/dreyco676/spark-1.6.0-bin-hadoop2.6/python/pyspark/ml/tuning.py in 
> _fit(self, dataset)
> 237 train = df.filter(~condition)
> 238 for j in range(numModels):
> --> 239 model = est.fit(train, epm[j])
> 240 # TODO: duplicate evaluator to take extra params from 
> input
> 241 metric = eva.evaluate(model.transform(validation, 
> epm[j]))
> /Users/dreyco676/spark-1.6.0-bin-hadoop2.6/python/pyspark/ml/pipeline.py in 
> fit(self, dataset, params)
>  65 elif isinstance(params, dict):
>  66 if params:
> ---> 67 return self.copy(params)._fit(dataset)
>  68 else:
>  

[jira] [Created] (SPARK-13045) Clean up ColumnarBatch.Row and Column.Struct

2016-01-27 Thread Nong Li (JIRA)
Nong Li created SPARK-13045:
---

 Summary: Clean up ColumnarBatch.Row and Column.Struct
 Key: SPARK-13045
 URL: https://issues.apache.org/jira/browse/SPARK-13045
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Nong Li
Priority: Minor


These two started out more different but are now basically the same. Remove one of 
them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12895) Implement TaskMetrics using accumulators

2016-01-27 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-12895.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10835
[https://github.com/apache/spark/pull/10835]

> Implement TaskMetrics using accumulators
> 
>
> Key: SPARK-12895
> URL: https://issues.apache.org/jira/browse/SPARK-12895
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> We need to first do this before we can avoid sending TaskMetrics from the 
> executors to the driver. After we do this, we can send only accumulator 
> updates instead of both that AND TaskMetrics.
> By the end of this issue TaskMetrics will be a wrapper of accumulators. It 
> will be only syntactic sugar for setting these accumulators.
> But first, we need to express everything in TaskMetrics as accumulators.
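
As a rough illustration of the intended shape (a toy sketch in Python for brevity, 
not Spark's Scala code; all names here are invented): each metric becomes a thin 
view over an accumulator, and the setters are just sugar that update it.

{code}
# Toy sketch: the metrics object stores nothing itself; every field is an
# accumulator, and the setters/getters simply delegate to it.
class Accumulator(object):
    def __init__(self, zero=0):
        self.value = zero

    def add(self, delta):
        self.value += delta

class TaskMetricsSketch(object):
    def __init__(self):
        self._bytes_read = Accumulator(0)   # one accumulator per metric

    def incBytesRead(self, n):              # "syntactic sugar" setter
        self._bytes_read.add(n)

    @property
    def bytesRead(self):
        return self._bytes_read.value

m = TaskMetricsSketch()
m.incBytesRead(1024)
assert m.bytesRead == 1024
{code}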



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12591) NullPointerException using checkpointed mapWithState with KryoSerializer

2016-01-27 Thread Lin Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15119987#comment-15119987
 ] 

Lin Zhao commented on SPARK-12591:
--

I patched the server with branch-1.6, reverted the workaround, and the error no 
longer happens.

> NullPointerException using checkpointed mapWithState with KryoSerializer
> 
>
> Key: SPARK-12591
> URL: https://issues.apache.org/jira/browse/SPARK-12591
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: MacOSX
> Java(TM) SE Runtime Environment (build 1.8.0_20-ea-b17)
>Reporter: Jan Uyttenhove
>Assignee: Shixiong Zhu
> Fix For: 1.6.1, 2.0.0
>
> Attachments: Screen Shot 2016-01-27 at 10.09.18 AM.png
>
>
> Issue occured after upgrading to the RC4 of Spark (streaming) 1.6.0 to 
> (re)test the new mapWithState API, after previously reporting issue 
> SPARK-11932 (https://issues.apache.org/jira/browse/SPARK-11932). 
> For initial report, see 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-streaming-1-6-0-RC4-NullPointerException-using-mapWithState-tt15830.html
> Narrowed it down to an issue unrelated to Kafka directstream, but, after 
> observing very unpredictable behavior as a result of changes to the Kafka 
> messages format, it seems to be related to KryoSerialization in specific.
> For test case, see my modified version of the StatefulNetworkWordCount 
> example: https://gist.github.com/juyttenh/9b4a4103699a7d5f698f 
> To reproduce, use RC4 of Spark-1.6.0 and 
> - start nc:
> {code}nc -lk {code}
> - execute the supplied test case: 
> {code}bin/spark-submit --class 
> org.apache.spark.examples.streaming.StatefulNetworkWordCount --master 
> local[2] file:///some-assembly-jar localhost {code}
> Error scenario:
> - put some text in the nc console with the job running, and observe correct 
> functioning of the word count
> - kill the spark job
> - add some more text in the nc console (with the job not running)
> - restart the spark job and observe the NPE
> (you might need to repeat this a couple of times to trigger the exception)
> Here's the stacktrace: 
> {code}
> 15/12/31 11:43:47 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 5)
> java.lang.NullPointerException
>   at 
> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:103)
>   at 
> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
>   at 
> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:56)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:154)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:148)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 15/12/31 11:43:47 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 6, 
> localhost, partition 1,NODE_LOCAL, 2239 bytes)
> 15/12/31 11:43:47 

[jira] [Commented] (SPARK-13034) PySpark ml.classification support export/import

2016-01-27 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15119959#comment-15119959
 ] 

Miao Wang commented on SPARK-13034:
---

Hi Yanbo,

I saw you opened several of these sub-tasks.
Can I take this one, or do you plan to do it yourself?

Thanks!

Miao

> PySpark ml.classification support export/import
> ---
>
> Key: SPARK-13034
> URL: https://issues.apache.org/jira/browse/SPARK-13034
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> Add export/import for all estimators and transformers (those that have a Scala 
> implementation) under pyspark/ml/classification.py. Please refer to the 
> implementation in SPARK-13032.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12950) Improve performance of BytesToBytesMap

2016-01-27 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-12950:
--

Assignee: Davies Liu

> Improve performance of BytesToBytesMap
> --
>
> Key: SPARK-12950
> URL: https://issues.apache.org/jira/browse/SPARK-12950
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> When benchmarking generated aggregate with grouping keys, profiling shows 
> that the lookup in BytesToBytesMap takes about 90% of the CPU time; we should 
> optimize it.
> After profiling with jvisualvm, here are the things that take most of the 
> time:
> 1. decode address from Long to baseObject and offset
> 2. calculate hash code
> 3. compare the bytes (equality check)
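To illustrate item 1, here is a simplified sketch of packing a page number and an 
in-page offset into a single long and decoding them again on every lookup. The bit 
widths below are an assumption for illustration, not necessarily the layout Spark's 
memory manager actually uses.

{code}
// Simplified illustration only; the 13/51-bit split is assumed, not guaranteed.
object AddressCodec {
  private val OffsetBits = 51
  private val OffsetMask = (1L << OffsetBits) - 1

  // Pack a page number and an offset within that page into one long.
  def encode(pageNumber: Int, offsetInPage: Long): Long =
    (pageNumber.toLong << OffsetBits) | (offsetInPage & OffsetMask)

  // Recover the page number (which maps to a base object) from the address.
  def decodePageNumber(address: Long): Int = (address >>> OffsetBits).toInt

  // Recover the in-page offset from the address.
  def decodeOffset(address: Long): Long = address & OffsetMask
}
{code}

Every probe has to run a decode like this before it can hash and compare the 
candidate key bytes, so small per-call costs in these three steps add up.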



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12950) Improve performance of BytesToBytesMap

2016-01-27 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-12950:
---
Assignee: (was: Davies Liu)

> Improve performance of BytesToBytesMap
> --
>
> Key: SPARK-12950
> URL: https://issues.apache.org/jira/browse/SPARK-12950
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>
> When benchmarking generated aggregate with grouping keys, profiling shows 
> that the lookup in BytesToBytesMap takes about 90% of the CPU time; we should 
> optimize it.
> After profiling with jvisualvm, here are the things that take most of the 
> time:
> 1. decode address from Long to baseObject and offset
> 2. calculate hash code
> 3. compare the bytes (equality check)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13047) Pyspark Params.hasParam should not throw an error

2016-01-27 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-13047:


 Summary: Pyspark Params.hasParam should not throw an error
 Key: SPARK-13047
 URL: https://issues.apache.org/jira/browse/SPARK-13047
 Project: Spark
  Issue Type: Bug
  Components: ML, PySpark
Reporter: Seth Hendrickson
Priority: Minor


The PySpark {{Params}} class has a method {{hasParam(paramName)}} that returns 
True if the class has a parameter by that name, but throws an 
{{AttributeError}} otherwise. There is currently no way to get a Boolean 
indicating whether a class has a parameter. With Spark 2.0 we could modify the 
existing behavior of {{hasParam}} or add an additional method with this 
functionality.
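
For comparison, a minimal sketch of a non-throwing check on the Scala side; the 
helper name {{hasParamSafe}} is hypothetical and not part of Spark, but PySpark's 
{{hasParam}} could expose the same semantics:

{code}
import org.apache.spark.ml.param.Params

object ParamsUtil {
  // Hypothetical helper (not part of Spark): a Boolean existence check over the
  // declared params that never throws, which is the behavior hasParam should have.
  def hasParamSafe(owner: Params, paramName: String): Boolean =
    owner.params.exists(_.name == paramName)
}
{code}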



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13048) EMLDAOptimizer deletes dependent checkpoint of DistributedLDAModel

2016-01-27 Thread Jeff Stein (JIRA)
Jeff Stein created SPARK-13048:
--

 Summary: EMLDAOptimizer deletes dependent checkpoint of 
DistributedLDAModel
 Key: SPARK-13048
 URL: https://issues.apache.org/jira/browse/SPARK-13048
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.5.2
 Environment: Standalone Spark cluster
Reporter: Jeff Stein


In EMLDAOptimizer, all checkpoints are deleted before returning the 
DistributedLDAModel.

The most recent checkpoint is still necessary for operations on the 
DistributedLDAModel under a couple of scenarios:
- The graph doesn't fit in memory on the worker nodes (e.g. a very large data set)
- Late worker failures that require re-reading the checkpoint the model still depends on

I ran into this problem running a 10M-record LDA model in a memory-starved 
environment. The model persistently failed either in the {{collect at 
LDAModel.scala:528}} stage when converting to a LocalLDAModel, or in the 
{{reduce at LDAModel.scala:563}} stage when calling "describeTopics". In both 
cases, a FileNotFoundException is thrown when attempting to access a checkpoint 
file.

I'm not sure what the correct fix is here; it might involve a class signature 
change. An alternative, simpler fix is to leave the last checkpoint around and 
expect the user to clean the checkpoint directory themselves.

{noformat}
java.io.FileNotFoundException: File does not exist: 
/hdfs/path/to/checkpoints/c8bd2b4e-27dd-47b3-84ec-3ff0bac04587/rdd-635/part-00071
{noformat}

Relevant code is included below.

LDAOptimizer.scala:
{noformat}
  override private[clustering] def getLDAModel(iterationTimes: Array[Double]): LDAModel = {
    require(graph != null, "graph is null, EMLDAOptimizer not initialized.")
    this.graphCheckpointer.deleteAllCheckpoints()
    // The constructor's default arguments assume gammaShape = 100 to ensure equivalence in
    // LDAModel.toLocal conversion
    new DistributedLDAModel(this.graph, this.globalTopicTotals, this.k, this.vocabSize,
      Vectors.dense(Array.fill(this.k)(this.docConcentration)), this.topicConcentration,
      iterationTimes)
  }
{noformat}

PeriodicCheckpointer.scala

{noformat}
  /**
   * Call this at the end to delete any remaining checkpoint files.
   */
  def deleteAllCheckpoints(): Unit = {
    while (checkpointQueue.nonEmpty) {
      removeCheckpointFile()
    }
  }

  /**
   * Dequeue the oldest checkpointed Dataset, and remove its checkpoint files.
   * This prints a warning but does not fail if the files cannot be removed.
   */
  private def removeCheckpointFile(): Unit = {
    val old = checkpointQueue.dequeue()
    // Since the old checkpoint is not deleted by Spark, we manually delete it.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    getCheckpointFiles(old).foreach { checkpointFile =>
      try {
        fs.delete(new Path(checkpointFile), true)
      } catch {
        case e: Exception =>
          logWarning("PeriodicCheckpointer could not remove old checkpoint file: " +
            checkpointFile)
      }
    }
  }
{noformat}
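
As a sketch of the "leave the last checkpoint around" alternative, a hypothetical 
variant of the method above (the name {{deleteAllCheckpointsButLast}} is mine, not 
Spark's) could simply stop one checkpoint early; {{getLDAModel}} would then call it 
instead of {{deleteAllCheckpoints()}}:

{noformat}
  /**
   * Hypothetical variant: delete all but the most recent checkpoint so that the
   * returned model's lineage remains readable. The caller is then responsible for
   * cleaning the checkpoint directory once the model is no longer needed.
   */
  def deleteAllCheckpointsButLast(): Unit = {
    // Stop dequeuing once only the newest checkpoint is left in the queue.
    while (checkpointQueue.size > 1) {
      removeCheckpointFile()
    }
  }
{noformat}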



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13045) Clean up ColumnarBatch.Row and Column.Struct

2016-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13045:


Assignee: (was: Apache Spark)

> Clean up ColumnarBatch.Row and Column.Struct
> 
>
> Key: SPARK-13045
> URL: https://issues.apache.org/jira/browse/SPARK-13045
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Nong Li
>Priority: Minor
>
> These two classes started out more different, but they are now basically the 
> same. Remove one of them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13045) Clean up ColumnarBatch.Row and Column.Struct

2016-01-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120094#comment-15120094
 ] 

Apache Spark commented on SPARK-13045:
--

User 'nongli' has created a pull request for this issue:
https://github.com/apache/spark/pull/10952

> Clean up ColumnarBatch.Row and Column.Struct
> 
>
> Key: SPARK-13045
> URL: https://issues.apache.org/jira/browse/SPARK-13045
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Nong Li
>Priority: Minor
>
> These two classes started out more different, but they are now basically the 
> same. Remove one of them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-01-27 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120113#comment-15120113
 ] 

Mark Grover commented on SPARK-12177:
-

I have opened a new PR for this, https://github.com/apache/spark/pull/10953, 
which contains all of Nikita's changes as well. Please feel free to review and 
comment there.

The Python implementation is not in that PR just yet; it's being worked on 
separately at 
https://github.com/markgrover/spark/tree/kafka09-integration-python (for now, 
anyway).

The new package is called 'newapi' instead of 'v09'.
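
For readers unfamiliar with it, here is a minimal sketch of the Kafka 0.9 "new" 
consumer API that this integration targets; the broker, group, and topic names 
are placeholders.

{code}
import java.util.{Arrays, Properties}
import scala.collection.JavaConverters._

import org.apache.kafka.clients.consumer.KafkaConsumer

object NewConsumerSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder connection settings.
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("group.id", "example-group")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Arrays.asList("example-topic"))
    try {
      // The new consumer drives group membership, offset management, and fetching
      // through this single client, rather than the older consumer classes.
      val records = consumer.poll(1000L)
      records.asScala.foreach(r => println(s"${r.offset()}: ${r.value()}"))
    } finally {
      consumer.close()
    }
  }
}
{code}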

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released, and it introduces a new consumer API that 
> is not compatible with the old one. So I added the new consumer API as separate 
> classes in the package org.apache.spark.streaming.kafka.v09. I did not remove 
> the old classes, for backward compatibility: users will not need to change 
> their old Spark applications when they upgrade to a new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


