[jira] [Commented] (SPARK-12225) Support adding or replacing multiple columns at once in DataFrame API

2017-05-10 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005904#comment-16005904
 ] 

Liang-Chi Hsieh commented on SPARK-12225:
-

Without knowing about this issue, I've implemented a {{withColumns}} API on Dataset 
in SPARK-20542. It benefits ML usage a lot and gives better performance. 
For ML pipelines, which can chain dozens of stages, the total cost grows quickly 
if we call withColumn in each stage.
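
For illustration (not part of the original comment), a minimal Scala sketch of the difference, 
with made-up column names and a withColumns(names, cols) shape assumed for the API; see 
SPARK-20542 for the actual signature:

{code}
import org.apache.spark.sql.{DataFrame, functions => F}

// Chained withColumn: every call adds another projection that the analyzer and
// optimizer must process, so the planning cost grows with the number of stages.
def addOneByOne(df: DataFrame): DataFrame =
  df.withColumn("f1", F.col("a") + 1)
    .withColumn("f2", F.col("a") * 2)
    .withColumn("f3", F.abs(F.col("a")))

// Batched alternative: a single call producing one projection for all new columns.
def addAtOnce(df: DataFrame): DataFrame =
  df.withColumns(
    Seq("f1", "f2", "f3"),
    Seq(F.col("a") + 1, F.col("a") * 2, F.abs(F.col("a"))))
{code}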

> Support adding or replacing multiple columns at once in DataFrame API
> -
>
> Key: SPARK-12225
> URL: https://issues.apache.org/jira/browse/SPARK-12225
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Sun Rui
>
> Currently, the withColumn() method of DataFrame supports adding or replacing only 
> a single column. It would be convenient to support adding or replacing multiple 
> columns at once.
> Also, withColumnRenamed() supports renaming only a single column. It would also 
> be convenient to support renaming multiple columns at once.






[jira] [Assigned] (SPARK-20704) CRAN test should run single threaded

2017-05-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20704:


Assignee: Apache Spark

> CRAN test should run single threaded
> 
>
> Key: SPARK-20704
> URL: https://issues.apache.org/jira/browse/SPARK-20704
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Apache Spark
>







[jira] [Assigned] (SPARK-20704) CRAN test should run single threaded

2017-05-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20704:


Assignee: (was: Apache Spark)

> CRAN test should run single threaded
> 
>
> Key: SPARK-20704
> URL: https://issues.apache.org/jira/browse/SPARK-20704
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>







[jira] [Commented] (SPARK-20704) CRAN test should run single threaded

2017-05-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005899#comment-16005899
 ] 

Apache Spark commented on SPARK-20704:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/17945

> CRAN test should run single threaded
> 
>
> Key: SPARK-20704
> URL: https://issues.apache.org/jira/browse/SPARK-20704
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>







[jira] [Created] (SPARK-20704) CRAN test should run single threaded

2017-05-10 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-20704:


 Summary: CRAN test should run single threaded
 Key: SPARK-20704
 URL: https://issues.apache.org/jira/browse/SPARK-20704
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.2.0
Reporter: Felix Cheung









[jira] [Commented] (SPARK-20590) Map default input data source formats to inlined classes

2017-05-10 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005869#comment-16005869
 ] 

Wenchen Fan commented on SPARK-20590:
-

We only prefer the internal data source if the given name is a short name like 
"csv", "json", etc. Using the full name still works.

> Map default input data source formats to inlined classes
> 
>
> Key: SPARK-20590
> URL: https://issues.apache.org/jira/browse/SPARK-20590
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Sameer Agarwal
>Assignee: Hyukjin Kwon
> Fix For: 2.2.1, 2.3.0
>
>
> One of the common usability problems around reading data in spark 
> (particularly CSV) is that there can often be a conflict between different 
> readers in the classpath.
> As an example, if someone launches a 2.x spark shell with the spark-csv 
> package in the classpath, Spark currently fails in an extremely unfriendly way
> {code}
> ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
> scala> val df = spark.read.csv("/foo/bar.csv")
> java.lang.RuntimeException: Multiple sources found for csv 
> (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, 
> com.databricks.spark.csv.DefaultSource15), please specify the fully qualified 
> class name.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:574)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:295)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
>   ... 48 elided
> {code}
> This JIRA proposes a simple way of fixing this error by always mapping 
> default input data source formats to inlined classes (that exist in Spark).
> {code}
> ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
> scala> val df = spark.read.csv("/foo/bar.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: string]
> {code}






[jira] [Commented] (SPARK-20590) Map default input data source formats to inlined classes

2017-05-10 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005864#comment-16005864
 ] 

Felix Cheung commented on SPARK-20590:
--

When the user explicitly specifies the package to use, shouldn't that take 
priority over the internal one?
Say a better csv implementation exists as a Spark package; right now there is 
no way to use it.


> Map default input data source formats to inlined classes
> 
>
> Key: SPARK-20590
> URL: https://issues.apache.org/jira/browse/SPARK-20590
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Sameer Agarwal
>Assignee: Hyukjin Kwon
> Fix For: 2.2.1, 2.3.0
>
>
> One of the common usability problems around reading data in spark 
> (particularly CSV) is that there can often be a conflict between different 
> readers in the classpath.
> As an example, if someone launches a 2.x spark shell with the spark-csv 
> package in the classpath, Spark currently fails in an extremely unfriendly way
> {code}
> ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
> scala> val df = spark.read.csv("/foo/bar.csv")
> java.lang.RuntimeException: Multiple sources found for csv 
> (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, 
> com.databricks.spark.csv.DefaultSource15), please specify the fully qualified 
> class name.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:574)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:295)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
>   ... 48 elided
> {code}
> This JIRA proposes a simple way of fixing this error by always mapping 
> default input data source formats to inlined classes (that exist in Spark).
> {code}
> ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
> scala> val df = spark.read.csv("/foo/bar.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: string]
> {code}






[jira] [Updated] (SPARK-20666) Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError

2017-05-10 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-20666:
-
Description: 
Seeing quite a bit of this on AppVeyor (aka Windows only), though it seems to show up in other 
test runs too; always only when running ML tests, it seems.

{code}
Exception in thread "SparkListenerBus" java.lang.IllegalAccessError: Attempted 
to access garbage collected accumulator 159454
at 
org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265)
at 
org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261)
at scala.Option.map(Option.scala:146)
at 
org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261)
at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:88)
at 
org.apache.spark.sql.execution.metric.SQLMetric.toInfo(SQLMetrics.scala:67)
at 
org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216)
at 
org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:216)
at 
org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
at 
org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
at 
org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
at 
org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
at 
org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36)
at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94)
at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1268)
at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77)
1
MLlib recommendation algorithms: Spark package found in SPARK_HOME: 
C:\projects\spark\bin\..

{code}

{code}
java.lang.IllegalStateException: SparkContext has been shutdown
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2044)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2063)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2088)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275)
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2923)
at 
org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474)
at 
org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474)
at org.apache.spark.sql.Dataset$$anonfun$57.apply(Dataset.scala:2907)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2906)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:2474)
at org.apache.spark.sql.api.r.SQLUtils$.dfToCols(SQLUtils.scala:173)
at org.apache.spark.sql.api.r.SQLUtils.dfToCols(SQLUtils.scala)
at sun.reflect.GeneratedMethodAccessor104.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 

[jira] [Commented] (SPARK-20228) Random Forest instable results depending on spark.executor.memory

2017-05-10 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005844#comment-16005844
 ] 

Hyukjin Kwon commented on SPARK-20228:
--

gentle ping [~Ansgar Schulze]

> Random Forest instable results depending on spark.executor.memory
> -
>
> Key: SPARK-20228
> URL: https://issues.apache.org/jira/browse/SPARK-20228
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Ansgar Schulze
>
> If I deploy a random forest model with, for example,
> spark.executor.memory=20480M
> I get a different result than if I deploy the model with
> spark.executor.memory=6000M
> I expected the same results but different runtimes.






[jira] [Commented] (SPARK-20369) pyspark: Dynamic configuration with SparkConf does not work

2017-05-10 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005842#comment-16005842
 ] 

Hyukjin Kwon commented on SPARK-20369:
--

I am resolving this as I can't reproduce it as described above and it looks like the reporter is 
inactive.

> pyspark: Dynamic configuration with SparkConf does not work
> ---
>
> Key: SPARK-20369
> URL: https://issues.apache.org/jira/browse/SPARK-20369
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: Ubuntu 14.04.1 LTS (GNU/Linux 3.13.0-40-generic x86_64) 
> and Mac OS X 10.11.6
>Reporter: Matthew McClain
>Priority: Minor
>
> Setting spark properties dynamically in pyspark using SparkConf object does 
> not work. Here is the code that shows the bug:
> ---
> from pyspark import SparkContext, SparkConf
> def main():
>     conf = SparkConf().setAppName("spark-conf-test") \
>         .setMaster("local[2]") \
>         .set('spark.python.worker.memory', "1g") \
>         .set('spark.executor.memory', "3g") \
>         .set("spark.driver.maxResultSize", "2g")
>     print "Spark Config values in SparkConf:"
>     print conf.toDebugString()
>     sc = SparkContext(conf=conf)
>     print "Actual Spark Config values:"
>     print sc.getConf().toDebugString()
> if __name__ == "__main__":
>     main()
> ---
> Here is the output; none of the config values set in SparkConf are used in 
> the SparkContext configuration:
> Spark Config values in SparkConf:
> spark.master=local[2]
> spark.executor.memory=3g
> spark.python.worker.memory=1g
> spark.app.name=spark-conf-test
> spark.driver.maxResultSize=2g
> 17/04/18 10:21:24 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Actual Spark Config values:
> spark.app.id=local-1492528885708
> spark.app.name=sandbox.py
> spark.driver.host=10.201.26.172
> spark.driver.maxResultSize=4g
> spark.driver.port=54657
> spark.executor.id=driver
> spark.files=file:/Users/matt.mcclain/dev/datascience-experiments/mmcclain/client_clusters/sandbox.py
> spark.master=local[*]
> spark.rdd.compress=True
> spark.serializer.objectStreamReset=100
> spark.submit.deployMode=client






[jira] [Resolved] (SPARK-20369) pyspark: Dynamic configuration with SparkConf does not work

2017-05-10 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-20369.
--
Resolution: Cannot Reproduce

> pyspark: Dynamic configuration with SparkConf does not work
> ---
>
> Key: SPARK-20369
> URL: https://issues.apache.org/jira/browse/SPARK-20369
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: Ubuntu 14.04.1 LTS (GNU/Linux 3.13.0-40-generic x86_64) 
> and Mac OS X 10.11.6
>Reporter: Matthew McClain
>Priority: Minor
>
> Setting spark properties dynamically in pyspark using SparkConf object does 
> not work. Here is the code that shows the bug:
> ---
> from pyspark import SparkContext, SparkConf
> def main():
>     conf = SparkConf().setAppName("spark-conf-test") \
>         .setMaster("local[2]") \
>         .set('spark.python.worker.memory', "1g") \
>         .set('spark.executor.memory', "3g") \
>         .set("spark.driver.maxResultSize", "2g")
>     print "Spark Config values in SparkConf:"
>     print conf.toDebugString()
>     sc = SparkContext(conf=conf)
>     print "Actual Spark Config values:"
>     print sc.getConf().toDebugString()
> if __name__ == "__main__":
>     main()
> ---
> Here is the output; none of the config values set in SparkConf are used in 
> the SparkContext configuration:
> Spark Config values in SparkConf:
> spark.master=local[2]
> spark.executor.memory=3g
> spark.python.worker.memory=1g
> spark.app.name=spark-conf-test
> spark.driver.maxResultSize=2g
> 17/04/18 10:21:24 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Actual Spark Config values:
> spark.app.id=local-1492528885708
> spark.app.name=sandbox.py
> spark.driver.host=10.201.26.172
> spark.driver.maxResultSize=4g
> spark.driver.port=54657
> spark.executor.id=driver
> spark.files=file:/Users/matt.mcclain/dev/datascience-experiments/mmcclain/client_clusters/sandbox.py
> spark.master=local[*]
> spark.rdd.compress=True
> spark.serializer.objectStreamReset=100
> spark.submit.deployMode=client






[jira] [Commented] (SPARK-20606) ML 2.2 QA: Remove deprecated methods for ML

2017-05-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005835#comment-16005835
 ] 

Apache Spark commented on SPARK-20606:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/17944

> ML 2.2 QA: Remove deprecated methods for ML
> ---
>
> Key: SPARK-20606
> URL: https://issues.apache.org/jira/browse/SPARK-20606
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.1.0
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 2.2.0
>
>
> Remove ML methods we deprecated in 2.1.






[jira] [Commented] (SPARK-19354) Killed tasks are getting marked as FAILED

2017-05-10 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005823#comment-16005823
 ] 

Imran Rashid commented on SPARK-19354:
--

Did a bit more searching -- isn't this fixed by SPARK-20217 & SPARK-20358?

> Killed tasks are getting marked as FAILED
> -
>
> Key: SPARK-19354
> URL: https://issues.apache.org/jira/browse/SPARK-19354
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Reporter: Devaraj K
>
> When we enable speculation, we can see there are multiple attempts running 
> for the same task when the first attempt's progress is slow. If any of the task 
> attempts succeeds, then the other attempts are killed; while being killed, those 
> attempts get marked as failed due to the error below. 
> We need to handle this error and mark the attempt as KILLED instead of FAILED.
> ||93  ||214   ||1 (speculative)   ||FAILED||ANY   ||1 / 
> xx.xx.xx.x2
> stdout
> stderr||2017/01/24 10:30:44   ||0.2 s ||0.0 B / 0 ||8.0 KB / 400  
> ||java.io.IOException: Failed on local exception: 
> java.nio.channels.ClosedByInterruptException; Host Details : local host is: 
> node2/xx.xx.xx.x2; destination host is: node1:9000; 
> +details||
> {code:xml}
> 17/01/23 23:54:32 INFO Executor: Executor is trying to kill task 93.1 in 
> stage 1.0 (TID 214)
> 17/01/23 23:54:32 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 1
> 17/01/23 23:54:32 ERROR Executor: Exception in task 93.1 in stage 1.0 (TID 
> 214)
> java.io.IOException: Failed on local exception: 
> java.nio.channels.ClosedByInterruptException; Host Details : local host is: 
> "stobdtserver3/10.224.54.70"; destination host is: "stobdtserver2":9000; 
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1479)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1412)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>   at com.sun.proxy.$Proxy17.create(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:296)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy18.create(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1648)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1689)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1624)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:448)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:444)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:459)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:387)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:804)
>   at 
> org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
>   at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1133)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1124)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:88)
>   at org.apache.spark.scheduler.Task.run(Task.scala:114)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.nio.channels.ClosedByInterruptException
>   at 
> java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
>   at 

[jira] [Commented] (SPARK-19354) Killed tasks are getting marked as FAILED

2017-05-10 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005817#comment-16005817
 ] 

Thomas Graves commented on SPARK-19354:
---

Right, from what I've seen it's not a blacklisting bug. It's a bug with speculative tasks 
being marked as failed rather than killed, which then leads to the executor 
being blacklisted.

Not sure on the OOM part; I never saw that.

> Killed tasks are getting marked as FAILED
> -
>
> Key: SPARK-19354
> URL: https://issues.apache.org/jira/browse/SPARK-19354
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Reporter: Devaraj K
>
> When we enable speculation, we can see there are multiple attempts running 
> for the same task when the first attempt's progress is slow. If any of the task 
> attempts succeeds, then the other attempts are killed; while being killed, those 
> attempts get marked as failed due to the error below. 
> We need to handle this error and mark the attempt as KILLED instead of FAILED.
> ||93  ||214   ||1 (speculative)   ||FAILED||ANY   ||1 / 
> xx.xx.xx.x2
> stdout
> stderr||2017/01/24 10:30:44   ||0.2 s ||0.0 B / 0 ||8.0 KB / 400  
> ||java.io.IOException: Failed on local exception: 
> java.nio.channels.ClosedByInterruptException; Host Details : local host is: 
> node2/xx.xx.xx.x2; destination host is: node1:9000; 
> +details||
> {code:xml}
> 17/01/23 23:54:32 INFO Executor: Executor is trying to kill task 93.1 in 
> stage 1.0 (TID 214)
> 17/01/23 23:54:32 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 1
> 17/01/23 23:54:32 ERROR Executor: Exception in task 93.1 in stage 1.0 (TID 
> 214)
> java.io.IOException: Failed on local exception: 
> java.nio.channels.ClosedByInterruptException; Host Details : local host is: 
> "stobdtserver3/10.224.54.70"; destination host is: "stobdtserver2":9000; 
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1479)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1412)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>   at com.sun.proxy.$Proxy17.create(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:296)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy18.create(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1648)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1689)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1624)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:448)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:444)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:459)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:387)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:804)
>   at 
> org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
>   at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1133)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1124)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:88)
>   at org.apache.spark.scheduler.Task.run(Task.scala:114)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.nio.channels.ClosedByInterruptException
>   at 
> 

[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA

2017-05-10 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005795#comment-16005795
 ] 

Marcelo Vanzin commented on SPARK-20608:


You still haven't understood what I'm saying. You should *NOT* be putting 
namenode addresses in {{spark.yarn.access.namenodes}} if you're using HA. You 
should be putting the namespace address. Simple. If that doesn't work, then 
there's a bug somewhere. But your current fix is wrong.
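
For illustration only, the kind of configuration being described, assuming an HDFS HA 
nameservice named {{mycluster}} (a hypothetical name that must match {{dfs.nameservices}} 
in hdfs-site.xml):

{code}
# Point Spark at the HA nameservice...
--conf spark.yarn.access.namenodes=hdfs://mycluster

# ...rather than listing the individual namenode hosts:
--conf spark.yarn.access.namenodes=hdfs://namenode01,hdfs://namenode02
{code}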

> Standby namenodes should be allowed to included in 
> yarn.spark.access.namenodes to support HDFS HA
> -
>
> Key: SPARK-20608
> URL: https://issues.apache.org/jira/browse/SPARK-20608
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit, YARN
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Yuechen Chen
>Priority: Minor
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> If a Spark application needs to access remote namenodes, 
> yarn.spark.access.namenodes should only be configured in spark-submit 
> scripts, and the Spark client (on YARN) would fetch HDFS credentials periodically.
> If a Hadoop cluster is configured for HA, there is one active namenode 
> and at least one standby namenode. 
> However, if yarn.spark.access.namenodes includes both active and standby 
> namenodes, the Spark application will fail because the standby 
> namenode cannot be accessed by Spark, raising org.apache.hadoop.ipc.StandbyException.
> I think it won't cause any bad effect to configure standby namenodes in 
> yarn.spark.access.namenodes, and my Spark application would be able to survive 
> a Hadoop namenode failover.
> HA Examples:
> Spark-submit script: 
> yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
> Spark Application Codes:
> dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)






[jira] [Commented] (SPARK-20682) Support a new faster ORC data source based on Apache ORC

2017-05-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005789#comment-16005789
 ] 

Apache Spark commented on SPARK-20682:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/17943

> Support a new faster ORC data source based on Apache ORC
> 
>
> Key: SPARK-20682
> URL: https://issues.apache.org/jira/browse/SPARK-20682
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.1.1
>Reporter: Dongjoon Hyun
>
> Since SPARK-2883, Apache Spark has supported Apache ORC inside the `sql/hive` module 
> with a Hive dependency. This issue aims to add a new and faster ORC data source 
> inside `sql/core` and to replace the old ORC data source eventually. In this 
> issue, the latest Apache ORC 1.4.0 (released yesterday) is used.
> There are four key benefits.
> - Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together. This is 
> faster than the current implementation in Spark.
> - Stability: Apache ORC 1.4.0 has many fixes, and we can depend on the ORC 
> community more.
> - Usability: Users can use the `ORC` data source without the hive module, i.e., 
> without building with `-Phive`.
> - Maintainability: Reduce the Hive dependency and remove old legacy code 
> later.
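
(Illustration only, not from the ticket: a sketch of the kind of usage this would enable, 
assuming the new implementation keeps the existing {{orc}} short name; the paths are 
placeholders.)

{code}
// Read and write ORC through the regular data source API; with the sql/core
// implementation this would no longer require building Spark with -Phive.
val df = spark.read.format("orc").load("/path/to/input.orc")
df.write.format("orc").save("/path/to/output")
{code}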






[jira] [Commented] (SPARK-20703) Add an operator for writing data out

2017-05-10 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005780#comment-16005780
 ] 

Liang-Chi Hsieh commented on SPARK-20703:
-

[~rxin] Thanks for pinging me. Sure, I'd love to take this.

> Add an operator for writing data out
> 
>
> Key: SPARK-20703
> URL: https://issues.apache.org/jira/browse/SPARK-20703
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>
> We should add an operator for writing data out. Right now in the explain plan 
> / UI there is no way to tell whether a query is writing data out, and also 
> there is no way to associate metrics with data writes. It'd be tremendously 
> valuable to do this for adding metrics and for visibility.






[jira] [Created] (SPARK-20703) Add an operator for writing data out

2017-05-10 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-20703:
---

 Summary: Add an operator for writing data out
 Key: SPARK-20703
 URL: https://issues.apache.org/jira/browse/SPARK-20703
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.2.0
Reporter: Reynold Xin


We should add an operator for writing data out. Right now in the explain plan / 
UI there is no way to tell whether a query is writing data out, and also there 
is no way to associate metrics with data writes. It'd be tremendously valuable 
to do this for adding metrics and for visibility.









[jira] [Commented] (SPARK-20703) Add an operator for writing data out

2017-05-10 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005765#comment-16005765
 ] 

Reynold Xin commented on SPARK-20703:
-

cc [~viirya] want to give this a try?


> Add an operator for writing data out
> 
>
> Key: SPARK-20703
> URL: https://issues.apache.org/jira/browse/SPARK-20703
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>
> We should add an operator for writing data out. Right now in the explain plan 
> / UI there is no way to tell whether a query is writing data out, and also 
> there is no way to associate metrics with data writes. It'd be tremendously 
> valuable to do this for adding metrics and for visibility.






[jira] [Commented] (SPARK-19354) Killed tasks are getting marked as FAILED

2017-05-10 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005767#comment-16005767
 ] 

Imran Rashid commented on SPARK-19354:
--

[~tgraves] I haven't run into this yet -- frankly I still steer users away from 
speculation for the most part.  I don't know of any fix for this.  I can see 
how this messes up blacklisting in particular, but just to make sure I 
understand right, this isn't a blacklisting bug, right?  the problem is that 
killing speculative tasks has some unintended side effects, right? IIUC, the 
original task looks like it failed for the wrong reason, and even worse, the 
entire executor dies, so other tasks running on the executor fail?

I don't understand this part:

bq. When sorter spill to disk, the task is killed. Then a interruptedExecption 
is thrown. Then OOM will be thrown

how does the interrupted exception lead to an OOM, and killing the executor?  I 
can't see how speculative execution could be used effectively if killing tasks 
can bring down an executor.

> Killed tasks are getting marked as FAILED
> -
>
> Key: SPARK-19354
> URL: https://issues.apache.org/jira/browse/SPARK-19354
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Reporter: Devaraj K
>
> When we enable speculation, we can see there are multiple attempts running 
> for the same task when the first attempt's progress is slow. If any of the task 
> attempts succeeds, then the other attempts are killed; while being killed, those 
> attempts get marked as failed due to the error below. 
> We need to handle this error and mark the attempt as KILLED instead of FAILED.
> ||93  ||214   ||1 (speculative)   ||FAILED||ANY   ||1 / 
> xx.xx.xx.x2
> stdout
> stderr||2017/01/24 10:30:44   ||0.2 s ||0.0 B / 0 ||8.0 KB / 400  
> ||java.io.IOException: Failed on local exception: 
> java.nio.channels.ClosedByInterruptException; Host Details : local host is: 
> node2/xx.xx.xx.x2; destination host is: node1:9000; 
> +details||
> {code:xml}
> 17/01/23 23:54:32 INFO Executor: Executor is trying to kill task 93.1 in 
> stage 1.0 (TID 214)
> 17/01/23 23:54:32 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 1
> 17/01/23 23:54:32 ERROR Executor: Exception in task 93.1 in stage 1.0 (TID 
> 214)
> java.io.IOException: Failed on local exception: 
> java.nio.channels.ClosedByInterruptException; Host Details : local host is: 
> "stobdtserver3/10.224.54.70"; destination host is: "stobdtserver2":9000; 
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1479)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1412)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>   at com.sun.proxy.$Proxy17.create(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:296)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy18.create(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1648)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1689)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1624)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:448)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:444)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:459)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:387)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:804)
>   at 
> org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
>   at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1133)
>   at 
> 

[jira] [Commented] (SPARK-20200) Flaky Test: org.apache.spark.rdd.LocalCheckpointSuite

2017-05-10 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005757#comment-16005757
 ] 

Yuming Wang commented on SPARK-20200:
-

Can you check it again? It works for me.
{code}
build/sbt  "test-only org.apache.spark.rdd.LocalCheckpointSuite"
{code}

> Flaky Test: org.apache.spark.rdd.LocalCheckpointSuite
> -
>
> Key: SPARK-20200
> URL: https://issues.apache.org/jira/browse/SPARK-20200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Takuya Ueshin
>Priority: Minor
>  Labels: flaky-test
>
> This test failed recently here:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/2909/testReport/junit/org.apache.spark.rdd/LocalCheckpointSuite/missing_checkpoint_block_fails_with_informative_message/
> Dashboard
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.rdd.LocalCheckpointSuite_name=missing+checkpoint+block+fails+with+informative+message
> Error Message
> {code}
> Collect should have failed if local checkpoint block is removed...
> {code}
> {code}
> org.scalatest.exceptions.TestFailedException: Collect should have failed if 
> local checkpoint block is removed...
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>   at org.scalatest.Assertions$class.fail(Assertions.scala:1328)
>   at org.scalatest.FunSuite.fail(FunSuite.scala:1555)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite$$anonfun$16.apply$mcV$sp(LocalCheckpointSuite.scala:173)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite$$anonfun$16.apply(LocalCheckpointSuite.scala:155)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite$$anonfun$16.apply(LocalCheckpointSuite.scala:155)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(LocalCheckpointSuite.scala:27)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite.runTest(LocalCheckpointSuite.scala:27)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:31)
>   at 

[jira] [Assigned] (SPARK-20702) TaskContextImpl.markTaskCompleted should not hide the original error

2017-05-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20702:


Assignee: Apache Spark  (was: Shixiong Zhu)

> TaskContextImpl.markTaskCompleted should not hide the original error
> 
>
> Key: SPARK-20702
> URL: https://issues.apache.org/jira/browse/SPARK-20702
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> If a TaskCompletionListener throws an error, 
> TaskContextImpl.markTaskCompleted will hide the original error.






[jira] [Assigned] (SPARK-20702) TaskContextImpl.markTaskCompleted should not hide the original error

2017-05-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20702:


Assignee: Shixiong Zhu  (was: Apache Spark)

> TaskContextImpl.markTaskCompleted should not hide the original error
> 
>
> Key: SPARK-20702
> URL: https://issues.apache.org/jira/browse/SPARK-20702
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> If a TaskCompletionListener throws an error, 
> TaskContextImpl.markTaskCompleted will hide the original error.






[jira] [Commented] (SPARK-20702) TaskContextImpl.markTaskCompleted should not hide the original error

2017-05-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005722#comment-16005722
 ] 

Apache Spark commented on SPARK-20702:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/17942

> TaskContextImpl.markTaskCompleted should not hide the original error
> 
>
> Key: SPARK-20702
> URL: https://issues.apache.org/jira/browse/SPARK-20702
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> If a TaskCompletionListener throws an error, 
> TaskContextImpl.markTaskCompleted will hide the original error.






[jira] [Created] (SPARK-20702) TaskContextImpl.markTaskCompleted should not hide the original error

2017-05-10 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-20702:


 Summary: TaskContextImpl.markTaskCompleted should not hide the 
original error
 Key: SPARK-20702
 URL: https://issues.apache.org/jira/browse/SPARK-20702
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.1, 2.2.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


If a TaskCompletionListener throws an error, TaskContextImpl.markTaskCompleted 
will hide the original error.
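
(Illustration only, not Spark's actual code: a self-contained Scala sketch of how an exception 
thrown while running completion listeners can replace the task's original error unless that 
error is explicitly re-attached, e.g. via Throwable.addSuppressed.)

{code}
object HiddenErrorDemo {
  // Stand-in for running TaskCompletionListeners; here one of them fails.
  def runCompletionListeners(): Unit =
    throw new IllegalStateException("listener failed")

  def main(args: Array[String]): Unit = {
    try {
      try {
        throw new RuntimeException("original task error")
      } finally {
        // Throwing here replaces the in-flight exception, so the caller never
        // sees "original task error" unless it is attached as suppressed.
        runCompletionListeners()
      }
    } catch {
      case t: Throwable => println(s"Caller only sees: $t")
    }
  }
}
{code}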






[jira] [Resolved] (SPARK-20685) BatchPythonEvaluation UDF evaluator fails for case of single UDF with repeated argument

2017-05-10 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20685.
-
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.1
   2.1.2

> BatchPythonEvaluation UDF evaluator fails for case of single UDF with 
> repeated argument
> ---
>
> Key: SPARK-20685
> URL: https://issues.apache.org/jira/browse/SPARK-20685
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.1.2, 2.2.1, 2.3.0
>
>
> There's a latent corner-case bug in PySpark UDF evaluation where executing a 
> stage with a single UDF that takes more than one argument _where that 
> argument is repeated_ will crash at execution with a confusing error.
> Here's a repro:
> {code}
> from pyspark.sql.types import *
> spark.catalog.registerFunction("add", lambda x, y: x + y, IntegerType())
> spark.sql("SELECT add(1, 1)").first()
> {code}
> This fails with
> {code}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
>   File 
> "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 180, in main
> process()
>   File 
> "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 175, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File 
> "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 107, in <lambda>
> func = lambda _, it: map(mapper, it)
>   File 
> "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 93, in <lambda>
> mapper = lambda a: udf(*a)
>   File 
> "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 71, in <lambda>
> return lambda *a: f(*a)
> TypeError: <lambda>() takes exactly 2 arguments (1 given)
> {code}
> The problem was introduced by SPARK-14267: the code there has a fast path 
> for handling a batch UDF evaluation consisting of a single Python UDF, but 
> that branch incorrectly assumes that a single UDF won't have repeated 
> arguments and therefore skips the code for unpacking arguments from the input 
> row (whose schema may not necessarily match the UDF inputs).
> I have a simple fix for this which I'll submit now.






[jira] [Created] (SPARK-20701) dataframe.show has wrong white space when containing Supplement Unicode character

2017-05-10 Thread Pingsan Song (JIRA)
Pingsan Song created SPARK-20701:


 Summary: dataframe.show has wrong white space when containing 
Supplement Unicode character
 Key: SPARK-20701
 URL: https://issues.apache.org/jira/browse/SPARK-20701
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 2.1.1
 Environment: Mac, Hadoop 2.7 prebuilt
Reporter: Pingsan Song
Priority: Trivial


The character in the String is \u1D400, repeated 4 times.
I guess it would be the same for any supplementary Unicode character.

scala> var testDF = sc.parallelize(Seq("")).toDF

scala> testDF.show(false)
++
|value   |
++
||
++
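
(Not from the report, but a likely contributing factor: U+1D400 is a supplementary character, 
so Java stores it as a surrogate pair and {{String.length}} counts it as 2, which skews any 
padding computed from the string length. A minimal sketch:)

{code}
// U+1D400 (MATHEMATICAL BOLD CAPITAL A) written as a surrogate pair, repeated 4 times.
val s = "\uD835\uDC00" * 4

println(s.length)                      // 8: UTF-16 code units
println(s.codePointCount(0, s.length)) // 4: code points actually displayed
// Padding derived from s.length over-counts, producing misaligned columns in show().
{code}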






[jira] [Assigned] (SPARK-20684) expose createGlobalTempView and dropGlobalTempView in SparkR

2017-05-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20684:


Assignee: Apache Spark

> expose createGlobalTempView and dropGlobalTempView in SparkR
> 
>
> Key: SPARK-20684
> URL: https://issues.apache.org/jira/browse/SPARK-20684
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>Assignee: Apache Spark
>
> This is a useful API that is not exposed in SparkR. It will help with moving 
> data between languages on a single Spark application.






[jira] [Assigned] (SPARK-20684) expose createGlobalTempView and dropGlobalTempView in SparkR

2017-05-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20684:


Assignee: (was: Apache Spark)

> expose createGlobalTempView and dropGlobalTempView in SparkR
> 
>
> Key: SPARK-20684
> URL: https://issues.apache.org/jira/browse/SPARK-20684
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>
> This is a useful API that is not exposed in SparkR. It will help with moving 
> data between languages on a single Spark application.






[jira] [Commented] (SPARK-20684) expose createGlobalTempView and dropGlobalTempView in SparkR

2017-05-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005600#comment-16005600
 ] 

Apache Spark commented on SPARK-20684:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/17941

> expose createGlobalTempView and dropGlobalTempView in SparkR
> 
>
> Key: SPARK-20684
> URL: https://issues.apache.org/jira/browse/SPARK-20684
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>
> This is a useful API that is not exposed in SparkR. It will help with moving 
> data between languages in a single Spark application.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13210) NPE in Sort

2017-05-10 Thread David McWhorter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005591#comment-16005591
 ] 

David McWhorter edited comment on SPARK-13210 at 5/10/17 10:52 PM:
---

I think it's the same error; here's the start of the stack trace with assertions 
disabled:

Caused by: java.lang.NullPointerException
at 
org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:347)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortedIterator.loadNext(UnsafeInMemorySorter.java:301)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:216)
at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:171)
at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:245)
at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92)



was (Author: dmcwhorter):
I suppose that may be a different error actually...

> NPE in Sort
> ---
>
> Key: SPARK-13210
> URL: https://issues.apache.org/jira/browse/SPARK-13210
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.6.1, 2.0.0
>
>
> When run TPCDS query Q78 with scale 10:
> {code}
> 16/02/04 22:39:09 ERROR Executor: Managed memory leak detected; size = 
> 268435456 bytes, TID = 143
> 16/02/04 22:39:09 ERROR Executor: Exception in task 0.0 in stage 47.0 (TID 
> 143)
> java.lang.NullPointerException
>   at 
> org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:333)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:60)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:39)
>   at 
> org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:239)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:415)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:116)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:87)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:60)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45)
>   at org.apache.spark.scheduler.Task.run(Task.scala:81)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13210) NPE in Sort

2017-05-10 Thread David McWhorter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005591#comment-16005591
 ] 

David McWhorter commented on SPARK-13210:
-

I suppose that may be a different error actually...

> NPE in Sort
> ---
>
> Key: SPARK-13210
> URL: https://issues.apache.org/jira/browse/SPARK-13210
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.6.1, 2.0.0
>
>
> When run TPCDS query Q78 with scale 10:
> {code}
> 16/02/04 22:39:09 ERROR Executor: Managed memory leak detected; size = 
> 268435456 bytes, TID = 143
> 16/02/04 22:39:09 ERROR Executor: Exception in task 0.0 in stage 47.0 (TID 
> 143)
> java.lang.NullPointerException
>   at 
> org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:333)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:60)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:39)
>   at 
> org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:239)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:415)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:116)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:87)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:60)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45)
>   at org.apache.spark.scheduler.Task.run(Task.scala:81)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13210) NPE in Sort

2017-05-10 Thread David McWhorter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005588#comment-16005588
 ] 

David McWhorter commented on SPARK-13210:
-

[~srowen] Here's the error from Spark 2.1.1 with assertions enabled:

[Stage 28:>  (0 + 7) / 60][Stage 36:>  (7 + 1) / 93][Stage 37:>  (0 + 0) / 
11]2017-05-10 18:43:05 ERROR TaskContextImpl:91 - Error in 
TaskCompletionListener
java.lang.AssertionError
at 
org.apache.spark.memory.TaskMemoryManager.freePage(TaskMemoryManager.java:288)
at 
org.apache.spark.memory.MemoryConsumer.freePage(MemoryConsumer.java:140)
at 
org.apache.spark.memory.MemoryConsumer.freeArray(MemoryConsumer.java:110)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.free(UnsafeInMemorySorter.java:166)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.cleanupResources(UnsafeExternalSorter.java:329)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter$1.onTaskCompletion(UnsafeExternalSorter.java:169)
at 
org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:97)
at 
org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:95)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:95)
at org.apache.spark.scheduler.Task.run(Task.scala:112)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2017-05-10 18:43:06 ERROR Executor:91 - Exception in task 1.0 in stage 28.0 
(TID 465)
org.apache.spark.util.TaskCompletionListenerException
at 
org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:105)
at org.apache.spark.scheduler.Task.run(Task.scala:112)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[Stage 28:>  (0 + 8) / 60][Stage 36:>  (7 + 1) / 93][Stage 37:>  (0 + 0) / 
11]2017-05-10 18:43:06 WARN  TaskSetManager:66 - Lost task 1.0 in stage 28.0 
(TID 465, localhost, executor driver): 
org.apache.spark.util.TaskCompletionListenerException
at 
org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:105)
at org.apache.spark.scheduler.Task.run(Task.scala:112)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

2017-05-10 18:43:06 ERROR TaskSetManager:70 - Task 1 in stage 28.0 failed 1 
times; aborting job


> NPE in Sort
> ---
>
> Key: SPARK-13210
> URL: https://issues.apache.org/jira/browse/SPARK-13210
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.6.1, 2.0.0
>
>
> When run TPCDS query Q78 with scale 10:
> {code}
> 16/02/04 22:39:09 ERROR Executor: Managed memory leak detected; size = 
> 268435456 bytes, TID = 143
> 16/02/04 22:39:09 ERROR Executor: Exception in task 0.0 in stage 47.0 (TID 
> 143)
> java.lang.NullPointerException
>   at 
> org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:333)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:60)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:39)
>   at 
> org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:239)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:415)
>   at 
> 

[jira] [Commented] (SPARK-20684) expose createGlobalTempView and dropGlobalTempView in SparkR

2017-05-10 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005553#comment-16005553
 ] 

Dongjoon Hyun commented on SPARK-20684:
---

Thank you for confirming!

> expose createGlobalTempView and dropGlobalTempView in SparkR
> 
>
> Key: SPARK-20684
> URL: https://issues.apache.org/jira/browse/SPARK-20684
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>
> This is a useful API that is not exposed in SparkR. It will help with moving 
> data between languages in a single Spark application.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20684) expose createGlobalTempView and dropGlobalTempView in SparkR

2017-05-10 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-20684:
---
Summary: expose createGlobalTempView and dropGlobalTempView in SparkR  
(was: expose createGlobalTempView in SparkR)

> expose createGlobalTempView and dropGlobalTempView in SparkR
> 
>
> Key: SPARK-20684
> URL: https://issues.apache.org/jira/browse/SPARK-20684
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>
> This is a useful API that is not exposed in SparkR. It will help with moving 
> data between languages in a single Spark application.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20684) expose createGlobalTempView in SparkR

2017-05-10 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005552#comment-16005552
 ] 

Hossein Falaki commented on SPARK-20684:


Yes I agree.

> expose createGlobalTempView in SparkR
> -
>
> Key: SPARK-20684
> URL: https://issues.apache.org/jira/browse/SPARK-20684
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>
> This is a useful API that is not exposed in SparkR. It will help with moving 
> data between languages in a single Spark application.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20684) expose createGlobalTempView in SparkR

2017-05-10 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005545#comment-16005545
 ] 

Dongjoon Hyun commented on SPARK-20684:
---

We need `dropGlobalTempView`, too.

> expose createGlobalTempView in SparkR
> -
>
> Key: SPARK-20684
> URL: https://issues.apache.org/jira/browse/SPARK-20684
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>
> This is a useful API that is not exposed in SparkR. It will help with moving 
> data between languages in a single Spark application.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20684) expose createGlobalTempView in SparkR

2017-05-10 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005527#comment-16005527
 ] 

Dongjoon Hyun commented on SPARK-20684:
---

Hi, [~falaki]. 
I'll make a PR for this.

> expose createGlobalTempView in SparkR
> -
>
> Key: SPARK-20684
> URL: https://issues.apache.org/jira/browse/SPARK-20684
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>
> This is a useful API that is not exposed in SparkR. It will help with moving 
> data between languages in a single Spark application.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20700) InferFiltersFromConstraints stackoverflows for query (v2)

2017-05-10 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-20700:
---
Description: 
The following (complicated) query eventually fails with a stack overflow during 
optimization:

{code}
CREATE TEMPORARY VIEW table_5(varchar0002_col_1, smallint_col_2, float_col_3, 
int_col_4, string_col_5, timestamp_col_6, string_col_7) AS VALUES
  ('68', CAST(NULL AS SMALLINT), CAST(244.90413 AS FLOAT), -137, '571', 
TIMESTAMP('2015-01-14 00:00:00.0'), '947'),
  ('82', CAST(213 AS SMALLINT), CAST(53.184647 AS FLOAT), -724, '-278', 
TIMESTAMP('1999-08-15 00:00:00.0'), '437'),
  ('-7', CAST(-15 AS SMALLINT), CAST(NULL AS FLOAT), -890, '778', 
TIMESTAMP('1991-05-23 00:00:00.0'), '630'),
  ('22', CAST(676 AS SMALLINT), CAST(385.27386 AS FLOAT), CAST(NULL AS INT), 
'-10', TIMESTAMP('1996-09-29 00:00:00.0'), '641'),
  ('16', CAST(430 AS SMALLINT), CAST(187.23717 AS FLOAT), 989, CAST(NULL AS 
STRING), TIMESTAMP('2024-04-21 00:00:00.0'), '-234'),
  ('83', CAST(760 AS SMALLINT), CAST(-695.45386 AS FLOAT), -970, '330', 
CAST(NULL AS TIMESTAMP), '-740'),
  ('68', CAST(-930 AS SMALLINT), CAST(NULL AS FLOAT), -915, '-766', CAST(NULL 
AS TIMESTAMP), CAST(NULL AS STRING)),
  ('48', CAST(692 AS SMALLINT), CAST(-220.59615 AS FLOAT), 940, '-514', 
CAST(NULL AS TIMESTAMP), '181'),
  ('21', CAST(44 AS SMALLINT), CAST(NULL AS FLOAT), -175, '761', 
TIMESTAMP('2016-06-30 00:00:00.0'), '487'),
  ('50', CAST(953 AS SMALLINT), CAST(837.2948 AS FLOAT), 705, CAST(NULL AS 
STRING), CAST(NULL AS TIMESTAMP), '-62');

CREATE VIEW bools(a, b) as values (1, true), (1, true), (1, null);

SELECT
AVG(-13) OVER (ORDER BY COUNT(t1.smallint_col_2) DESC ROWS 27 PRECEDING ) AS 
float_col,
COUNT(t1.smallint_col_2) AS int_col
FROM table_5 t1
INNER JOIN (
SELECT
(MIN(-83) OVER (PARTITION BY t2.a ORDER BY t2.a, (t1.int_col_4) * 
(t1.int_col_4) ROWS BETWEEN CURRENT ROW AND 15 FOLLOWING)) NOT IN (-222, 928) 
AS boolean_col,
t2.a,
(t1.int_col_4) * (t1.int_col_4) AS int_col
FROM table_5 t1
LEFT JOIN bools t2 ON (t2.a) = (t1.int_col_4)
WHERE
(t1.smallint_col_2) > (t1.smallint_col_2)
GROUP BY
t2.a,
(t1.int_col_4) * (t1.int_col_4)
HAVING
((t1.int_col_4) * (t1.int_col_4)) IN ((t1.int_col_4) * (t1.int_col_4), 
SUM(t1.int_col_4))
) t2 ON (((t2.int_col) = (t1.int_col_4)) AND ((t2.a) = (t1.int_col_4))) AND 
((t2.a) = (t1.smallint_col_2));
{code}

(I haven't tried to minimize this failing case yet).

Based on sampled jstacks from the driver, it looks like the query might be 
repeatedly inferring filters from constraints and then pruning those filters.
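
One way to test that hypothesis (a sketch; it assumes the 
{{spark.sql.constraintPropagation.enabled}} flag added for 2.2 is present in this 
build, and {{failingQuery}} stands for the SQL text above held in a String):

{code}
// If InferFiltersFromConstraints is the culprit, disabling constraint propagation
// should let the same query get through optimization without the stack overflow.
spark.conf.set("spark.sql.constraintPropagation.enabled", "false")
spark.sql(failingQuery).explain(true)
{code}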

Here's part of the stack at the point where it stackoverflows:

{code}
[... repeats ...]
at 
org.apache.spark.sql.catalyst.expressions.Canonicalize$.org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative(Canonicalize.scala:50)
at 
org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50)
at 
org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at 
org.apache.spark.sql.catalyst.expressions.Canonicalize$.org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative(Canonicalize.scala:50)
at 
org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50)
at 
org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at 
org.apache.spark.sql.catalyst.expressions.Canonicalize$.org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative(Canonicalize.scala:50)
at 
org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50)
at 

[jira] [Updated] (SPARK-20700) InferFiltersFromConstraints stackoverflows for query (v2)

2017-05-10 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-20700:
---
Summary: InferFiltersFromConstraints stackoverflows for query (v2)  (was: 
Expression canonicalization hits stack overflow for query)

> InferFiltersFromConstraints stackoverflows for query (v2)
> -
>
> Key: SPARK-20700
> URL: https://issues.apache.org/jira/browse/SPARK-20700
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.2.0
>Reporter: Josh Rosen
>
> The following (complicated) query eventually fails with a stack overflow 
> during optimization:
> {code}
> CREATE TEMPORARY VIEW table_5(varchar0002_col_1, smallint_col_2, float_col_3, 
> int_col_4, string_col_5, timestamp_col_6, string_col_7) AS VALUES
>   ('68', CAST(NULL AS SMALLINT), CAST(244.90413 AS FLOAT), -137, '571', 
> TIMESTAMP('2015-01-14 00:00:00.0'), '947'),
>   ('82', CAST(213 AS SMALLINT), CAST(53.184647 AS FLOAT), -724, '-278', 
> TIMESTAMP('1999-08-15 00:00:00.0'), '437'),
>   ('-7', CAST(-15 AS SMALLINT), CAST(NULL AS FLOAT), -890, '778', 
> TIMESTAMP('1991-05-23 00:00:00.0'), '630'),
>   ('22', CAST(676 AS SMALLINT), CAST(385.27386 AS FLOAT), CAST(NULL AS INT), 
> '-10', TIMESTAMP('1996-09-29 00:00:00.0'), '641'),
>   ('16', CAST(430 AS SMALLINT), CAST(187.23717 AS FLOAT), 989, CAST(NULL AS 
> STRING), TIMESTAMP('2024-04-21 00:00:00.0'), '-234'),
>   ('83', CAST(760 AS SMALLINT), CAST(-695.45386 AS FLOAT), -970, '330', 
> CAST(NULL AS TIMESTAMP), '-740'),
>   ('68', CAST(-930 AS SMALLINT), CAST(NULL AS FLOAT), -915, '-766', CAST(NULL 
> AS TIMESTAMP), CAST(NULL AS STRING)),
>   ('48', CAST(692 AS SMALLINT), CAST(-220.59615 AS FLOAT), 940, '-514', 
> CAST(NULL AS TIMESTAMP), '181'),
>   ('21', CAST(44 AS SMALLINT), CAST(NULL AS FLOAT), -175, '761', 
> TIMESTAMP('2016-06-30 00:00:00.0'), '487'),
>   ('50', CAST(953 AS SMALLINT), CAST(837.2948 AS FLOAT), 705, CAST(NULL AS 
> STRING), CAST(NULL AS TIMESTAMP), '-62');
> CREATE VIEW bools(a, b) as values (1, true), (1, true), (1, null);
> SELECT
> AVG(-13) OVER (ORDER BY COUNT(t1.smallint_col_2) DESC ROWS 27 PRECEDING ) AS 
> float_col,
> COUNT(t1.smallint_col_2) AS int_col
> FROM table_5 t1
> INNER JOIN (
> SELECT
> (MIN(-83) OVER (PARTITION BY t2.a ORDER BY t2.a, (t1.int_col_4) * 
> (t1.int_col_4) ROWS BETWEEN CURRENT ROW AND 15 FOLLOWING)) NOT IN (-222, 928) 
> AS boolean_col,
> t2.a,
> (t1.int_col_4) * (t1.int_col_4) AS int_col
> FROM table_5 t1
> LEFT JOIN bools t2 ON (t2.a) = (t1.int_col_4)
> WHERE
> (t1.smallint_col_2) > (t1.smallint_col_2)
> GROUP BY
> t2.a,
> (t1.int_col_4) * (t1.int_col_4)
> HAVING
> ((t1.int_col_4) * (t1.int_col_4)) IN ((t1.int_col_4) * (t1.int_col_4), 
> SUM(t1.int_col_4))
> ) t2 ON (((t2.int_col) = (t1.int_col_4)) AND ((t2.a) = (t1.int_col_4))) AND 
> ((t2.a) = (t1.smallint_col_2));
> {code}
> (I haven't tried to minimize this failing case yet).
> Based on sampled jstacks from the driver, it looks like the query might be 
> repeatedly inferring filters from constraints and then pruning those filters.
> Here's part of the stack at the point where it stackoverflows:
> {code}
> [... repeats ...]
> at 
> org.apache.spark.sql.catalyst.expressions.Canonicalize$.org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative(Canonicalize.scala:50)
> at 
> org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50)
> at 
> org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> at scala.collection.immutable.List.flatMap(List.scala:344)
> at 
> org.apache.spark.sql.catalyst.expressions.Canonicalize$.org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative(Canonicalize.scala:50)
> at 
> org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50)
> at 
> org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at 
> 

[jira] [Created] (SPARK-20700) Expression canonicalization hits stack overflow for query

2017-05-10 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-20700:
--

 Summary: Expression canonicalization hits stack overflow for query
 Key: SPARK-20700
 URL: https://issues.apache.org/jira/browse/SPARK-20700
 Project: Spark
  Issue Type: Bug
  Components: Optimizer, SQL
Affects Versions: 2.2.0
Reporter: Josh Rosen


The following (complicated) query eventually fails with a stack overflow during 
optimization:

{code}
CREATE TEMPORARY VIEW table_5(varchar0002_col_1, smallint_col_2, float_col_3, 
int_col_4, string_col_5, timestamp_col_6, string_col_7) AS VALUES
  ('68', CAST(NULL AS SMALLINT), CAST(244.90413 AS FLOAT), -137, '571', 
TIMESTAMP('2015-01-14 00:00:00.0'), '947'),
  ('82', CAST(213 AS SMALLINT), CAST(53.184647 AS FLOAT), -724, '-278', 
TIMESTAMP('1999-08-15 00:00:00.0'), '437'),
  ('-7', CAST(-15 AS SMALLINT), CAST(NULL AS FLOAT), -890, '778', 
TIMESTAMP('1991-05-23 00:00:00.0'), '630'),
  ('22', CAST(676 AS SMALLINT), CAST(385.27386 AS FLOAT), CAST(NULL AS INT), 
'-10', TIMESTAMP('1996-09-29 00:00:00.0'), '641'),
  ('16', CAST(430 AS SMALLINT), CAST(187.23717 AS FLOAT), 989, CAST(NULL AS 
STRING), TIMESTAMP('2024-04-21 00:00:00.0'), '-234'),
  ('83', CAST(760 AS SMALLINT), CAST(-695.45386 AS FLOAT), -970, '330', 
CAST(NULL AS TIMESTAMP), '-740'),
  ('68', CAST(-930 AS SMALLINT), CAST(NULL AS FLOAT), -915, '-766', CAST(NULL 
AS TIMESTAMP), CAST(NULL AS STRING)),
  ('48', CAST(692 AS SMALLINT), CAST(-220.59615 AS FLOAT), 940, '-514', 
CAST(NULL AS TIMESTAMP), '181'),
  ('21', CAST(44 AS SMALLINT), CAST(NULL AS FLOAT), -175, '761', 
TIMESTAMP('2016-06-30 00:00:00.0'), '487'),
  ('50', CAST(953 AS SMALLINT), CAST(837.2948 AS FLOAT), 705, CAST(NULL AS 
STRING), CAST(NULL AS TIMESTAMP), '-62');

CREATE VIEW bools(a, b) as values (1, true), (1, true), (1, null);

SELECT
AVG(-13) OVER (ORDER BY COUNT(t1.smallint_col_2) DESC ROWS 27 PRECEDING ) AS 
float_col,
COUNT(t1.smallint_col_2) AS int_col
FROM table_5 t1
INNER JOIN (
SELECT
(MIN(-83) OVER (PARTITION BY t2.a ORDER BY t2.a, (t1.int_col_4) * 
(t1.int_col_4) ROWS BETWEEN CURRENT ROW AND 15 FOLLOWING)) NOT IN (-222, 928) 
AS boolean_col,
t2.a,
(t1.int_col_4) * (t1.int_col_4) AS int_col
FROM table_5 t1
LEFT JOIN bools t2 ON (t2.a) = (t1.int_col_4)
WHERE
(t1.smallint_col_2) > (t1.smallint_col_2)
GROUP BY
t2.a,
(t1.int_col_4) * (t1.int_col_4)
HAVING
((t1.int_col_4) * (t1.int_col_4)) IN ((t1.int_col_4) * (t1.int_col_4), 
SUM(t1.int_col_4))
) t2 ON (((t2.int_col) = (t1.int_col_4)) AND ((t2.a) = (t1.int_col_4))) AND 
((t2.a) = (t1.smallint_col_2));
{code}

(I haven't tried to minimize this failing case yet).

Based on sampled jstacks from the driver, it looks like the query might be 
repeatedly inferring filters from constraints and then pruning those filters.

Here's part of the stack at the point where it stackoverflows:

{code}
[... repeats ...]
at 
org.apache.spark.sql.catalyst.expressions.Canonicalize$.org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative(Canonicalize.scala:50)
at 
org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50)
at 
org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at 
org.apache.spark.sql.catalyst.expressions.Canonicalize$.org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative(Canonicalize.scala:50)
at 
org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50)
at 
org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at 
org.apache.spark.sql.catalyst.expressions.Canonicalize$.org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative(Canonicalize.scala:50)
at 

[jira] [Assigned] (SPARK-20504) ML 2.2 QA: API: Java compatibility, docs

2017-05-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-20504:
-

Assignee: Weichen Xu  (was: Joseph K. Bradley)

> ML 2.2 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-20504
> URL: https://issues.apache.org/jira/browse/SPARK-20504
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Blocker
>
> Check Java compatibility for this release:
> * APIs in {{spark.ml}}
> * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
> * Checking for differences in how Scala and Java handle types. Some items to 
> look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.  These may not 
> be understood in Java, or they may be accessible only via the weirdly named 
> Java types (with "$" or "#") which are generated by the Scala compiler.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.  (In {{spark.ml}}, we have largely tried to avoid 
> using enumerations, and have instead favored plain strings.)
> * Check for differences in generated Scala vs Java docs.  E.g., one past 
> issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
> * Remember that we should not break APIs from previous releases.  If you find 
> a problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
> * There are not great tools.  In the past, this task has been done by:
> ** Generating API docs
> ** Building JAR and outputting the Java class signatures for MLlib
> ** Manually inspecting and searching the docs and class signatures for issues
> * If you do have ideas for better tooling, please say so, so that we can make 
> this task easier in the future!



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20699) The end of Python stdout/stderr streams may be lost by PythonRunner

2017-05-10 Thread Nick Gates (JIRA)
Nick Gates created SPARK-20699:
--

 Summary: The end of Python stdout/stderr streams may be lost by 
PythonRunner
 Key: SPARK-20699
 URL: https://issues.apache.org/jira/browse/SPARK-20699
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.1.1
Reporter: Nick Gates
Priority: Minor


The RedirectThread that copies over the Python stdout/stderr streams is never 
joined, so the PythonRunner may throw an exception before all of the output has 
been copied.
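
A minimal sketch of the general pattern at issue (plain Scala, not Spark's actual 
PythonRunner code; the python invocation is only for illustration):

{code}
// A thread pumping a child process's stdout should be joined before the parent
// gives up on the process, otherwise the tail of the output can be lost.
val proc = new ProcessBuilder("python", "-c", "print('hello'); raise SystemExit(1)").start()
val pump = new Thread(new Runnable {
  override def run(): Unit =
    scala.io.Source.fromInputStream(proc.getInputStream).getLines().foreach(println)
})
pump.start()
val exit = proc.waitFor()
pump.join()  // without this join, "hello" may never be printed before the failure below
if (exit != 0) throw new RuntimeException(s"python exited with $exit")
{code}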



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20698) =, ==, > is not working as expected when used in sql query

2017-05-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20698.
---
   Resolution: Invalid
Fix Version/s: (was: 1.6.2)

This isn't a place to ask people to debug your code. You're better off posting 
a much narrowed down version to StackOverflow, with your data.
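
For what it's worth, the counts in the posted query likely all come out equal 
because COUNT counts every non-null value, and a CASE ... ELSE 0 END branch is 
non-null for every row. A minimal sketch (Spark 2.x spark-shell syntax; on 1.6.2 
the same queries would go through sqlContext):

{code}
// COUNT with an ELSE 0 counts all rows; drop the ELSE or use SUM of the 0/1 flag
// to get the intended conditional count.
spark.range(0, 10).createOrReplaceTempView("t")
spark.sql("""
  SELECT
    count(CASE WHEN id > 7 THEN 1 ELSE 0 END) AS count_with_else, -- 10
    count(CASE WHEN id > 7 THEN 1 END)        AS count_no_else,   -- 2
    sum(CASE WHEN id > 7 THEN 1 ELSE 0 END)   AS sum_of_flag      -- 2
  FROM t
""").show(false)
{code}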

> =, ==, > is not working as expected when used in sql query
> --
>
> Key: SPARK-20698
> URL: https://issues.apache.org/jira/browse/SPARK-20698
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
> Environment: windows
>Reporter: someshwar kale
>Priority: Critical
>
> I have written the Spark program below; it's not working as expected
> {code}
> package computedBatch;
> import org.apache.log4j.Level;
> import org.apache.log4j.Logger;
> import org.apache.spark.SparkConf;
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.spark.api.java.JavaSparkContext;
> import org.apache.spark.api.java.function.Function;
> import org.apache.spark.sql.DataFrame;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.RowFactory;
> import org.apache.spark.sql.SQLContext;
> import org.apache.spark.sql.hive.HiveContext;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> import java.util.ArrayList;
> import java.util.Arrays;
> import java.util.List;
> public class ArithmeticIssueTest {
> private transient JavaSparkContext javaSparkContext;
> private transient SQLContext sqlContext;
> public ArithmeticIssueTest() {
> Logger.getLogger("org").setLevel(Level.OFF);
> Logger.getLogger("akka").setLevel(Level.OFF);
> SparkConf conf = new 
> SparkConf().setAppName("ArithmeticIssueTest").setMaster("local[4]");
> javaSparkContext = new JavaSparkContext(conf);
> sqlContext = new HiveContext(javaSparkContext);
> }
> public static void main(String[] args) {
>   ArithmeticIssueTest arithmeticIssueTest = new ArithmeticIssueTest();
> arithmeticIssueTest.execute();
> }
> private void execute(){
> List<String> data = Arrays.asList(
> "a1,1494389759,99.8793003568,325.389705932",
> "a1,1494389759,99.9472573803,325.27559502",
> "a1,1494389759,99.7887233987,325.334374851",
> "a1,1494389759,99.9547800925,325.371537062",
> "a1,1494389759,99.8039111691,325.305285877",
> "a1,1494389759,99.8342317379,325.24881354",
> "a1,1494389759,99.9849449235,325.396678931",
> "a1,1494389759,99.9396731311,325.336115345",
> "a1,1494389759,99.9320915068,325.242622938",
> "a1,1494389759,99.894669,325.320965146",
> "a1,1494389759,99.7735359781,325.345168334",
> "a1,1494389759,99.9698837734,325.352291407",
> "a1,1494389759,99.8418330703,325.296539372",
> "a1,1494389759,99.796315751,325.347570632",
> "a1,1494389759,99.7811931613,325.351137315",
> "a1,1494389759,99.9773765104,325.218131741",
> "a1,1494389759,99.8189825201,325.288197381",
> "a1,1494389759,99.8115005369,325.282327633",
> "a1,1494389759,99.9924539722,325.24048614",
> "a1,1494389759,99.9170191204,325.299431664");
> JavaRDD<String> rawData = javaSparkContext.parallelize(data);
> List<StructField> fields = new ArrayList<>();
> fields.add(DataTypes.createStructField("ASSET_ID", 
> DataTypes.StringType, true));
> fields.add(DataTypes.createStructField("TIMESTAMP", 
> DataTypes.LongType, true));
> fields.add(DataTypes.createStructField("fuel", DataTypes.DoubleType, 
> true));
> fields.add(DataTypes.createStructField("temperature", 
> DataTypes.DoubleType, true));
> StructType schema = DataTypes.createStructType(fields);
> JavaRDD<Row> rowRDD = rawData.map(
> (Function<String, Row>) record -> {
> String[] fields1 = record.split(",");
> return RowFactory.create(
> fields1[0].trim(),
> Long.parseLong(fields1[1].trim()),
> Double.parseDouble(fields1[2].trim()),
> Double.parseDouble(fields1[3].trim()));
> });
> DataFrame df = sqlContext.createDataFrame(rowRDD, schema);
> df.show(false);
> df.registerTempTable("x_linkx1087571272_filtered");
> sqlContext.sql("SELECT x_linkx1087571272_filtered.ASSET_ID, 
> count(case when x_linkx1087571272_filtered" +
> ".temperature=325.0 then 1 

[jira] [Updated] (SPARK-20698) =, ==, > is not working as expected when used in sql query

2017-05-10 Thread someshwar kale (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

someshwar kale updated SPARK-20698:
---
Description: 
I have written the Spark program below; it's not working as expected

{code}
package computedBatch;

import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;


public class ArithmeticIssueTest {
private transient JavaSparkContext javaSparkContext;
private transient SQLContext sqlContext;

public ArithmeticIssueTest() {
Logger.getLogger("org").setLevel(Level.OFF);
Logger.getLogger("akka").setLevel(Level.OFF);
SparkConf conf = new 
SparkConf().setAppName("ArithmeticIssueTest").setMaster("local[4]");
javaSparkContext = new JavaSparkContext(conf);
sqlContext = new HiveContext(javaSparkContext);
}

public static void main(String[] args) {
  ArithmeticIssueTest arithmeticIssueTest = new ArithmeticIssueTest();
arithmeticIssueTest.execute();
}

private void execute(){
List<String> data = Arrays.asList(
"a1,1494389759,99.8793003568,325.389705932",
"a1,1494389759,99.9472573803,325.27559502",
"a1,1494389759,99.7887233987,325.334374851",
"a1,1494389759,99.9547800925,325.371537062",
"a1,1494389759,99.8039111691,325.305285877",
"a1,1494389759,99.8342317379,325.24881354",
"a1,1494389759,99.9849449235,325.396678931",
"a1,1494389759,99.9396731311,325.336115345",
"a1,1494389759,99.9320915068,325.242622938",
"a1,1494389759,99.894669,325.320965146",
"a1,1494389759,99.7735359781,325.345168334",
"a1,1494389759,99.9698837734,325.352291407",
"a1,1494389759,99.8418330703,325.296539372",
"a1,1494389759,99.796315751,325.347570632",
"a1,1494389759,99.7811931613,325.351137315",
"a1,1494389759,99.9773765104,325.218131741",
"a1,1494389759,99.8189825201,325.288197381",
"a1,1494389759,99.8115005369,325.282327633",
"a1,1494389759,99.9924539722,325.24048614",
"a1,1494389759,99.9170191204,325.299431664");
JavaRDD<String> rawData = javaSparkContext.parallelize(data);
List<StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("ASSET_ID", 
DataTypes.StringType, true));
fields.add(DataTypes.createStructField("TIMESTAMP", DataTypes.LongType, 
true));
fields.add(DataTypes.createStructField("fuel", DataTypes.DoubleType, 
true));
fields.add(DataTypes.createStructField("temperature", 
DataTypes.DoubleType, true));
StructType schema = DataTypes.createStructType(fields);
JavaRDD<Row> rowRDD = rawData.map(
(Function<String, Row>) record -> {
String[] fields1 = record.split(",");
return RowFactory.create(
fields1[0].trim(),
Long.parseLong(fields1[1].trim()),
Double.parseDouble(fields1[2].trim()),
Double.parseDouble(fields1[3].trim()));
});
DataFrame df = sqlContext.createDataFrame(rowRDD, schema);
df.show(false);
df.registerTempTable("x_linkx1087571272_filtered");

sqlContext.sql("SELECT x_linkx1087571272_filtered.ASSET_ID, count(case 
when x_linkx1087571272_filtered" +
".temperature=325.0 then 1 else 0 end) AS xsumptionx1582594572, 
max(x_linkx1087571272_filtered" +
".TIMESTAMP) AS eventTime  FROM x_linkx1087571272_filtered 
GROUP BY x_linkx1087571272_filtered" +
".ASSET_ID").show(false);

sqlContext.sql("SELECT x_linkx1087571272_filtered.ASSET_ID, count(case 
when x_linkx1087571272_filtered" +
".fuel>99.8 then 1 else 0 end) AS xnsumptionx352569416, 
max(x_linkx1087571272_filtered.TIMESTAMP) AS " +
"eventTime  FROM x_linkx1087571272_filtered GROUP BY 
x_linkx1087571272_filtered.ASSET_ID").show(false);

//+
sqlContext.sql("SELECT x_linkx1087571272_filtered.ASSET_ID, count(case 
when x_linkx1087571272_filtered" +
".temperature==325.0 then 1 else 0 end) AS 

[jira] [Updated] (SPARK-20698) =, ==, > is not working as expected when used in sql query

2017-05-10 Thread someshwar kale (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

someshwar kale updated SPARK-20698:
---
Description: 
I have written the Spark program below; it's not working as expected


{code}
package computedBatch;

import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;


public class ArithmeticIssueTest {
private transient JavaSparkContext javaSparkContext;
private transient SQLContext sqlContext;

public ArithmeticIssueTest() {
Logger.getLogger("org").setLevel(Level.OFF);
Logger.getLogger("akka").setLevel(Level.OFF);
SparkConf conf = new 
SparkConf().setAppName("ArithmeticIssueTest").setMaster("local[4]");
javaSparkContext = new JavaSparkContext(conf);
sqlContext = new HiveContext(javaSparkContext);
}

public static void main(String[] args) {
  ArithmeticIssueTest arithmeticIssueTest = new ArithmeticIssueTest();
arithmeticIssueTest.execute();
}

private void execute(){
List<String> data = Arrays.asList(
"a1,1494389759,99.8793003568,325.389705932",
"a1,1494389759,99.9472573803,325.27559502",
"a1,1494389759,99.7887233987,325.334374851",
"a1,1494389759,99.9547800925,325.371537062",
"a1,1494389759,99.8039111691,325.305285877",
"a1,1494389759,99.8342317379,325.24881354",
"a1,1494389759,99.9849449235,325.396678931",
"a1,1494389759,99.9396731311,325.336115345",
"a1,1494389759,99.9320915068,325.242622938",
"a1,1494389759,99.894669,325.320965146",
"a1,1494389759,99.7735359781,325.345168334",
"a1,1494389759,99.9698837734,325.352291407",
"a1,1494389759,99.8418330703,325.296539372",
"a1,1494389759,99.796315751,325.347570632",
"a1,1494389759,99.7811931613,325.351137315",
"a1,1494389759,99.9773765104,325.218131741",
"a1,1494389759,99.8189825201,325.288197381",
"a1,1494389759,99.8115005369,325.282327633",
"a1,1494389759,99.9924539722,325.24048614",
"a1,1494389759,99.9170191204,325.299431664");
JavaRDD<String> rawData = javaSparkContext.parallelize(data);
List<StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("ASSET_ID", 
DataTypes.StringType, true));
fields.add(DataTypes.createStructField("TIMESTAMP", DataTypes.LongType, 
true));
fields.add(DataTypes.createStructField("fuel", DataTypes.DoubleType, 
true));
fields.add(DataTypes.createStructField("temperature", 
DataTypes.DoubleType, true));
StructType schema = DataTypes.createStructType(fields);
JavaRDD<Row> rowRDD = rawData.map(
(Function<String, Row>) record -> {
String[] fields1 = record.split(",");
return RowFactory.create(
fields1[0].trim(),
Long.parseLong(fields1[1].trim()),
Double.parseDouble(fields1[2].trim()),
Double.parseDouble(fields1[3].trim()));
});
DataFrame df = sqlContext.createDataFrame(rowRDD, schema);
df.show(false);
df.registerTempTable("x_linkx1087571272_filtered");

sqlContext.sql("SELECT x_linkx1087571272_filtered.ASSET_ID, count(case 
when x_linkx1087571272_filtered" +
".temperature=325.0 then 1 else 0 end) AS xsumptionx1582594572, 
max(x_linkx1087571272_filtered" +
".TIMESTAMP) AS eventTime  FROM x_linkx1087571272_filtered 
GROUP BY x_linkx1087571272_filtered" +
".ASSET_ID").show(false);

sqlContext.sql("SELECT x_linkx1087571272_filtered.ASSET_ID, count(case 
when x_linkx1087571272_filtered" +
".fuel>99.8 then 1 else 0 end) AS xnsumptionx352569416, 
max(x_linkx1087571272_filtered.TIMESTAMP) AS " +
"eventTime  FROM x_linkx1087571272_filtered GROUP BY 
x_linkx1087571272_filtered.ASSET_ID").show(false);

//+
sqlContext.sql("SELECT x_linkx1087571272_filtered.ASSET_ID, count(case 
when x_linkx1087571272_filtered" +
".temperature==325.0 

[jira] [Created] (SPARK-20698) =, ==, > is not working as expected when used in sql query

2017-05-10 Thread someshwar kale (JIRA)
someshwar kale created SPARK-20698:
--

 Summary: =, ==, > is not working as expected when used in sql query
 Key: SPARK-20698
 URL: https://issues.apache.org/jira/browse/SPARK-20698
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.2
 Environment: windows
Reporter: someshwar kale
Priority: Critical
 Fix For: 1.6.2


I have written the Spark program below; it's not working as expected


package computedBatch;

import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;


public class ArithmeticIssueTest {
private transient JavaSparkContext javaSparkContext;
private transient SQLContext sqlContext;

public ArithmeticIssueTest() {
Logger.getLogger("org").setLevel(Level.OFF);
Logger.getLogger("akka").setLevel(Level.OFF);
SparkConf conf = new 
SparkConf().setAppName("ArithmeticIssueTest").setMaster("local[4]");
javaSparkContext = new JavaSparkContext(conf);
sqlContext = new HiveContext(javaSparkContext);
}

public static void main(String[] args) {
  ArithmeticIssueTest arithmeticIssueTest = new ArithmeticIssueTest();
arithmeticIssueTest.execute();
}

private void execute(){
List<String> data = Arrays.asList(
"a1,1494389759,99.8793003568,325.389705932",
"a1,1494389759,99.9472573803,325.27559502",
"a1,1494389759,99.7887233987,325.334374851",
"a1,1494389759,99.9547800925,325.371537062",
"a1,1494389759,99.8039111691,325.305285877",
"a1,1494389759,99.8342317379,325.24881354",
"a1,1494389759,99.9849449235,325.396678931",
"a1,1494389759,99.9396731311,325.336115345",
"a1,1494389759,99.9320915068,325.242622938",
"a1,1494389759,99.894669,325.320965146",
"a1,1494389759,99.7735359781,325.345168334",
"a1,1494389759,99.9698837734,325.352291407",
"a1,1494389759,99.8418330703,325.296539372",
"a1,1494389759,99.796315751,325.347570632",
"a1,1494389759,99.7811931613,325.351137315",
"a1,1494389759,99.9773765104,325.218131741",
"a1,1494389759,99.8189825201,325.288197381",
"a1,1494389759,99.8115005369,325.282327633",
"a1,1494389759,99.9924539722,325.24048614",
"a1,1494389759,99.9170191204,325.299431664");
JavaRDD<String> rawData = javaSparkContext.parallelize(data);
List<StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("ASSET_ID", 
DataTypes.StringType, true));
fields.add(DataTypes.createStructField("TIMESTAMP", DataTypes.LongType, 
true));
fields.add(DataTypes.createStructField("fuel", DataTypes.DoubleType, 
true));
fields.add(DataTypes.createStructField("temperature", 
DataTypes.DoubleType, true));
StructType schema = DataTypes.createStructType(fields);
JavaRDD<Row> rowRDD = rawData.map(
(Function<String, Row>) record -> {
String[] fields1 = record.split(",");
return RowFactory.create(
fields1[0].trim(),
Long.parseLong(fields1[1].trim()),
Double.parseDouble(fields1[2].trim()),
Double.parseDouble(fields1[3].trim()));
});
DataFrame df = sqlContext.createDataFrame(rowRDD, schema);
df.show(false);
df.registerTempTable("x_linkx1087571272_filtered");

sqlContext.sql("SELECT x_linkx1087571272_filtered.ASSET_ID, count(case 
when x_linkx1087571272_filtered" +
".temperature=325.0 then 1 else 0 end) AS xsumptionx1582594572, 
max(x_linkx1087571272_filtered" +
".TIMESTAMP) AS eventTime  FROM x_linkx1087571272_filtered 
GROUP BY x_linkx1087571272_filtered" +
".ASSET_ID").show(false);

sqlContext.sql("SELECT x_linkx1087571272_filtered.ASSET_ID, count(case 
when x_linkx1087571272_filtered" +
".fuel>99.8 then 1 else 0 end) AS xnsumptionx352569416, 
max(x_linkx1087571272_filtered.TIMESTAMP) AS " +

[jira] [Comment Edited] (SPARK-19354) Killed tasks are getting marked as FAILED

2017-05-10 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005189#comment-16005189
 ] 

Thomas Graves edited comment on SPARK-19354 at 5/10/17 6:37 PM:


[~irashid] wondering if you have seen the issue with blacklisting or perhaps 
there is another fix for that already?  


was (Author: tgraves):
[~squito] wondering if you have seen the issue with blacklisting or perhaps 
there is another fix for that already?

> Killed tasks are getting marked as FAILED
> -
>
> Key: SPARK-19354
> URL: https://issues.apache.org/jira/browse/SPARK-19354
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Reporter: Devaraj K
>
> When we enable speculation, we can see there are multiple attempts running 
> for the same task when the first attempt's progress is slow. If any of the task 
> attempts succeeds then the other attempts will be killed; while being killed, 
> those attempts are getting marked as FAILED due to the below error. 
> We need to handle this error and mark the attempt as KILLED instead of FAILED.
> ||93  ||214   ||1 (speculative)   ||FAILED||ANY   ||1 / 
> xx.xx.xx.x2
> stdout
> stderr||2017/01/24 10:30:44   ||0.2 s ||0.0 B / 0 ||8.0 KB / 400  
> ||java.io.IOException: Failed on local exception: 
> java.nio.channels.ClosedByInterruptException; Host Details : local host is: 
> node2/xx.xx.xx.x2; destination host is: node1:9000; 
> +details||
> {code:xml}
> 17/01/23 23:54:32 INFO Executor: Executor is trying to kill task 93.1 in 
> stage 1.0 (TID 214)
> 17/01/23 23:54:32 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 1
> 17/01/23 23:54:32 ERROR Executor: Exception in task 93.1 in stage 1.0 (TID 
> 214)
> java.io.IOException: Failed on local exception: 
> java.nio.channels.ClosedByInterruptException; Host Details : local host is: 
> "stobdtserver3/10.224.54.70"; destination host is: "stobdtserver2":9000; 
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1479)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1412)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>   at com.sun.proxy.$Proxy17.create(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:296)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy18.create(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1648)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1689)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1624)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:448)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:444)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:459)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:387)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:804)
>   at 
> org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
>   at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1133)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1124)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:88)
>   at org.apache.spark.scheduler.Task.run(Task.scala:114)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at 

[jira] [Updated] (SPARK-19354) Killed tasks are getting marked as FAILED

2017-05-10 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-19354:
--
Priority: Major  (was: Minor)

> Killed tasks are getting marked as FAILED
> -
>
> Key: SPARK-19354
> URL: https://issues.apache.org/jira/browse/SPARK-19354
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Reporter: Devaraj K
>
> When we enable speculation, we can see multiple attempts running for the same 
> task when the first attempt's progress is slow. If any of the task attempts 
> succeeds, the other attempts are killed; while being killed, those attempts 
> get marked as FAILED due to the error below. 
> We need to handle this error and mark the attempt as KILLED instead of FAILED.
> ||93  ||214   ||1 (speculative)   ||FAILED||ANY   ||1 / 
> xx.xx.xx.x2
> stdout
> stderr||2017/01/24 10:30:44   ||0.2 s ||0.0 B / 0 ||8.0 KB / 400  
> ||java.io.IOException: Failed on local exception: 
> java.nio.channels.ClosedByInterruptException; Host Details : local host is: 
> node2/xx.xx.xx.x2; destination host is: node1:9000; 
> +details||
> {code:xml}
> 17/01/23 23:54:32 INFO Executor: Executor is trying to kill task 93.1 in 
> stage 1.0 (TID 214)
> 17/01/23 23:54:32 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 1
> 17/01/23 23:54:32 ERROR Executor: Exception in task 93.1 in stage 1.0 (TID 
> 214)
> java.io.IOException: Failed on local exception: 
> java.nio.channels.ClosedByInterruptException; Host Details : local host is: 
> "stobdtserver3/10.224.54.70"; destination host is: "stobdtserver2":9000; 
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1479)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1412)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>   at com.sun.proxy.$Proxy17.create(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:296)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy18.create(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1648)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1689)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1624)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:448)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:444)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:459)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:387)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:804)
>   at 
> org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
>   at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1133)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1124)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:88)
>   at org.apache.spark.scheduler.Task.run(Task.scala:114)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.nio.channels.ClosedByInterruptException
>   at 
> java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
>   at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:659)
>   at 
> 

[jira] [Commented] (SPARK-19354) Killed tasks are getting marked as FAILED

2017-05-10 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005189#comment-16005189
 ] 

Thomas Graves commented on SPARK-19354:
---

[~squito] wondering if you have seen the issue with blacklisting or perhaps 
there is another fix for that already?

> Killed tasks are getting marked as FAILED
> -
>
> Key: SPARK-19354
> URL: https://issues.apache.org/jira/browse/SPARK-19354
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Reporter: Devaraj K
>
> When we enable speculation, we can see multiple attempts running for the same 
> task when the first attempt's progress is slow. If any of the task attempts 
> succeeds, the other attempts are killed; while being killed, those attempts 
> get marked as FAILED due to the error below. 
> We need to handle this error and mark the attempt as KILLED instead of FAILED.
> ||93  ||214   ||1 (speculative)   ||FAILED||ANY   ||1 / 
> xx.xx.xx.x2
> stdout
> stderr||2017/01/24 10:30:44   ||0.2 s ||0.0 B / 0 ||8.0 KB / 400  
> ||java.io.IOException: Failed on local exception: 
> java.nio.channels.ClosedByInterruptException; Host Details : local host is: 
> node2/xx.xx.xx.x2; destination host is: node1:9000; 
> +details||
> {code:xml}
> 17/01/23 23:54:32 INFO Executor: Executor is trying to kill task 93.1 in 
> stage 1.0 (TID 214)
> 17/01/23 23:54:32 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 1
> 17/01/23 23:54:32 ERROR Executor: Exception in task 93.1 in stage 1.0 (TID 
> 214)
> java.io.IOException: Failed on local exception: 
> java.nio.channels.ClosedByInterruptException; Host Details : local host is: 
> "stobdtserver3/10.224.54.70"; destination host is: "stobdtserver2":9000; 
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1479)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1412)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>   at com.sun.proxy.$Proxy17.create(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:296)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy18.create(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1648)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1689)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1624)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:448)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:444)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:459)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:387)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:804)
>   at 
> org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
>   at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1133)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1124)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:88)
>   at org.apache.spark.scheduler.Task.run(Task.scala:114)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.nio.channels.ClosedByInterruptException
>   at 
> java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
>   at 

[jira] [Updated] (SPARK-19354) Killed tasks are getting marked as FAILED

2017-05-10 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-19354:
--
Issue Type: Bug  (was: Improvement)

> Killed tasks are getting marked as FAILED
> -
>
> Key: SPARK-19354
> URL: https://issues.apache.org/jira/browse/SPARK-19354
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Reporter: Devaraj K
>Priority: Minor
>
> When we enable speculation, we can see multiple attempts running for the same 
> task when the first attempt's progress is slow. If any of the task attempts 
> succeeds, the other attempts are killed; while being killed, those attempts 
> get marked as FAILED due to the error below. 
> We need to handle this error and mark the attempt as KILLED instead of FAILED.
> ||93  ||214   ||1 (speculative)   ||FAILED||ANY   ||1 / 
> xx.xx.xx.x2
> stdout
> stderr||2017/01/24 10:30:44   ||0.2 s ||0.0 B / 0 ||8.0 KB / 400  
> ||java.io.IOException: Failed on local exception: 
> java.nio.channels.ClosedByInterruptException; Host Details : local host is: 
> node2/xx.xx.xx.x2; destination host is: node1:9000; 
> +details||
> {code:xml}
> 17/01/23 23:54:32 INFO Executor: Executor is trying to kill task 93.1 in 
> stage 1.0 (TID 214)
> 17/01/23 23:54:32 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 1
> 17/01/23 23:54:32 ERROR Executor: Exception in task 93.1 in stage 1.0 (TID 
> 214)
> java.io.IOException: Failed on local exception: 
> java.nio.channels.ClosedByInterruptException; Host Details : local host is: 
> "stobdtserver3/10.224.54.70"; destination host is: "stobdtserver2":9000; 
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1479)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1412)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>   at com.sun.proxy.$Proxy17.create(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:296)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy18.create(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1648)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1689)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1624)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:448)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:444)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:459)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:387)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:804)
>   at 
> org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
>   at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1133)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1124)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:88)
>   at org.apache.spark.scheduler.Task.run(Task.scala:114)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.nio.channels.ClosedByInterruptException
>   at 
> java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
>   at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:659)
>   at 
> 

[jira] [Commented] (SPARK-19354) Killed tasks are getting marked as FAILED

2017-05-10 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005182#comment-16005182
 ] 

Thomas Graves commented on SPARK-19354:
---

This is definitely causing issues with blacklisting. Speculative tasks can come 
back marked as FAILED with an error like:

17/05/10 17:14:38 ERROR Executor: Exception in task 71476.0 in stage 52.0 (TID 
317274)
java.nio.channels.ClosedByInterruptException

The executor then gets marked as blacklisted here: 
https://github.com/apache/spark/blob/branch-2.1/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L793
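For illustration, here is a minimal sketch of the direction the summary suggests: report a deliberately killed attempt as KILLED rather than FAILED so it never feeds the blacklist's failure counters. The helper and its names are hypothetical and are not Spark's actual Executor/TaskSetManager code.

{code:title=KillVsFailureSketch.scala|borderStyle=solid}
// Hypothetical helper: decide whether a task attempt that died with an exception
// should be reported as KILLED (it was interrupted on purpose, e.g. a losing
// speculative duplicate) or as FAILED (a genuine failure that may feed blacklisting).
import java.nio.channels.ClosedByInterruptException

sealed trait AttemptOutcome
case object Killed extends AttemptOutcome
case class Failed(cause: Throwable) extends AttemptOutcome

def classifyAttempt(killRequested: Boolean, cause: Throwable): AttemptOutcome = {
  val interrupted = cause.isInstanceOf[InterruptedException] ||
    cause.isInstanceOf[ClosedByInterruptException]
  if (killRequested && interrupted) Killed else Failed(cause)
}
{code}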


> Killed tasks are getting marked as FAILED
> -
>
> Key: SPARK-19354
> URL: https://issues.apache.org/jira/browse/SPARK-19354
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Reporter: Devaraj K
>Priority: Minor
>
> When we enable speculation, we can see multiple attempts running for the same 
> task when the first attempt's progress is slow. If any of the task attempts 
> succeeds, the other attempts are killed; while being killed, those attempts 
> get marked as FAILED due to the error below. 
> We need to handle this error and mark the attempt as KILLED instead of FAILED.
> ||93  ||214   ||1 (speculative)   ||FAILED||ANY   ||1 / 
> xx.xx.xx.x2
> stdout
> stderr||2017/01/24 10:30:44   ||0.2 s ||0.0 B / 0 ||8.0 KB / 400  
> ||java.io.IOException: Failed on local exception: 
> java.nio.channels.ClosedByInterruptException; Host Details : local host is: 
> node2/xx.xx.xx.x2; destination host is: node1:9000; 
> +details||
> {code:xml}
> 17/01/23 23:54:32 INFO Executor: Executor is trying to kill task 93.1 in 
> stage 1.0 (TID 214)
> 17/01/23 23:54:32 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 1
> 17/01/23 23:54:32 ERROR Executor: Exception in task 93.1 in stage 1.0 (TID 
> 214)
> java.io.IOException: Failed on local exception: 
> java.nio.channels.ClosedByInterruptException; Host Details : local host is: 
> "stobdtserver3/10.224.54.70"; destination host is: "stobdtserver2":9000; 
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1479)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1412)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>   at com.sun.proxy.$Proxy17.create(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:296)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy18.create(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1648)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1689)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1624)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:448)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:444)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:459)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:387)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:804)
>   at 
> org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
>   at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1133)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1124)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:88)
>   at org.apache.spark.scheduler.Task.run(Task.scala:114)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at 
> 

[jira] [Commented] (SPARK-20687) mllib.Matrices.fromBreeze may crash when converting breeze CSCMatrix

2017-05-10 Thread Ignacio Bermudez Corrales (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005179#comment-16005179
 ] 

Ignacio Bermudez Corrales commented on SPARK-20687:
---

Proposing a patch in PR https://github.com/apache/spark/pull/17940

> mllib.Matrices.fromBreeze may crash when converting breeze CSCMatrix
> 
>
> Key: SPARK-20687
> URL: https://issues.apache.org/jira/browse/SPARK-20687
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Ignacio Bermudez Corrales
>Priority: Minor
>
> Conversion of Breeze sparse matrices to Matrix is broken when the matrices are 
> the product of certain operations. I think this problem is caused by the update 
> method in Breeze CSCMatrix, which adds provisional zeros to the data for 
> efficiency.
> This bug is serious and may affect at least BlockMatrix addition and 
> subtraction:
> http://stackoverflow.com/questions/33528555/error-thrown-when-using-blockmatrix-add/43883458#43883458
> The following code reproduces the bug (check test("breeze conversion bug")):
> https://github.com/ghoto/spark/blob/test-bug/CSCMatrixBreeze/mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala
> {code:title=MatricesSuite.scala|borderStyle=solid}
>   test("breeze conversion bug") {
> // (2, 0, 0)
> // (2, 0, 0)
> val mat1Brz = Matrices.sparse(2, 3, Array(0, 2, 2, 2), Array(0, 1), 
> Array(2, 2)).asBreeze
> // (2, 1E-15, 1E-15)
> // (2, 1E-15, 1E-15
> val mat2Brz = Matrices.sparse(2, 3, Array(0, 2, 4, 6), Array(0, 0, 0, 1, 
> 1, 1), Array(2, 1E-15, 1E-15, 2, 1E-15, 1E-15)).asBreeze
> // The following shouldn't break
> val t01 = mat1Brz - mat1Brz
> val t02 = mat2Brz - mat2Brz
> val t02Brz = Matrices.fromBreeze(t02)
> val t01Brz = Matrices.fromBreeze(t01)
> val t1Brz = mat1Brz - mat2Brz
> val t2Brz = mat2Brz - mat1Brz
> // The following ones should break
> val t1 = Matrices.fromBreeze(t1Brz)
> val t2 = Matrices.fromBreeze(t2Brz)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.

2017-05-10 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16004607#comment-16004607
 ] 

Zoltan Ivanfi edited comment on SPARK-12297 at 5/10/17 6:26 PM:


bq. It'd be great to consider this more holistically and think about 
alternatives in fixing them

As Ryan mentioned, the Parquet community discussed this timestamp 
incompatibility problem with the aim of avoiding similar problems in the future. 
It was decided that the specification needs to include two separate types with 
well-defined semantics: one for timezone-agnostic (aka. TIMESTAMP WITHOUT 
TIMEZONE) and one for UTC-normalized (aka. TIMESTAMP WITH TIMEZONE) timestamps. 
(Otherwise implementors would be tempted to misuse the single existing type for 
storing timestamps of different semantics, as has already happened with the 
int96 timestamp type.)

Using these two types, SQL engines will be able to unambiguously store their 
timestamp type regardless of its semantics. However, the TIMESTAMP type should 
follow TIMESTAMP WITHOUT TIMEZONE semantics for consistency with other SQL 
engines. The TIMESTAMP WITH TIMEZONE semantics should be implemented as a new 
SQL type with a matching name.

While this is a nice and clean long-term solution, a short-term fix is also 
desired until the new types become widely supported and/or to allow dealing 
with existing data. The commit in question is a part of this short-term fix and 
it allows getting correct values when reading int96 timestamps, even for data 
written by other components.

bq. it completely changes the behavior of one of the most important data types.

A very important aspect of this fix is that it does not change SparkSQL's 
behavior unless the user sets a table property, so it's a completely safe and 
non-breaking change.

bq. One of the fundamental problem is that Spark treats timestamp as timestamp 
with timezone, whereas impala treats timestamp as timestamp without timezone. 
The parquet storage is only a small piece here.

Indeed, the fix only addresses Parquet timestamps. This, however, is intentional 
and is neither a limitation nor an inconsistency, as the problem seems to be 
specific to Parquet. My understanding is that for other file formats, SparkSQL 
follows timezone-agnostic (TIMESTAMP WITHOUT TIMEZONE) semantics and my 
experiments with the CSV and Avro formats seem to confirm this. So using 
UTC-normalized (TIMESTAMP WITH TIMEZONE) semantics in Parquet is not only 
incompatible with Impala but is also inconsistent within SparkSQL itself.

bq. Also this is not just a Parquet issue. The same issue could happen to all 
data formats. It is going to be really confusing to have something that only 
works for Parquet

The current behavior of SparkSQL already seems to be different for Parquet than 
for other formats. The fix allows the user to choose a consistent and less 
confusing behaviour instead. It also makes Impala, Hive and SparkSQL compatible 
with each other regarding int96 timestamps.

bq. It seems like the purpose of this patch can be accomplished by just setting 
the session local timezone to UTC?

Unfortunately that would not suffice. The problem has to be addressed in all SQL 
engines. As of today, Hive and Impala already contain the changes that allow 
interoperability using the parquet.mr.int96.write.zone table property (an example 
of setting it is sketched after the commit list below):

* Hive:
** 
https://github.com/apache/hive/commit/84fdc1c7c8ff0922aa44f829dbfa9659935c503e
** 
https://github.com/apache/hive/commit/a1cbccb8dad1824f978205a1e93ec01e87ed8ed5
** 
https://github.com/apache/hive/commit/2dfcea5a95b7d623484b8be50755b817fbc91ce0
** 
https://github.com/apache/hive/commit/78e29fc70dacec498c35dc556dd7403e4c9f48fe
* Impala:
** 
https://github.com/apache/incubator-impala/commit/5803a0b0744ddaee6830d4a1bc8dba8d3f2caa26
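For illustration, setting the table property itself is plain DDL; whether SparkSQL honors it depends on the patch under discussion, and the table name and zone value below are made up (only the property name comes from this comment):

{code:title=Int96WriteZoneSketch.scala|borderStyle=solid}
// Hypothetical spark-shell example: mark an existing Hive table so that int96
// timestamps are interpreted relative to UTC. "events" is a placeholder name.
spark.sql(
  """ALTER TABLE events
    |SET TBLPROPERTIES ('parquet.mr.int96.write.zone' = 'UTC')""".stripMargin)
{code}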



was (Author: zi):
bq. It'd be great to consider this more holistically and think about 
alternatives in fixing them

As Ryan mentioned, the Parquet community discussed this timestamp 
incompatibility problem with the aim of avoiding similar problems in the future. 
It was decided that the specification needs to include two separate types with 
well-defined semantics: one for timezone-agnostic (aka. TIMESTAMP WITHOUT 
TIMEZONE) and one for UTC-normalized (aka. TIMESTAMP WITH TIMEZONE) timestamps. 
(Otherwise implementors would be tempted to misuse the single existing type for 
storing timestamps of different semantics, as has already happened with the 
int96 timestamp type.)

While this is a nice and clean long-term solution, a short-term fix is also 
desired until the new types become widely supported and/or to allow dealing 
with existing data. The commit in question is a part of this short-term fix and 
it allows getting correct values when reading int96 timestamps, even for data 
written by other components.

bq. it completely changes the behavior of one of the most important data 

[jira] [Assigned] (SPARK-20504) ML 2.2 QA: API: Java compatibility, docs

2017-05-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-20504:
-

Assignee: Joseph K. Bradley

> ML 2.2 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-20504
> URL: https://issues.apache.org/jira/browse/SPARK-20504
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Blocker
>
> Check Java compatibility for this release:
> * APIs in {{spark.ml}}
> * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
> * Checking for differences in how Scala and Java handle types. Some items to 
> look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.  These may not 
> be understood in Java, or they may be accessible only via the weirdly named 
> Java types (with "$" or "#") which are generated by the Scala compiler.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.  (In {{spark.ml}}, we have largely tried to avoid 
> using enumerations, and have instead favored plain strings.)
> * Check for differences in generated Scala vs Java docs.  E.g., one past 
> issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
> * Remember that we should not break APIs from previous releases.  If you find 
> a problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
> * There are not great tools.  In the past, this task has been done by:
> ** Generating API docs
> ** Building JAR and outputting the Java class signatures for MLlib
> ** Manually inspecting and searching the docs and class signatures for issues
> * If you do have ideas for better tooling, please say so we can make this 
> task easier in the future!



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20697) MSCK REPAIR TABLE resets the Storage Information for bucketed hive tables.

2017-05-10 Thread Abhishek Madav (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Madav updated SPARK-20697:
---
Description: 
MSCK REPAIR TABLE, when used to recover partitions for a partitioned+bucketed 
table, does not restore the bucketing information to the storage descriptor in 
the metastore. 

Steps to reproduce:
1) Create a partitioned+bucketed table in Hive: CREATE TABLE partbucket(a int) 
PARTITIONED BY (b int) CLUSTERED BY (a) INTO 10 BUCKETS ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',';

2) In Hive-CLI issue a desc formatted for the table.

# col_name  data_type   comment 
 
a   int 
 
# Partition Information  
# col_name  data_type   comment 
 
b   int 
 
# Detailed Table Information 
Database:   sparkhivebucket  
Owner:  devbld   
CreateTime: Wed May 10 10:31:07 PDT 2017 
LastAccessTime: UNKNOWN  
Protect Mode:   None 
Retention:  0
Location:   hdfs://localhost:8020/user/hive/warehouse/partbucket 
Table Type: MANAGED_TABLE
Table Parameters:
transient_lastDdlTime   1494437467  
 
# Storage Information
SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe  
 
InputFormat:org.apache.hadoop.mapred.TextInputFormat 
OutputFormat:   
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat   
Compressed: No   
Num Buckets:10   
Bucket Columns: [a]  
Sort Columns:   []   
Storage Desc Params: 
field.delim ,   
serialization.format, 

3) In spark-shell, 

scala> spark.sql("MSCK REPAIR TABLE partbucket")

4) Back to Hive-CLI 

desc formatted partbucket;

# col_name  data_type   comment 
 
a   int 
 
# Partition Information  
# col_name  data_type   comment 
 
b   int 
 
# Detailed Table Information 
Database:   sparkhivebucket  
Owner:  devbld   
CreateTime: Wed May 10 10:31:07 PDT 2017 
LastAccessTime: UNKNOWN  
Protect Mode:   None 
Retention:  0
Location:   
hdfs://localhost:8020/user/hive/warehouse/sparkhivebucket.db/partbucket 
Table Type: MANAGED_TABLE
Table Parameters:
spark.sql.partitionProvider catalog 
transient_lastDdlTime   1494437647  
 
# Storage Information
SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe  
 
InputFormat:org.apache.hadoop.mapred.TextInputFormat 
OutputFormat:   
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat   
Compressed: No   
Num Buckets:-1   
Bucket Columns: []   
Sort Columns:   []   
Storage Desc Params: 
field.delim ,   
serialization.format, 


Further inserts to this table cannot be made in bucketed fashion through Hive. 
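A minimal spark-shell sketch for checking the effect directly from Spark; it only re-runs step 3 above and prints the table metadata, and nothing beyond the table name from the reproduction is assumed:

{code:title=CheckBucketingSketch.scala|borderStyle=solid}
// Re-run the repair and then print the storage information; after the repair the
// "Num Buckets" / "Bucket Columns" rows show the metadata loss described above.
spark.sql("MSCK REPAIR TABLE partbucket")
spark.sql("DESCRIBE FORMATTED partbucket").show(100, truncate = false)
{code}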

  was:
MSCK REPAIR TABLE, when used to recover partitions for a partitioned+bucketed 
table, does not restore the bucketing information to the storage descriptor in 
the metastore. 

Steps to reproduce:
1) Create a partitioned+bucketed table in Hive: CREATE TABLE partbucket(a int) 
PARTITIONED BY (b int) CLUSTERED BY (a) INTO 10 BUCKETS ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',';

2) In Hive-CLI issue a desc formatted for the table.

# col_name  data_type   comment 
 
a   int 
 
# Partition Information  
# col_name  data_type   comment 
 
b   int 
 
# Detailed Table Information 
Database:   sparkhivebucket  
Owner:   

[jira] [Created] (SPARK-20697) MSCK REPAIR TABLE resets the Storage Information for bucketed hive tables.

2017-05-10 Thread Abhishek Madav (JIRA)
Abhishek Madav created SPARK-20697:
--

 Summary: MSCK REPAIR TABLE resets the Storage Information for 
bucketed hive tables.
 Key: SPARK-20697
 URL: https://issues.apache.org/jira/browse/SPARK-20697
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Abhishek Madav


MSCK REPAIR TABLE, when used to recover partitions for a partitioned+bucketed 
table, does not restore the bucketing information to the storage descriptor in 
the metastore. 

Steps to reproduce:
1) Create a partitioned+bucketed table in Hive: CREATE TABLE partbucket(a int) 
PARTITIONED BY (b int) CLUSTERED BY (a) INTO 10 BUCKETS ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',';

2) In Hive-CLI issue a desc formatted for the table.

# col_name  data_type   comment 
 
a   int 
 
# Partition Information  
# col_name  data_type   comment 
 
b   int 
 
# Detailed Table Information 
Database:   sparkhivebucket  
Owner:  devbld   
CreateTime: Wed May 10 10:31:07 PDT 2017 
LastAccessTime: UNKNOWN  
Protect Mode:   None 
Retention:  0
Location:   hdfs://localhost:8020/user/hive/warehouse/partbucket 
Table Type: MANAGED_TABLE
Table Parameters:
transient_lastDdlTime   1494437467  
 
# Storage Information
SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe  
 
InputFormat:org.apache.hadoop.mapred.TextInputFormat 
OutputFormat:   
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat   
Compressed: No   
Num Buckets:10   
Bucket Columns: [a]  
Sort Columns:   []   
Storage Desc Params: 
field.delim ,   
serialization.format, 

3) In spark-shell, 

scala> spark.sql("MSCK REPAIR TABLE partbucket")

4) Back to Hive-CLI 

desc formatted partbucket;

# col_name  data_type   comment 
 
a   int 
 
# Partition Information  
# col_name  data_type   comment 
 
b   int 
 
# Detailed Table Information 
Database:   sparkhivebucket  
Owner:  devbld   
CreateTime: Wed May 10 10:31:07 PDT 2017 
LastAccessTime: UNKNOWN  
Protect Mode:   None 
Retention:  0
Location:   
hdfs://localhost:8020/user/hive/warehouse/sparkhivebucket.db/partbucket 
Table Type: MANAGED_TABLE
Table Parameters:
spark.sql.partitionProvider catalog 
transient_lastDdlTime   1494437647  
 
# Storage Information
SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe  
 
InputFormat:org.apache.hadoop.mapred.TextInputFormat 
OutputFormat:   
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat   
Compressed: No   
Num Buckets:-1   
Bucket Columns: []   
Sort Columns:   []   
Storage Desc Params: 
field.delim ,   
serialization.format, 






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19213) FileSourceScanExec uses SparkSession from HadoopFsRelation creation time instead of the active session at execution time

2017-05-10 Thread Robert Kruszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kruszewski closed SPARK-19213.
-
Resolution: Won't Fix

Not really necessary and can lead to confusing results

> FileSourceScanExec uses SparkSession from HadoopFsRelation creation time 
> instead of the active session at execution time
> 
>
> Key: SPARK-19213
> URL: https://issues.apache.org/jira/browse/SPARK-19213
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Robert Kruszewski
>
> If you look at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
>  you'll notice that the sparksession used for execution is the one that was 
> captured from logicalplan. Whereas in other places you have 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
>  and SparkPlan captures active session upon execution in 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52
> From my understanding of the code it looks like we should be using the 
> SparkSession that is currently active, hence take the one from the SparkPlan. 
> However, if you want to share Datasets across SparkSessions that is not 
> enough, since as soon as the Dataset is executed the QueryExecution will have 
> captured the SparkSession at that point. If we want to share Datasets across 
> users we need to make configurations not fixed upon first execution. I 
> consider the 1st part (using the SparkSession from the logical plan) a bug, 
> while the second (using the SparkSession active at runtime) an enhancement, so 
> that sharing across sessions is made easier.
> For example:
> {code}
> val df = spark.read.parquet(...)
> df.count()
> val newSession = spark.newSession()
> SparkSession.setActiveSession(newSession)
> // change a configuration on the new session here (the simplest one to try is 
> disabling vectorized reads)
> val df2 = Dataset.ofRows(newSession, df.logicalPlan) // logical plan still 
> holds reference to original sparksession and changes don't take effect
> {code}
> I suggest that it shouldn't be necessary to create a new dataset for changes 
> to take effect. For most of the plans doing Dataset.ofRows work but this is 
> not the case for hadoopfsrelation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20687) mllib.Matrices.fromBreeze may crash when converting breeze CSCMatrix

2017-05-10 Thread Ignacio Bermudez Corrales (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16004975#comment-16004975
 ] 

Ignacio Bermudez Corrales commented on SPARK-20687:
---

When you perform operations like addition or subtraction between two 
mllib.distributed.BlockMatrices whose blocks are stored as sparse matrices, the 
blocks are operated on using Breeze and then converted back to Matrices. 
Sometimes this conversion back crashes, even though the resulting matrix is 
valid, because Matrices.fromBreeze does not correctly extract the data held in 
the Breeze CSC matrix.

Unfortunately, I'm not able to show code with the actual block matrices, but I 
can show a backtrace. I manually debugged the crashes and found the culprit, 
which is why I posted a much more simplified snippet in the description that 
reproduces the error.

The snippet that causes the crash is in BlockMatrix.scala, lines 374-379:

{code:title=BlockMatrix.scala (blockMap)|borderStyle=solid}
  } else if (b.isEmpty) {
new MatrixBlock((blockRowIndex, blockColIndex), a.head)
  } else {
val result = binMap(a.head.asBreeze, b.head.asBreeze)
new MatrixBlock((blockRowIndex, blockColIndex), 
Matrices.fromBreeze(result)) // <--not able to get results
  }
{code}


The trace after the operation between two Spark block matrices:

{code:text}
Job aborted due to stage failure: Task 0 in stage 31.0 failed 1 times, most 
recent failure: Lost task 0.0 in stage 31.0 (TID 34, localhost, executor 
driver): java.lang.IllegalArgumentException: requirement failed: The last value 
of colPtrs must equal the number of elements. values.length: 28, colPtrs.last: 
15
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.mllib.linalg.SparseMatrix.<init>(Matrices.scala:590)
at org.apache.spark.mllib.linalg.SparseMatrix.<init>(Matrices.scala:618)
at 
org.apache.spark.mllib.linalg.Matrices$.fromBreeze(Matrices.scala:995)
at 
org.apache.spark.mllib.linalg.distributed.BlockMatrix$$anonfun$10.apply(BlockMatrix.scala:378)
at 
org.apache.spark.mllib.linalg.distributed.BlockMatrix$$anonfun$10.apply(BlockMatrix.scala:365)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at 
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
at 
scala.collection.TraversableOnce$class.fold(TraversableOnce.scala:212)
at scala.collection.AbstractIterator.fold(Iterator.scala:1336)
at 
org.apache.spark.rdd.RDD$$anonfun$fold$1$$anonfun$20.apply(RDD.scala:1087)
at 
org.apache.spark.rdd.RDD$$anonfun$fold$1$$anonfun$20.apply(RDD.scala:1087)
at 
org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2119)
at 
org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2119)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
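For reference, a minimal sketch of the idea described above: copy only the entries that the column pointers actually reference, so Breeze's provisional zero slots are ignored. It assumes Breeze's CSCMatrix exposes rows, cols, colPtrs, rowIndices and data, and it illustrates the approach rather than reproducing the code in the pull request:

{code:title=FromBreezeCompactSketch.scala|borderStyle=solid}
import breeze.linalg.{CSCMatrix => BreezeCSC}
import org.apache.spark.mllib.linalg.SparseMatrix

// Build a SparseMatrix from a Breeze CSC matrix while dropping any trailing
// provisional slots, so that values.length always equals colPtrs.last.
def fromBreezeCompact(m: BreezeCSC[Double]): SparseMatrix = {
  val nnz = m.colPtrs.last
  new SparseMatrix(m.rows, m.cols, m.colPtrs, m.rowIndices.take(nnz), m.data.take(nnz))
}
{code}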


> mllib.Matrices.fromBreeze may crash when converting breeze CSCMatrix
> 
>
> Key: SPARK-20687
> URL: https://issues.apache.org/jira/browse/SPARK-20687
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Ignacio Bermudez Corrales
>Priority: Minor
>
> Conversion of Breeze sparse matrices to Matrix is broken when the matrices are 
> the product of certain operations. I think this problem is caused by the update 
> method in Breeze CSCMatrix, which adds provisional zeros to the data for 
> efficiency.
> This bug is serious and may affect at least BlockMatrix addition and 
> subtraction:
> http://stackoverflow.com/questions/33528555/error-thrown-when-using-blockmatrix-add/43883458#43883458
> The following code reproduces the bug (check test("breeze conversion bug")):
> https://github.com/ghoto/spark/blob/test-bug/CSCMatrixBreeze/mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala
> {code:title=MatricesSuite.scala|borderStyle=solid}
>   test("breeze conversion bug") {
> // (2, 0, 0)
> // (2, 0, 0)
> val mat1Brz = Matrices.sparse(2, 3, Array(0, 2, 2, 2), Array(0, 1), 
> Array(2, 

[jira] [Resolved] (SPARK-20689) python doctest leaking bucketed table

2017-05-10 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20689.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> python doctest leaking bucketed table
> -
>
> Key: SPARK-20689
> URL: https://issues.apache.org/jira/browse/SPARK-20689
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 2.3.0
>
>
> When trying to address a build test failure in SPARK-20661, we discovered some 
> tables were unexpectedly left behind, causing R tests to fail. While we changed 
> the R tests to be more resilient, we investigated further to see what was 
> creating those tables.
> It turns out the pyspark doctest is calling saveAsTable without ever dropping 
> the tables. Since we have separate python tests for bucketed tables, and we 
> don't check for results in the doctest, there is really no need to run the doctest 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20680) Spark-sql do not support for void column datatype of view

2017-05-10 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16004911#comment-16004911
 ] 

Herman van Hovell commented on SPARK-20680:
---

[~jiangxb] Do you have time to work on this?

> Spark-sql do not support for void column datatype of view
> -
>
> Key: SPARK-20680
> URL: https://issues.apache.org/jira/browse/SPARK-20680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Lantao Jin
>
> Create a HIVE view:
> {quote}
> hive> create table bad as select 1 x, null z from dual;
> {quote}
> Because there's no type, Hive gives it the VOID type:
> {quote}
> hive> describe bad;
> OK
> x int 
> z void
> {quote}
> In Spark2.0.x, the behaviour to read this view is normal:
> {quote}
> spark-sql> describe bad;
> x   int NULL
> z   voidNULL
> Time taken: 4.431 seconds, Fetched 2 row(s)
> {quote}
> But in Spark2.1.x, it failed with SparkException: Cannot recognize hive type 
> string: void
> {quote}
> spark-sql> describe bad;
> 17/05/09 03:12:08 INFO execution.SparkSqlParser: Parsing command: describe bad
> 17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: int
> 17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: void
> 17/05/09 03:12:08 ERROR thriftserver.SparkSQLDriver: Failed in [describe bad]
> org.apache.spark.SparkException: Cannot recognize hive type string: void
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$fromHiveColumn(HiveClientImpl.scala:789)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
>   
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
>   
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:365)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:361)
> Caused by: org.apache.spark.sql.catalyst.parser.ParseException:
> DataType void() is not supported.(line 1, pos 0)
> == SQL ==  
> void   
> ^^^
> ... 61 more
> org.apache.spark.SparkException: Cannot recognize hive type string: void
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA

2017-05-10 Thread Yuechen Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16004778#comment-16004778
 ] 

Yuechen Chen commented on SPARK-20608:
--

I know what you mean, and that's exactly right. 
But since Spark provides the "yarn.spark.access.namenodes" config, Spark can 
recommend two ways to save data to a remote HDFS:
1) as you said, configure the remote namespace mapping in hdfs-site.xml and just 
submit the job to Spark without any extra SparkConf (maybe partly recommended for HA);
2) configure yarn.spark.access.namenodes=remotehdfs (may not support HA well).
For the second way, if standby namenodes are allowed to be included in 
yarn.spark.access.namenodes, it is an easier way to get HA, even though the Spark 
app may still fail if the namenode fails over during the job that saves to the 
remote HDFS.
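For illustration, a hedged sketch of the getActiveNameNode(...) helper referenced in the description quoted below: probe each configured namenode and keep the first one that answers a metadata call, since a standby rejects it with org.apache.hadoop.ipc.StandbyException. The namenode URIs are placeholders and this is only one possible shape of such a helper:

{code:title=GetActiveNameNodeSketch.scala|borderStyle=solid}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.util.Try

// Return the first namenode URI whose filesystem answers a simple metadata call;
// standby namenodes fail the call, so they are skipped.
def getActiveNameNode(conf: Configuration, nameNodes: Seq[String]): Option[String] =
  nameNodes.find { nn =>
    Try(FileSystem.get(new URI(nn), conf).getFileStatus(new Path("/"))).isSuccess
  }

// e.g. getActiveNameNode(sc.hadoopConfiguration, Seq("hdfs://namenode01", "hdfs://namenode02"))
{code}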

> Standby namenodes should be allowed to included in 
> yarn.spark.access.namenodes to support HDFS HA
> -
>
> Key: SPARK-20608
> URL: https://issues.apache.org/jira/browse/SPARK-20608
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit, YARN
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Yuechen Chen
>Priority: Minor
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> If a Spark application needs to access remote namenodes, 
> yarn.spark.access.namenodes should only be configured in the spark-submit 
> script, and the Spark client (on YARN) would fetch HDFS credentials periodically.
> If a Hadoop cluster is configured for HA, there would be one active namenode 
> and at least one standby namenode. 
> However, if yarn.spark.access.namenodes includes both active and standby 
> namenodes, the Spark application will fail because the standby namenode cannot 
> be accessed by Spark, due to org.apache.hadoop.ipc.StandbyException.
> I think it won't cause any bad effect to configure standby namenodes in 
> yarn.spark.access.namenodes, and my Spark application would be able to sustain 
> the failover of a Hadoop namenode.
> HA Examples:
> Spark-submit script: 
> yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
> Spark Application Codes:
> dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20696) tf-idf document clustering with K-means in Apache Spark putting points into one cluster

2017-05-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20696.
---
Resolution: Invalid

This isn't a good place to ask, as it's almost surely a question about your 
data, not a problem in Spark. user@spark or stackoverflow is better.

> tf-idf document clustering with K-means in Apache Spark putting points into 
> one cluster
> ---
>
> Key: SPARK-20696
> URL: https://issues.apache.org/jira/browse/SPARK-20696
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Nassir
>
> I am trying to do the classic job of clustering text documents by 
> pre-processing, generating tf-idf matrix, and then applying K-means. However, 
> testing this workflow on the classic 20NewsGroup dataset results in most 
> documents being clustered into one cluster. (I have initially tried to 
> cluster all documents from 6 of the 20 groups - so expecting to cluster into 
> 6 clusters).
> I am implementing this in Apache Spark as my purpose is to utilise this 
> technique on millions of documents. Here is the code written in Pyspark on 
> Databricks:
> #declare path to folder containing 6 of 20 news group categories
> path = "/mnt/%s/20news-bydate.tar/20new-bydate-train-lessFolders/*/*" % 
> MOUNT_NAME
> #read all the text files from the 6 folders. Each entity is an entire 
> document. 
> text_files = sc.wholeTextFiles(path).cache()
> #convert rdd to dataframe
> df = text_files.toDF(["filePath", "document"]).cache()
> from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer 
> #tokenize the document text
> tokenizer = Tokenizer(inputCol="document", outputCol="tokens")
> tokenized = tokenizer.transform(df).cache()
> from pyspark.ml.feature import StopWordsRemover
> remover = StopWordsRemover(inputCol="tokens", 
> outputCol="stopWordsRemovedTokens")
> stopWordsRemoved_df = remover.transform(tokenized).cache()
> hashingTF = HashingTF (inputCol="stopWordsRemovedTokens", 
> outputCol="rawFeatures", numFeatures=20)
> tfVectors = hashingTF.transform(stopWordsRemoved_df).cache()
> idf = IDF(inputCol="rawFeatures", outputCol="features", minDocFreq=5)
> idfModel = idf.fit(tfVectors)
> tfIdfVectors = idfModel.transform(tfVectors).cache()
> #note that I have also tried to use normalized data, but get the same result
> from pyspark.ml.feature import Normalizer
> from pyspark.ml.linalg import Vectors
> normalizer = Normalizer(inputCol="features", outputCol="normFeatures")
> l2NormData = normalizer.transform(tfIdfVectors)
> from pyspark.ml.clustering import KMeans
> # Trains a KMeans model.
> kmeans = KMeans().setK(6).setMaxIter(20)
> km_model = kmeans.fit(l2NormData)
> clustersTable = km_model.transform(l2NormData)
> [output showing most documents get clustered into cluster 0][1]
> ID number_of_documents_in_cluster 
> 0 3024 
> 3 5 
> 1 3 
> 5 2
> 2 2 
> 4 1
> As you can see most of my data points get clustered into cluster 0, and I 
> cannot figure out what I am doing wrong as all the tutorials and code I have 
> come across online point to using this method.
> In addition I have also tried normalizing the tf-idf matrix before K-means 
> but that also produces the same result. I know cosine distance is a better 
> measure to use, but I expected using standard K-means in Apache Spark would 
> provide meaningful results.
> Can anyone help with regards to whether I have a bug in my code, or if 
> something is missing in my data clustering pipeline?
> (Question also asked in Stackoverflow before: 
> http://stackoverflow.com/questions/43863373/tf-idf-document-clustering-with-k-means-in-apache-spark-putting-points-into-one)
> Thank you in advance!



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20696) tf-idf document clustering with K-means in Apache Spark putting points into one cluster

2017-05-10 Thread Nassir (JIRA)
Nassir created SPARK-20696:
--

 Summary: tf-idf document clustering with K-means in Apache Spark 
putting points into one cluster
 Key: SPARK-20696
 URL: https://issues.apache.org/jira/browse/SPARK-20696
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.1.0
Reporter: Nassir


I am trying to do the classic job of clustering text documents by 
pre-processing, generating tf-idf matrix, and then applying K-means. However, 
testing this workflow on the classic 20NewsGroup dataset results in most 
documents being clustered into one cluster. (I have initially tried to cluster 
all documents from 6 of the 20 groups - so expecting to cluster into 6 
clusters).

I am implementing this in Apache Spark as my purpose is to utilise this 
technique on millions of documents. Here is the code written in Pyspark on 
Databricks:

#declare path to folder containing 6 of 20 news group categories
path = "/mnt/%s/20news-bydate.tar/20new-bydate-train-lessFolders/*/*" % 
MOUNT_NAME

#read all the text files from the 6 folders. Each entity is an entire 
document. 
text_files = sc.wholeTextFiles(path).cache()

#convert rdd to dataframe
df = text_files.toDF(["filePath", "document"]).cache()

from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer 

#tokenize the document text
tokenizer = Tokenizer(inputCol="document", outputCol="tokens")
tokenized = tokenizer.transform(df).cache()

from pyspark.ml.feature import StopWordsRemover

remover = StopWordsRemover(inputCol="tokens", 
outputCol="stopWordsRemovedTokens")
stopWordsRemoved_df = remover.transform(tokenized).cache()

hashingTF = HashingTF (inputCol="stopWordsRemovedTokens", 
outputCol="rawFeatures", numFeatures=20)
tfVectors = hashingTF.transform(stopWordsRemoved_df).cache()

idf = IDF(inputCol="rawFeatures", outputCol="features", minDocFreq=5)
idfModel = idf.fit(tfVectors)

tfIdfVectors = idfModel.transform(tfVectors).cache()

#note that I have also tried to use normalized data, but get the same result
from pyspark.ml.feature import Normalizer
from pyspark.ml.linalg import Vectors

normalizer = Normalizer(inputCol="features", outputCol="normFeatures")
l2NormData = normalizer.transform(tfIdfVectors)

from pyspark.ml.clustering import KMeans

# Trains a KMeans model.
kmeans = KMeans().setK(6).setMaxIter(20)
km_model = kmeans.fit(l2NormData)

clustersTable = km_model.transform(l2NormData)

Output showing most documents get clustered into cluster 0:

ID number_of_documents_in_cluster 
0 3024 
3 5 
1 3 
5 2
2 2 
4 1

As you can see, most of my data points get clustered into cluster 0, and I 
cannot figure out what I am doing wrong, as all the tutorials and code I have 
come across online point to this method.

In addition, I have also tried normalizing the tf-idf matrix before K-means, 
but that also produces the same result. I know cosine distance is a better 
measure to use, but I expected standard K-means in Apache Spark to provide 
meaningful results.

Can anyone help with regards to whether I have a bug in my code, or if 
something is missing in my data clustering pipeline?

(Question also asked in Stackoverflow before: 
http://stackoverflow.com/questions/43863373/tf-idf-document-clustering-with-k-means-in-apache-spark-putting-points-into-one)

Thank you in advance!
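
For reference, a minimal sketch of the same hashing/IDF/normalization/K-means 
steps, but with a much larger hashed feature space (numFeatures=20 above maps 
every document into only 20 hash buckets) and with K-means reading the 
normalized column; the values chosen here are illustrative assumptions, not a 
confirmed fix for the behaviour reported above:

{code}
# Sketch only: same pipeline steps as in the report, hashing into 2^18 buckets
# instead of 20 and clustering on the L2-normalized column.
# stopWordsRemoved_df comes from the steps shown above.
from pyspark.ml.feature import HashingTF, IDF, Normalizer
from pyspark.ml.clustering import KMeans

hashingTF = HashingTF(inputCol="stopWordsRemovedTokens", outputCol="rawFeatures",
                      numFeatures=1 << 18)
tfVectors = hashingTF.transform(stopWordsRemoved_df).cache()

idf = IDF(inputCol="rawFeatures", outputCol="features", minDocFreq=5)
tfIdfVectors = idf.fit(tfVectors).transform(tfVectors)

normalizer = Normalizer(inputCol="features", outputCol="normFeatures", p=2.0)
l2NormData = normalizer.transform(tfIdfVectors)

kmeans = KMeans(featuresCol="normFeatures", k=6, maxIter=20)
clustersTable = kmeans.fit(l2NormData).transform(l2NormData)
{code}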



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5594) SparkException: Failed to get broadcast (TorrentBroadcast)

2017-05-10 Thread Nick Hryhoriev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16004733#comment-16004733
 ] 

Nick Hryhoriev commented on SPARK-5594:
---

Hi,
I have the same issue, but in Spark 2.1. I can't remove the spark.cleaner.ttl 
configuration because of SPARK-7689: it has already been removed.
What is strange is that the issue appeared after three weeks of operation, and 
it reproduced even after restarting the job.
Env: YARN - EMR 5.3, Spark 2.1. Checkpoint used.
Stack trace
{quote}
2017-05-10 13:50:55 ERROR TaskSetManager:70 - Task 1 in stage 2.0 failed 4 
times; aborting job
2017-05-10 13:50:55 ERROR JobScheduler:91 - Error running job streaming job 
149442305 ms.2
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 2.0 failed 4 times, most recent failure: Lost task 1.3 in stage 2.0 (TID 
46, ip-10-191-116-244.eu-west-1.compute.internal, executor 1): 
java.io.IOException: org.apache.spark.SparkException: Failed to get 
broadcast_141_piece0 of broadcast_141
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1276)
at 
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
at 
org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
at 
org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
at 
org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at 
com.playtech.bit.rtv.Converter$.toEventDetailsRow(HitsUpdater.scala:183)
at 
com.playtech.bit.rtv.HitsUpdater$$anonfun$saveEventDetails$4$$anonfun$11.apply(HitsUpdater.scala:138)
at 
com.playtech.bit.rtv.HitsUpdater$$anonfun$saveEventDetails$4$$anonfun$11.apply(HitsUpdater.scala:137)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
com.datastax.spark.connector.util.CountingIterator.next(CountingIterator.scala:16)
at 
com.datastax.spark.connector.writer.GroupingBatchBuilder.next(GroupingBatchBuilder.scala:106)
at 
com.datastax.spark.connector.writer.GroupingBatchBuilder.next(GroupingBatchBuilder.scala:31)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at 
com.datastax.spark.connector.writer.GroupingBatchBuilder.foreach(GroupingBatchBuilder.scala:31)
at 
com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1.apply(TableWriter.scala:198)
at 
com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1.apply(TableWriter.scala:175)
at 
com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:112)
at 
com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:111)
at 
com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:145)
at 
com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:111)
at 
com.datastax.spark.connector.writer.TableWriter.writeInternal(TableWriter.scala:175)
at 
com.datastax.spark.connector.writer.TableWriter.insert(TableWriter.scala:162)
at 
com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:149)
at 
com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at 
com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to get broadcast_141_piece0 
of broadcast_141
at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:178)
at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:150)
at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:150)
at scala.collection.immutable.List.foreach(List.scala:381)
at 
org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:150)
at 

[jira] [Updated] (SPARK-20622) Parquet partition discovery for non key=value named directories

2017-05-10 Thread Noam Asor (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noam Asor updated SPARK-20622:
--
Description: 
h4. Why
There are cases where traditional M/R jobs and RDD-based Spark jobs write out 
partitioned parquet in 'value only' named directories, e.g. 
{{hdfs:///some/base/path/2017/05/06}}, and not in 'key=value' named directories, 
e.g. {{hdfs:///some/base/path/year=2017/month=05/day=06}}, which prevents users 
from leveraging Spark SQL parquet partition discovery when reading the former 
back.
h4. What
This issue is a proposal for a solution which will allow Spark SQL to discover 
parquet partitions for 'value only' named directories.
h4. How
By introducing a new Spark SQL read option *partitionTemplate*.
*partitionTemplate* is in a Path form and it should include base path followed 
by the missing 'key=' as a template for transforming 'value only' named dirs to 
'key=value' named dirs. In the example above this will look like: 
{{hdfs:///some/base/path/year=/month=/day=/}}.

To simplify the solution, this option should be tied to the *basePath* option, 
meaning that *partitionTemplate* is valid only if *basePath* is also set.
In the end, for the above scenario, this will look something like:
{code}
spark.read
  .option("basePath", "hdfs:///some/base/path")
  .option("partitionTemplate", "hdfs:///some/base/path/year=/month=/day=/")
  .parquet(...)
{code}
which will allow Spark SQL to do parquet partition discovery on the following 
directory tree:
{code}
some
  |--base
   |--path
 |--2016
  |--...
 |--2017
   |--01
   |--02
   |--...
   |--15
   |--...
   |--...
{code}
adding the columns year, month and day, with their respective values, to the 
schema of the resulting DataFrame, as expected.

  was:
h4. Why
There are cases where traditional M/R jobs and RDD based Spark jobs writes out 
partitioned parquet in 'value only' named directories i.e. 
{{hdfs:///some/base/path/2017/05/06}} and not in 'key=value' named directories 
i.e. {{hdfs:///some/base/path/year=2017/month=05/day=06}} which prevents users 
from leveraging Spark SQL parquet partition discovery when reading the former 
back.
h4. What
This issue is a proposal for a solution which will allow Spark SQL to discover 
parquet partitions for 'value only' named directories.
h4. how
By introducing a new Spark SQL read option *partitionTemplate*.
*partitionTemplate* is in a Path form and it should include base path followed 
by the missing 'key=' as a template for transforming 'value only' named dirs to 
'key=value' named dirs. In the example above this will look like: 
{{hdfs:///some/base/path/year=/month=/day=/}}.

To simplify the solution this option should be tied with *basePath* option, 
meaning that *partitionTemplate* option is valid only if *basePath* is set also.
In the end for the above scenario, this will look something like:
{code}
spark.read
  .option("basePath", "hdfs:///some/base/path")
  .option("basePath", "hdfs:///some/base/path/year=/month=/day=/")
  .parquet(...)
{code}
which will allow Spark SQL to do parquet partition discovery on the following 
directory tree:
{code}
some
  |--base
   |--path
 |--2016
  |--...
 |--2017
   |--01
   |--02
   |--...
   |--15
   |--...
   |--...
{code}
adding to the schema of the resulted DataFrame the columns year, month, day and 
their respective values as expected.


> Parquet partition discovery for non key=value named directories
> ---
>
> Key: SPARK-20622
> URL: https://issues.apache.org/jira/browse/SPARK-20622
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Noam Asor
>
> h4. Why
> There are cases where traditional M/R jobs and RDD based Spark jobs writes 
> out partitioned parquet in 'value only' named directories i.e. 
> {{hdfs:///some/base/path/2017/05/06}} and not in 'key=value' named 
> directories i.e. {{hdfs:///some/base/path/year=2017/month=05/day=06}} which 
> prevents users from leveraging Spark SQL parquet partition discovery when 
> reading the former back.
> h4. What
> This issue is a proposal for a solution which will allow Spark SQL to 
> discover parquet partitions for 'value only' named directories.
> h4. How
> By introducing a new Spark SQL read option *partitionTemplate*.
> *partitionTemplate* is in a Path form and it should include base path 
> followed by the missing 'key=' as a template for transforming 'value only' 
> named dirs to 'key=value' named dirs. In the example above this will look 
> 

[jira] [Updated] (SPARK-20695) Running multiple TCP socket streams in Spark Shell causes driver error

2017-05-10 Thread Peter Mead (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Mead updated SPARK-20695:
---

Do you guys ever read the issues? This is a simple spark shell script (dse -u 
 -p yyy spark) which works perfectly fine if I only have a single text 
socket stream, BUT fails immediately as soon as I introduce a second socket text 
stream, even if I never reference it. As for registering classes, I have no idea 
what class 13994 is!! Not very helpful!



 Original message 
From: "Sean Owen (JIRA)"  
Date: 10/05/2017  15:40  (GMT+01:00) 
To: pjm...@blueyonder.co.uk 
Subject: [jira] [Resolved] (SPARK-20695) Running multiple TCP socket streams
  in Spark Shell causes driver error 


 [ 
https://issues.apache.org/jira/browse/SPARK-20695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20695.
---
    Resolution: Invalid

I don't believe that's anything to do with TCP; you are enabling Kryo 
registration but didn't register some class you are serializing. This is a 
question about debugging your app and shouldn't be a Spark JIRA.

You need to read http://spark.apache.org/contributing.html too; you would never 
set Blocker for example.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


> Running multiple TCP socket streams in Spark Shell causes driver error
> --
>
> Key: SPARK-20695
> URL: https://issues.apache.org/jira/browse/SPARK-20695
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Spark Core, Spark Shell, Structured Streaming
>Affects Versions: 2.0.2
> Environment: DataStax DSE apache 3 node cassandra running with 
> analytics on RHEL 7.3 on Hyper-V windows 10 laptop.
>Reporter: Peter Mead
>Priority: Blocker
>
> Whenever I include a second socket stream (lines02) the script errors if I am 
> not trying to process data. If I remove the lines02 the script runs fine!!
> script:
> val s_server01="192.168.1.10"
> val s_port01  = 8801
> val s_port02  = 8802
> import org.apache.spark.streaming._, 
> org.apache.spark.streaming.StreamingContext._
> import scala.util.Random
> import org.apache.spark._
> import org.apache.spark.storage._
> import org.apache.spark.streaming.receiver._
> import java.util.Date;
> import java.text.SimpleDateFormat;
> import java.util.Calendar;
> import sys.process._
> import org.apache.spark.streaming.dstream.ConstantInputDStream
> sc.setLogLevel("ERROR")
> val df2 = new SimpleDateFormat("dd-MM- HH:mm:ss")
> var processed:Long = 0
> var pdate=""
> case class t_row (card_number: String, event_date: Int, event_time: Int, 
> processed: Long, transport_type: String, card_credit: java.lang.Float, 
> transport_location: String, journey_type: Int,  journey_value: 
> java.lang.Float)
> var type2tot = 0
> var type5tot = 0
> var numb=0
> var total_secs:Double = 0
> val red= "\033[0;31m"
> val green  = "\033[0;32m"
> val cyan   = "\033[0;36m"
> val yellow = "\033[0;33m"
> val nocolour = "\033[0;0m"
> var color = ""
> val t_int = 5
> var init = 0
> var tot_cnt:Long = 0
> val ssc = new StreamingContext(sc, Seconds(t_int))
> val lines01 = ssc.socketTextStream(s_server01, s_port01)
> val lines02 = ssc.socketTextStream(s_server01, s_port02)
> // val lines   = lines01.union(lines02)
> val line01 = lines01.foreachRDD( rdd => {
> println("\nline 01")
> if (init == 0) {"clear".!;init = 1}
> val xx=rdd.map(bb => (bb.substring(0,bb.length).split(",")))
> val processed = System.currentTimeMillis
> val bb = xx.map(line => t_row(line(0), line(1).toInt, line(2).toInt, 
> System.currentTimeMillis, line(3), line(4).toFloat, line(5), 
> line(6).toInt, line(7).toFloat ))
> val cnt:Long = bb.count
> bb.saveToCassandra("transport", "card_data_input")
> })
> //val line02 = lines02.foreachRDD( rdd => {
> //println("line 02")
> //if (init == 0) {"clear".!;init = 1}
> //val xx=rdd.map(bb => (bb.substring(0,bb.length).split(",")))
> //xx.collect.foreach(println)
> //val processed = System.currentTimeMillis
> //val bb = xx.map(line => t_row(line(0), line(1).toInt, line(2).toInt, 
> System.currentTimeMillis, line(3), line(4).toFloat, line(5), 
> //line(6).toInt, line(7).toFloat ))
> //val cnt:Long = bb.count
> //bb.saveToCassandra("transport", "card_data_input")
> //})
> ERROR:
> software.kryo.KryoException: Encountered unregistered class ID: 13994
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
> at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
> at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
> 

[jira] [Resolved] (SPARK-20695) Running multiple TCP socket streams in Spark Shell causes driver error

2017-05-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20695.
---
Resolution: Invalid

I don't believe that's anything to do with TCP; you are enabling Kryo 
registration but didn't register some class you are serializing. This is a 
question about debugging your app and shouldn't be a Spark JIRA.

You need to read http://spark.apache.org/contributing.html too; you would never 
set Blocker for example.

> Running multiple TCP socket streams in Spark Shell causes driver error
> --
>
> Key: SPARK-20695
> URL: https://issues.apache.org/jira/browse/SPARK-20695
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Spark Core, Spark Shell, Structured Streaming
>Affects Versions: 2.0.2
> Environment: DataStax DSE apache 3 node cassandra running with 
> analytics on RHEL 7.3 on Hyper-V windows 10 laptop.
>Reporter: Peter Mead
>Priority: Blocker
>
> Whenever I include a second socket stream (lines02) the script errors if I am 
> not trying to process data. If I remove the lines02 the script runs fine!!
> script:
> val s_server01="192.168.1.10"
> val s_port01  = 8801
> val s_port02  = 8802
> import org.apache.spark.streaming._, 
> org.apache.spark.streaming.StreamingContext._
> import scala.util.Random
> import org.apache.spark._
> import org.apache.spark.storage._
> import org.apache.spark.streaming.receiver._
> import java.util.Date;
> import java.text.SimpleDateFormat;
> import java.util.Calendar;
> import sys.process._
> import org.apache.spark.streaming.dstream.ConstantInputDStream
> sc.setLogLevel("ERROR")
> val df2 = new SimpleDateFormat("dd-MM- HH:mm:ss")
> var processed:Long = 0
> var pdate=""
> case class t_row (card_number: String, event_date: Int, event_time: Int, 
> processed: Long, transport_type: String, card_credit: java.lang.Float, 
> transport_location: String, journey_type: Int,  journey_value: 
> java.lang.Float)
> var type2tot = 0
> var type5tot = 0
> var numb=0
> var total_secs:Double = 0
> val red= "\033[0;31m"
> val green  = "\033[0;32m"
> val cyan   = "\033[0;36m"
> val yellow = "\033[0;33m"
> val nocolour = "\033[0;0m"
> var color = ""
> val t_int = 5
> var init = 0
> var tot_cnt:Long = 0
> val ssc = new StreamingContext(sc, Seconds(t_int))
> val lines01 = ssc.socketTextStream(s_server01, s_port01)
> val lines02 = ssc.socketTextStream(s_server01, s_port02)
> // val lines   = lines01.union(lines02)
> val line01 = lines01.foreachRDD( rdd => {
> println("\nline 01")
> if (init == 0) {"clear".!;init = 1}
> val xx=rdd.map(bb => (bb.substring(0,bb.length).split(",")))
> val processed = System.currentTimeMillis
> val bb = xx.map(line => t_row(line(0), line(1).toInt, line(2).toInt, 
> System.currentTimeMillis, line(3), line(4).toFloat, line(5), 
> line(6).toInt, line(7).toFloat ))
> val cnt:Long = bb.count
> bb.saveToCassandra("transport", "card_data_input")
> })
> //val line02 = lines02.foreachRDD( rdd => {
> //println("line 02")
> //if (init == 0) {"clear".!;init = 1}
> //val xx=rdd.map(bb => (bb.substring(0,bb.length).split(",")))
> //xx.collect.foreach(println)
> //val processed = System.currentTimeMillis
> //val bb = xx.map(line => t_row(line(0), line(1).toInt, line(2).toInt, 
> System.currentTimeMillis, line(3), line(4).toFloat, line(5), 
> //line(6).toInt, line(7).toFloat ))
> //val cnt:Long = bb.count
> //bb.saveToCassandra("transport", "card_data_input")
> //})
> ERROR:
> software.kryo.KryoException: Encountered unregistered class ID: 13994
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
> at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
> at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
> at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)...etc



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-20695) Running multiple TCP socket streams in Spark Shell causes driver error

2017-05-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-20695.
-

> Running multiple TCP socket streams in Spark Shell causes driver error
> --
>
> Key: SPARK-20695
> URL: https://issues.apache.org/jira/browse/SPARK-20695
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Spark Core, Spark Shell, Structured Streaming
>Affects Versions: 2.0.2
> Environment: DataStax DSE apache 3 node cassandra running with 
> analytics on RHEL 7.3 on Hyper-V windows 10 laptop.
>Reporter: Peter Mead
>Priority: Blocker
>
> Whenever I include a second socket stream (lines02) the script errors if I am 
> not trying to process data. If I remove the lines02 the script runs fine!!
> script:
> val s_server01="192.168.1.10"
> val s_port01  = 8801
> val s_port02  = 8802
> import org.apache.spark.streaming._, 
> org.apache.spark.streaming.StreamingContext._
> import scala.util.Random
> import org.apache.spark._
> import org.apache.spark.storage._
> import org.apache.spark.streaming.receiver._
> import java.util.Date;
> import java.text.SimpleDateFormat;
> import java.util.Calendar;
> import sys.process._
> import org.apache.spark.streaming.dstream.ConstantInputDStream
> sc.setLogLevel("ERROR")
> val df2 = new SimpleDateFormat("dd-MM- HH:mm:ss")
> var processed:Long = 0
> var pdate=""
> case class t_row (card_number: String, event_date: Int, event_time: Int, 
> processed: Long, transport_type: String, card_credit: java.lang.Float, 
> transport_location: String, journey_type: Int,  journey_value: 
> java.lang.Float)
> var type2tot = 0
> var type5tot = 0
> var numb=0
> var total_secs:Double = 0
> val red= "\033[0;31m"
> val green  = "\033[0;32m"
> val cyan   = "\033[0;36m"
> val yellow = "\033[0;33m"
> val nocolour = "\033[0;0m"
> var color = ""
> val t_int = 5
> var init = 0
> var tot_cnt:Long = 0
> val ssc = new StreamingContext(sc, Seconds(t_int))
> val lines01 = ssc.socketTextStream(s_server01, s_port01)
> val lines02 = ssc.socketTextStream(s_server01, s_port02)
> // val lines   = lines01.union(lines02)
> val line01 = lines01.foreachRDD( rdd => {
> println("\nline 01")
> if (init == 0) {"clear".!;init = 1}
> val xx=rdd.map(bb => (bb.substring(0,bb.length).split(",")))
> val processed = System.currentTimeMillis
> val bb = xx.map(line => t_row(line(0), line(1).toInt, line(2).toInt, 
> System.currentTimeMillis, line(3), line(4).toFloat, line(5), 
> line(6).toInt, line(7).toFloat ))
> val cnt:Long = bb.count
> bb.saveToCassandra("transport", "card_data_input")
> })
> //val line02 = lines02.foreachRDD( rdd => {
> //println("line 02")
> //if (init == 0) {"clear".!;init = 1}
> //val xx=rdd.map(bb => (bb.substring(0,bb.length).split(",")))
> //xx.collect.foreach(println)
> //val processed = System.currentTimeMillis
> //val bb = xx.map(line => t_row(line(0), line(1).toInt, line(2).toInt, 
> System.currentTimeMillis, line(3), line(4).toFloat, line(5), 
> //line(6).toInt, line(7).toFloat ))
> //val cnt:Long = bb.count
> //bb.saveToCassandra("transport", "card_data_input")
> //})
> ERROR:
> software.kryo.KryoException: Encountered unregistered class ID: 13994
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
> at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
> at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
> at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)...etc



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20542) Add an API into Bucketizer that can bin a lot of columns all at once

2017-05-10 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16004688#comment-16004688
 ] 

Barry Becker commented on SPARK-20542:
--

@viirya, your implementation of MultipleBucketizer relies on a withColumns 
method on DataFrame. That method is not in 2.1.1 or 2.2. In which release will 
it be available?

> Add an API into Bucketizer that can bin a lot of columns all at once
> 
>
> Key: SPARK-20542
> URL: https://issues.apache.org/jira/browse/SPARK-20542
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>
> Current ML's Bucketizer can only bin a single column of continuous features. 
> If a dataset has thousands of continuous columns that need to be binned, we 
> end up with thousands of ML stages. That is very inefficient regarding query 
> planning and execution.
> We should have a type of bucketizer that can bin many columns all at once. It 
> would need to accept a list of arrays of split points, each corresponding to 
> a column to bin, but it could make things more efficient by replacing 
> thousands of stages with just one.
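
A hypothetical usage sketch of such a multi-column API in PySpark; the parameter 
names (inputCols, outputCols, splitsArray) and the example data are assumptions 
for illustration and may not match the API that eventually ships:

{code}
# Hypothetical sketch of a multi-column bucketizer; parameter names and the
# example DataFrame are assumptions, not the confirmed API of this issue.
from pyspark.ml.feature import Bucketizer

df = spark.createDataFrame([(0.5, 1.5), (12.0, 2.5)], ["a", "b"])

splitsArray = [
    [-float("inf"), 0.0, 10.0, float("inf")],      # splits for column "a"
    [-float("inf"), 1.0, 2.0, 3.0, float("inf")],  # splits for column "b"
]

bucketizer = Bucketizer(inputCols=["a", "b"],
                        outputCols=["a_bin", "b_bin"],
                        splitsArray=splitsArray)

binned = bucketizer.transform(df)  # a single stage bins both columns at once
{code}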



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20695) Running multiple TCP socket streams in Spark Shell causes driver error

2017-05-10 Thread Peter Mead (JIRA)
Peter Mead created SPARK-20695:
--

 Summary: Running multiple TCP socket streams in Spark Shell causes 
driver error
 Key: SPARK-20695
 URL: https://issues.apache.org/jira/browse/SPARK-20695
 Project: Spark
  Issue Type: Bug
  Components: DStreams, Spark Core, Spark Shell, Structured Streaming
Affects Versions: 2.0.2
 Environment: DataStax DSE apache 3 node cassandra running with 
analytics on RHEL 7.3 on Hyper-V windows 10 laptop.
Reporter: Peter Mead
Priority: Blocker


Whenever I include a second socket stream (lines02), the script errors even if 
I am not trying to process its data. If I remove lines02, the script runs fine!
script:
val s_server01="192.168.1.10"
val s_port01  = 8801
val s_port02  = 8802
import org.apache.spark.streaming._, 
org.apache.spark.streaming.StreamingContext._
import scala.util.Random
import org.apache.spark._
import org.apache.spark.storage._
import org.apache.spark.streaming.receiver._
import java.util.Date;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import sys.process._
import org.apache.spark.streaming.dstream.ConstantInputDStream
sc.setLogLevel("ERROR")
val df2 = new SimpleDateFormat("dd-MM- HH:mm:ss")
var processed:Long = 0
var pdate=""
case class t_row (card_number: String, event_date: Int, event_time: Int, 
processed: Long, transport_type: String, card_credit: java.lang.Float, 
transport_location: String, journey_type: Int,  journey_value: java.lang.Float)
var type2tot = 0
var type5tot = 0
var numb=0
var total_secs:Double = 0
val red= "\033[0;31m"
val green  = "\033[0;32m"
val cyan   = "\033[0;36m"
val yellow = "\033[0;33m"
val nocolour = "\033[0;0m"
var color = ""
val t_int = 5
var init = 0
var tot_cnt:Long = 0
val ssc = new StreamingContext(sc, Seconds(t_int))
val lines01 = ssc.socketTextStream(s_server01, s_port01)
val lines02 = ssc.socketTextStream(s_server01, s_port02)
// val lines   = lines01.union(lines02)

val line01 = lines01.foreachRDD( rdd => {
println("\nline 01")
if (init == 0) {"clear".!;init = 1}
val xx=rdd.map(bb => (bb.substring(0,bb.length).split(",")))
val processed = System.currentTimeMillis
val bb = xx.map(line => t_row(line(0), line(1).toInt, line(2).toInt, 
System.currentTimeMillis, line(3), line(4).toFloat, line(5), 
line(6).toInt, line(7).toFloat ))
val cnt:Long = bb.count
bb.saveToCassandra("transport", "card_data_input")
})

//val line02 = lines02.foreachRDD( rdd => {
//println("line 02")
//if (init == 0) {"clear".!;init = 1}
//val xx=rdd.map(bb => (bb.substring(0,bb.length).split(",")))
//xx.collect.foreach(println)
//val processed = System.currentTimeMillis
//val bb = xx.map(line => t_row(line(0), line(1).toInt, line(2).toInt, 
System.currentTimeMillis, line(3), line(4).toFloat, line(5), 
//line(6).toInt, line(7).toFloat ))
//val cnt:Long = bb.count
//bb.saveToCassandra("transport", "card_data_input")
//})

ERROR:
software.kryo.KryoException: Encountered unregistered class ID: 13994
at 
com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
at 
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
at 
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)...etc




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19447) Fix input metrics for range operator

2017-05-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16004631#comment-16004631
 ] 

Apache Spark commented on SPARK-19447:
--

User 'ala' has created a pull request for this issue:
https://github.com/apache/spark/pull/17939

> Fix input metrics for range operator
> 
>
> Key: SPARK-19447
> URL: https://issues.apache.org/jira/browse/SPARK-19447
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Reynold Xin
>Assignee: Ala Luszczak
> Fix For: 2.2.0
>
>
> Range operator currently does not output any input metrics, and as a result 
> in the SQL UI the number of rows shown is always 0.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.

2017-05-10 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16004607#comment-16004607
 ] 

Zoltan Ivanfi commented on SPARK-12297:
---

bq. It'd be great to consider this more holistically and think about 
alternatives in fixing them

As Ryan mentioned, the Parquet community discussed this timestamp 
incompatibility problem with the aim of avoiding similar problems in the future. 
It was decided that the specification needs to include two separate types with 
well-defined semantics: one for timezone-agnostic (aka. TIMESTAMP WITHOUT 
TIMEZONE) and one for UTC-normalized (aka. TIMESTAMP WITH TIMEZONE) timestamps. 
(Otherwise implementors would be tempted to misuse the single existing type for 
storing timestamps of different semantics, as has already happened with the 
int96 timestamp type).

While this is a nice and clean long-term solution, a short-term fix is also 
desired until the new types become widely supported and/or to allow dealing 
with existing data. The commit in question is a part of this short-term fix and 
it allows getting correct values when reading int96 timestamps, even for data 
written by other components.

bq. it completely changes the behavior of one of the most important data types.

A very important aspect of this fix is that it does not change SparkSQL's 
behavior unless the user sets a table property, so it's a completely safe and 
non-breaking change.

bq. One of the fundamental problem is that Spark treats timestamp as timestamp 
with timezone, whereas impala treats timestamp as timestamp without timezone. 
The parquet storage is only a small piece here.

Indeed, the fix only addresses Parquet timestamps. This, however, is intentional 
and is neither a limitation nor an inconsistency. The problem is in fact 
specific to Parquet. For other file formats (for example CSV or Avro), SparkSQL 
follows timezone-agnostic (TIMESTAMP WITHOUT TIMEZONE) semantics. So using 
UTC-normalized (TIMESTAMP WITH TIMEZONE) semantics in Parquet is not only 
incompatible with Impala but is also inconsistent within SparkSQL itself.

bq. Also this is not just a Parquet issue. The same issue could happen to all 
data formats. It is going to be really confusing to have something that only 
works for Parquet

In fact the current behavior of SparkSQL is different for Parquet than for 
other formats. The fix allows the user to choose a consistent and less 
confusing behaviour instead. It also makes Impala, Hive and SparkSQL compatible 
with each other regarding int96 timestamps.

bq. It seems like the purpose of this patch can be accomplished by just setting 
the session local timezone to UTC?

Unfortunately that would not suffice. The problem has to be addressed in all SQL 
engines. As of today, Hive and Impala already contain the changes that allow 
interoperability using the parquet.mr.int96.write.zone table property:

* Hive:
** 
https://github.com/apache/hive/commit/84fdc1c7c8ff0922aa44f829dbfa9659935c503e
** 
https://github.com/apache/hive/commit/a1cbccb8dad1824f978205a1e93ec01e87ed8ed5
** 
https://github.com/apache/hive/commit/2dfcea5a95b7d623484b8be50755b817fbc91ce0
** 
https://github.com/apache/hive/commit/78e29fc70dacec498c35dc556dd7403e4c9f48fe
* Impala:
** 
https://github.com/apache/incubator-impala/commit/5803a0b0744ddaee6830d4a1bc8dba8d3f2caa26


> Add work-around for Parquet/Hive int96 timestamp bug.
> -
>
> Key: SPARK-12297
> URL: https://issues.apache.org/jira/browse/SPARK-12297
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Ryan Blue
>
> Spark copied Hive's behavior for parquet, but this was inconsistent with 
> other file formats, and inconsistent with Impala (which is the original 
> source of putting a timestamp as an int96 in parquet, I believe).  This made 
> timestamps in parquet act more like timestamps with timezones, while in other 
> file formats, timestamps have no time zone, they are a "floating time".
> The easiest way to see this issue is to write out a table with timestamps in 
> multiple different formats from one timezone, then try to read them back in 
> another timezone.  Eg., here I write out a few timestamps to parquet and 
> textfile hive tables, and also just as a json file, all in the 
> "America/Los_Angeles" timezone:
> {code}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> val tblPrefix = args(0)
> val schema = new StructType().add("ts", TimestampType)
> val rows = sc.parallelize(Seq(
>   "2015-12-31 23:50:59.123",
>   "2015-12-31 22:49:59.123",
>   "2016-01-01 00:39:59.123",
>   "2016-01-01 01:29:59.123"
> ).map { x => Row(java.sql.Timestamp.valueOf(x)) })
> val rawData = spark.createDataFrame(rows, schema).toDF()
> rawData.show()
> Seq("parquet", "textfile").foreach { format =>
>   val 

[jira] [Assigned] (SPARK-20678) Ndv for columns not in filter condition should also be updated

2017-05-10 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-20678:
---

Assignee: Zhenhua Wang

> Ndv for columns not in filter condition should also be updated
> --
>
> Key: SPARK-20678
> URL: https://issues.apache.org/jira/browse/SPARK-20678
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
> Fix For: 2.2.1, 2.3.0
>
>
> In filter estimation, we update column stats for those columns in filter 
> condition. However, if the number of rows decreases after the filter (i.e. 
> the overall selectivity is less than 1), we need to update (scale down) the 
> number of distinct values (NDV) for all columns, whether they are in the 
> filter conditions or not.
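
An illustrative sketch of the scaling described above (the exact formula used by 
Spark's filter estimation may differ): when a filter keeps only a fraction of 
the rows, every column's NDV is scaled down and capped at its previous value.

{code}
# Illustrative only; the exact formula in Spark's FilterEstimation may differ.
import math

def scale_ndv(ndv, selectivity):
    # With fewer rows, a column keeps at most its old NDV, scaled roughly by
    # the selectivity, and never less than 1.
    return max(1, min(ndv, int(math.ceil(ndv * selectivity))))

print(scale_ndv(1000, 0.1))  # -> 100
{code}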



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20694) Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide

2017-05-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20694:


Assignee: Apache Spark

> Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide
> --
>
> Key: SPARK-20694
> URL: https://issues.apache.org/jira/browse/SPARK-20694
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Examples, SQL
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>
> - Spark SQL, DataFrames and Datasets Guide should contain a section about 
> partitioned, sorted and bucketed writes.
> - Bucketing should be removed from Unsupported Hive Functionality.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20678) Ndv for columns not in filter condition should also be updated

2017-05-10 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20678.
-
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.1

Issue resolved by pull request 17918
[https://github.com/apache/spark/pull/17918]

> Ndv for columns not in filter condition should also be updated
> --
>
> Key: SPARK-20678
> URL: https://issues.apache.org/jira/browse/SPARK-20678
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Zhenhua Wang
> Fix For: 2.2.1, 2.3.0
>
>
> In filter estimation, we update column stats for those columns in filter 
> condition. However, if the number of rows decreases after the filter (i.e. 
> the overall selectivity is less than 1), we need to update (scale down) the 
> number of distinct values (NDV) for all columns, whether they are in the 
> filter conditions or not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20694) Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide

2017-05-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20694:


Assignee: (was: Apache Spark)

> Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide
> --
>
> Key: SPARK-20694
> URL: https://issues.apache.org/jira/browse/SPARK-20694
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Examples, SQL
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>
> - Spark SQL, DataFrames and Datasets Guide should contain a section about 
> partitioned, sorted and bucketed writes.
> - Bucketing should be removed from Unsupported Hive Functionality.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20694) Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide

2017-05-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16004523#comment-16004523
 ] 

Apache Spark commented on SPARK-20694:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/17938

> Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide
> --
>
> Key: SPARK-20694
> URL: https://issues.apache.org/jira/browse/SPARK-20694
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Examples, SQL
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>
> - Spark SQL, DataFrames and Datasets Guide should contain a section about 
> partitioned, sorted and bucketed writes.
> - Bucketing should be removed from Unsupported Hive Functionality.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20694) Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide

2017-05-10 Thread Maciej Szymkiewicz (JIRA)
Maciej Szymkiewicz created SPARK-20694:
--

 Summary: Document DataFrameWriter partitionBy, bucketBy and sortBy 
in SQL guide
 Key: SPARK-20694
 URL: https://issues.apache.org/jira/browse/SPARK-20694
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, Examples, SQL
Affects Versions: 2.2.0
Reporter: Maciej Szymkiewicz


- Spark SQL, DataFrames and Datasets Guide should contain a section about 
partitioned, sorted and bucketed writes.
- Bucketing should be removed from Unsupported Hive Functionality.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20688) correctly check analysis for scalar sub-queries

2017-05-10 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20688.
-
   Resolution: Fixed
Fix Version/s: 2.1.2
   2.3.0
   2.2.1

Issue resolved by pull request 17930
[https://github.com/apache/spark/pull/17930]

> correctly check analysis for scalar sub-queries
> ---
>
> Key: SPARK-20688
> URL: https://issues.apache.org/jira/browse/SPARK-20688
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.2.1, 2.3.0, 2.1.2
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20693) Kafka+SSL: path for security related files needs to be different for driver and executors

2017-05-10 Thread JIRA
Daniel Lanza García created SPARK-20693:
---

 Summary: Kafka+SSL: path for security related files needs to be 
different for driver and executors 
 Key: SPARK-20693
 URL: https://issues.apache.org/jira/browse/SPARK-20693
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.1.1
Reporter: Daniel Lanza García
Priority: Critical


When consuming from or producing to Kafka with security enabled (SSL), you need 
to refer to security-related files (the keystore and truststore) in the 
configuration of the KafkaDirectStream.

In YARN-client mode, you need to distribute these files, which can be achieved 
with the --files argument. Now, what is the path to these files, taking into 
account that both the driver and the executors interact with Kafka?

When these files are accessed from the driver, you need to provide the local 
path to them. When they are accessed from the executors, you need to provide 
the name of the file that has been distributed with --files.

The problem is that you can only configure one value for the path to these 
files.

The configuration proposed here: 
http://www.opencore.com/blog/2017/1/spark-2-0-streaming-from-ssl-kafka-with-hdp-2-4/
works because both paths are the same (./truststore.jks). But if they differ, I 
do not think there is a way to configure Kafka+SSL.
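
A small sketch of the conflict described above: the property names are standard 
Kafka consumer settings, the paths are placeholders, and the stream-creation 
call is omitted because it depends on the Kafka integration in use.

{code}
# Sketch of the conflict: one parameter map is shared by the driver and the
# executors, but each side needs a different truststore path.
kafka_params = {
    "security.protocol": "SSL",
    # The driver would need the local path the file was submitted from, e.g.:
    #   "ssl.truststore.location": "/home/user/truststore.jks",
    # while the executors need the name of the file shipped with --files:
    "ssl.truststore.location": "./truststore.jks",
    "ssl.truststore.password": "changeit",
}
{code}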



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20692) unknowing delay in event timeline

2017-05-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20692.
---
Resolution: Invalid

This isn't appropriate for a JIRA. Questions should go to 
u...@spark.apache.org. There are lots of reasons for delays between jobs; many 
things could be busy on the driver. It's not actionable.

> unknowing delay in event timeline
> -
>
> Key: SPARK-20692
> URL: https://issues.apache.org/jira/browse/SPARK-20692
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.6.2
> Environment: Spark 1.6.1 + kafka 0.8.2
>Reporter: Zhiwen Sun
> Attachments: screenshot-1.png
>
>
> Spark streaming job with 1s interval.
> Process time of micro batch suddenly became to 4s while is is usually 0.4s .
> When we check where the time spent, we find a unknown delay in job. 
> There is no executor computing or shuffle reading. It is about 4s blank in 
> event timeline, 
> event timeline snapshot is in attachment.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20393) Strengthen Spark to prevent XSS vulnerabilities

2017-05-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20393.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 17686
[https://github.com/apache/spark/pull/17686]

> Strengthen Spark to prevent XSS vulnerabilities
> ---
>
> Key: SPARK-20393
> URL: https://issues.apache.org/jira/browse/SPARK-20393
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.2, 2.0.2, 2.1.0
>Reporter: Nicholas Marion
>Assignee: Nicholas Marion
>Priority: Minor
>  Labels: security
> Fix For: 2.3.0
>
>
> Using IBM Security AppScan Standard, we discovered several easy to recreate 
> MHTML cross site scripting vulnerabilities in the Apache Spark Web GUI 
> application and these vulnerabilities were found to exist in Spark version 
> 1.5.2 and 2.0.2, the two levels we initially tested. Cross-site scripting 
> attack is not really an attack on the Spark server as much as an attack on 
> the end user, taking advantage of their trust in the Spark server to get them 
> to click on a URL like the ones in the examples below.  So whether the user 
> could or could not change lots of stuff on the Spark server is not the key 
> point.  It is an attack on the user themselves.  If they click the link the 
> script could run in their browser and compromise their device.  Once the 
> browser is compromised it could submit Spark requests but it also might not.
> https://blogs.technet.microsoft.com/srd/2011/01/28/more-information-about-the-mhtml-script-injection-vulnerability/
> {quote}
> Request: GET 
> /app/?appId=Content-Type:%20multipart/related;%20boundary=_AppScan%0d%0a--
> _AppScan%0d%0aContent-Location:foo%0d%0aContent-Transfer-
> Encoding:base64%0d%0a%0d%0aPGh0bWw%2bPHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw%2b%0d%0a
> HTTP/1.1
> Excerpt from response: No running application with ID 
> Content-Type: multipart/related;
> boundary=_AppScan
> --_AppScan
> Content-Location:foo
> Content-Transfer-Encoding:base64
> PGh0bWw+PHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw+
> 
> Result: In the above payload the BASE64 data decodes as:
> alert("XSS")
> Request: GET 
> /history/app-20161012202114-0038/stages/stage?id=1=0=Content-
> Type:%20multipart/related;%20boundary=_AppScan%0d%0a--_AppScan%0d%0aContent-
> Location:foo%0d%0aContent-Transfer-
> Encoding:base64%0d%0a%0d%0aPGh0bWw%2bPHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw%2b%0d%0a
> k.pageSize=100 HTTP/1.1
> Excerpt from response: Content-Type: multipart/related;
> boundary=_AppScan
> --_AppScan
> Content-Location:foo
> Content-Transfer-Encoding:base64
> PGh0bWw+PHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw+
> Result: In the above payload the BASE64 data decodes as:
> alert("XSS")
> Request: GET /log?appId=app-20170113131903-=0=Content-
> Type:%20multipart/related;%20boundary=_AppScan%0d%0a--_AppScan%0d%0aContent-
> Location:foo%0d%0aContent-Transfer-
> Encoding:base64%0d%0a%0d%0aPGh0bWw%2bPHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw%2b%0d%0a
> eLength=0 HTTP/1.1
> Excerpt from response:  Bytes 0-0 of 0 of 
> /u/nmarion/Spark_2.0.2.0/Spark-DK/work/app-20170113131903-/0/Content-
> Type: multipart/related; boundary=_AppScan
> --_AppScan
> Content-Location:foo
> Content-Transfer-Encoding:base64
> PGh0bWw+PHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw+
> Result: In the above payload the BASE64 data decodes as:
> alert("XSS")
> {quote}
> security@apache was notified and recommended a PR.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20393) Strengthen Spark to prevent XSS vulnerabilities

2017-05-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-20393:
-

Assignee: Nicholas Marion
  Labels: security  (was: newbie security)
Priority: Minor  (was: Major)

> Strengthen Spark to prevent XSS vulnerabilities
> ---
>
> Key: SPARK-20393
> URL: https://issues.apache.org/jira/browse/SPARK-20393
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.2, 2.0.2, 2.1.0
>Reporter: Nicholas Marion
>Assignee: Nicholas Marion
>Priority: Minor
>  Labels: security
> Fix For: 2.3.0
>
>
> Using IBM Security AppScan Standard, we discovered several easy to recreate 
> MHTML cross site scripting vulnerabilities in the Apache Spark Web GUI 
> application and these vulnerabilities were found to exist in Spark version 
> 1.5.2 and 2.0.2, the two levels we initially tested. Cross-site scripting 
> attack is not really an attack on the Spark server as much as an attack on 
> the end user, taking advantage of their trust in the Spark server to get them 
> to click on a URL like the ones in the examples below.  So whether the user 
> could or could not change lots of stuff on the Spark server is not the key 
> point.  It is an attack on the user themselves.  If they click the link the 
> script could run in their browser and compromise their device.  Once the 
> browser is compromised it could submit Spark requests but it also might not.
> https://blogs.technet.microsoft.com/srd/2011/01/28/more-information-about-the-mhtml-script-injection-vulnerability/
> {quote}
> Request: GET 
> /app/?appId=Content-Type:%20multipart/related;%20boundary=_AppScan%0d%0a--
> _AppScan%0d%0aContent-Location:foo%0d%0aContent-Transfer-
> Encoding:base64%0d%0a%0d%0aPGh0bWw%2bPHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw%2b%0d%0a
> HTTP/1.1
> Excerpt from response: No running application with ID 
> Content-Type: multipart/related;
> boundary=_AppScan
> --_AppScan
> Content-Location:foo
> Content-Transfer-Encoding:base64
> PGh0bWw+PHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw+
> 
> Result: In the above payload the BASE64 data decodes as:
> alert("XSS")
> Request: GET 
> /history/app-20161012202114-0038/stages/stage?id=1=0=Content-
> Type:%20multipart/related;%20boundary=_AppScan%0d%0a--_AppScan%0d%0aContent-
> Location:foo%0d%0aContent-Transfer-
> Encoding:base64%0d%0a%0d%0aPGh0bWw%2bPHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw%2b%0d%0a
> k.pageSize=100 HTTP/1.1
> Excerpt from response: Content-Type: multipart/related;
> boundary=_AppScan
> --_AppScan
> Content-Location:foo
> Content-Transfer-Encoding:base64
> PGh0bWw+PHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw+
> Result: In the above payload the BASE64 data decodes as:
> alert("XSS")
> Request: GET /log?appId=app-20170113131903-=0=Content-
> Type:%20multipart/related;%20boundary=_AppScan%0d%0a--_AppScan%0d%0aContent-
> Location:foo%0d%0aContent-Transfer-
> Encoding:base64%0d%0a%0d%0aPGh0bWw%2bPHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw%2b%0d%0a
> eLength=0 HTTP/1.1
> Excerpt from response:  Bytes 0-0 of 0 of 
> /u/nmarion/Spark_2.0.2.0/Spark-DK/work/app-20170113131903-/0/Content-
> Type: multipart/related; boundary=_AppScan
> --_AppScan
> Content-Location:foo
> Content-Transfer-Encoding:base64
> PGh0bWw+PHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw+
> Result: In the above payload the BASE64 data decodes as:
> alert("XSS")
> {quote}
> security@apache was notified and recommended a PR.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20692) unknowing delay in event timeline

2017-05-10 Thread Zhiwen Sun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhiwen Sun updated SPARK-20692:
---
Description: 
Spark streaming job with 1s interval.

Process time of a micro batch suddenly became 4s, while it is usually 0.4s.

When we check where the time is spent, we find an unknown delay in the job.

There is no executor computing or shuffle reading. There is about a 4s blank in 
the event timeline.

An event timeline snapshot is in the attachment.


  was:
Spark streaming job with 1s interval.

Process time of micro batch suddenly became to 4s while is is usually 0.4s .

when we check where time spent, we find a unknown delay in job. there is no 
executor computing or shuffle reading. About 4s blank in event timeline, 

event timeline snapshot is in attachment.



> unknowing delay in event timeline
> -
>
> Key: SPARK-20692
> URL: https://issues.apache.org/jira/browse/SPARK-20692
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.6.2
> Environment: Spark 1.6.1 + kafka 0.8.2
>Reporter: Zhiwen Sun
> Attachments: screenshot-1.png
>
>
> Spark streaming job with 1s interval.
> Process time of micro batch suddenly became to 4s while is is usually 0.4s .
> When we check where the time spent, we find a unknown delay in job. 
> There is no executor computing or shuffle reading. It is about 4s blank in 
> event timeline, 
> event timeline snapshot is in attachment.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20692) unknowing delay in event timeline

2017-05-10 Thread Zhiwen Sun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhiwen Sun updated SPARK-20692:
---
Description: 
Spark streaming job with 1s interval.

Process time of micro batch suddenly became to 4s while is is usually 0.4s .

when we check where time spent, we find a unknown delay in job. there is no 
executor computing or shuffle reading. About 4s blank in event timeline, 

event timeline snapshot is in attachment.


  was:
Spark streaming job with 1s interval.

Process time of micro batch suddenly became to 4s while is is usually 0.4s .

when we check where time spent, we find a unknown delay in job. there is no 
executor computing or shuffle reading. About 4s blank in event timeline, 



> unknowing delay in event timeline
> -
>
> Key: SPARK-20692
> URL: https://issues.apache.org/jira/browse/SPARK-20692
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.6.2
> Environment: Spark 1.6.1 + kafka 0.8.2
>Reporter: Zhiwen Sun
> Attachments: screenshot-1.png
>
>
> Spark streaming job with 1s interval.
> Processing time of a micro batch suddenly became 4s while it is usually 0.4s.
> When we checked where the time was spent, we found an unknown delay in the job: there is no
> executor computation or shuffle reading, just about a 4s blank in the event timeline.
> An event timeline snapshot is attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20692) unknown delay in event timeline

2017-05-10 Thread Zhiwen Sun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhiwen Sun updated SPARK-20692:
---
Attachment: screenshot-1.png

> unknown delay in event timeline
> -
>
> Key: SPARK-20692
> URL: https://issues.apache.org/jira/browse/SPARK-20692
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.6.2
> Environment: Spark 1.6.1 + kafka 0.8.2
>Reporter: Zhiwen Sun
> Attachments: screenshot-1.png
>
>
> Spark streaming job with 1s interval.
> Processing time of a micro batch suddenly became 4s while it is usually 0.4s.
> When we checked where the time was spent, we found an unknown delay in the job: there is no
> executor computation or shuffle reading, just about a 4s blank in the event timeline.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20692) unknown delay in event timeline

2017-05-10 Thread Zhiwen Sun (JIRA)
Zhiwen Sun created SPARK-20692:
--

 Summary: unknown delay in event timeline
 Key: SPARK-20692
 URL: https://issues.apache.org/jira/browse/SPARK-20692
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 1.6.2
 Environment: Spark 1.6.1 + kafka 0.8.2
Reporter: Zhiwen Sun


Spark streaming job with 1s interval.

Processing time of a micro batch suddenly became 4s while it is usually 0.4s.

When we checked where the time was spent, we found an unknown delay in the job: there is no
executor computation or shuffle reading, just about a 4s blank in the event timeline.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20433) Security issue with jackson-databind

2017-05-10 Thread David Hodeffi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16004417#comment-16004417
 ] 

David Hodeffi commented on SPARK-20433:
---

Did you upgrade json4s? Version 3.2.1 is not compatible with 3.5.

> Security issue with jackson-databind
> 
>
> Key: SPARK-20433
> URL: https://issues.apache.org/jira/browse/SPARK-20433
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrew Ash
>  Labels: security
>
> There was a security vulnerability recently reported to the upstream 
> jackson-databind project at 
> https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix 
> released.
> From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the 
> first fixed versions in their respective 2.X branches, and versions in the 
> 2.6.X line and earlier remain vulnerable.
> Right now Spark master branch is on 2.6.5: 
> https://github.com/apache/spark/blob/master/pom.xml#L164
> and Hadoop branch-2.7 is on 2.2.3: 
> https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71
> and Hadoop branch-3.0.0-alpha2 is on 2.7.8: 
> https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74
> We should try to find a way to get on a patched version of 
> jackson-databind for the Spark 2.2.0 release.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20691) Difference between Storage Memory as seen internally and in web UI

2017-05-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16004413#comment-16004413
 ] 

Sean Owen commented on SPARK-20691:
---

I think that we have, unfortunately, not consistently differentiated between 
megabytes (MB, 10^6 = 1,000,000 bytes) and mebibytes (MiB, 2^20 = 1,048,576 bytes). 
The JavaScript function actually computes MB correctly; the rest of Spark, in 
{{Utils.bytesToString}}, does not. 

Where the user supplies a value like "700m", it's interpreted as mebibytes. 
That's fine, and at least unambiguous and not wrong.

I don't think we want to change behavior, but we can make display and log 
output more consistent and correct.

The inconsistency is bad, of course. The simple change is to change the 
JavaScript, and at least update its strings to say "MiB" etc. correctly. I think 
{{Utils.bytesToString}} should be changed too. If you want to be a hero, you 
might look for anywhere "MB" or "KB" occurs in the code and see whether the value is 
being computed correctly.
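
For illustration, here is a minimal Scala sketch of the two conversions side by side. The {{format}} helper and the byte count are assumptions of mine for this example, not Spark code; the point is only that the same number of bytes reads as 956.6 in base 1000 and as 912.3 in base 1024.

{code}
// Format a byte count with a given base: 1000 gives true MB/GB,
// 1024 gives MiB/GiB (what Utils.bytesToString effectively computes).
def format(bytes: Double, base: Double): String = {
  val i = math.floor(math.log(bytes) / math.log(base))
  f"${bytes / math.pow(base, i)}%1.1f"
}

val storageMemoryBytes = 956615680.0          // hypothetical value for illustration
println(format(storageMemoryBytes, 1000.0))   // 956.6 -- what the web UI labels "MB"
println(format(storageMemoryBytes, 1024.0))   // 912.3 -- what the logs label "MB" (really MiB)
{code}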

> Difference between Storage Memory as seen internally and in web UI
> --
>
> Key: SPARK-20691
> URL: https://issues.apache.org/jira/browse/SPARK-20691
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>
> I set Major priority as it's visible to a user.
> There's a difference between the size of Storage Memory as managed internally 
> and as displayed to a user in the web UI.
> I found it while answering [How does web UI calculate Storage Memory (in 
> Executors tab)?|http://stackoverflow.com/q/43801062/1305344] on StackOverflow.
> In short (quoting the main parts), when you start a Spark app (say 
> spark-shell) you see 912.3 MB RAM for Storage Memory:
> {code}
> $ ./bin/spark-shell --conf spark.driver.memory=2g
> ...
> 17/05/07 15:20:50 INFO BlockManagerMasterEndpoint: Registering block manager 
> 192.168.1.8:57177 with 912.3 MB RAM, BlockManagerId(driver, 192.168.1.8, 
> 57177, None)
> {code}
> but in the web UI you'll see 956.6 MB due to the way the custom JavaScript 
> function {{formatBytes}} in 
> [utils.js|https://github.com/apache/spark/blob/master/core/src/main/resources/org/apache/spark/ui/static/utils.js#L40-L48]
>  calculates the value. That translates to the following Scala code:
> {code}
> def formatBytes(bytes: Double) = {
>   val k = 1000
>   val i = math.floor(math.log(bytes) / math.log(k))
>   val maxMemoryWebUI = bytes / math.pow(k, i)
>   f"$maxMemoryWebUI%1.1f"
> }
> scala> println(formatBytes(maxMemory))
> 956.6
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20433) Security issue with jackson-databind

2017-05-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20433.
---
Resolution: Not A Problem

OK, I don't see evidence that this isn't just an instance of the problem that 
was already patched. Updating past Jackson 2.6 is a separate and more complex 
issue, otherwise we'd just update to be safe.

> Security issue with jackson-databind
> 
>
> Key: SPARK-20433
> URL: https://issues.apache.org/jira/browse/SPARK-20433
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrew Ash
>  Labels: security
>
> There was a security vulnerability recently reported to the upstream 
> jackson-databind project at 
> https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix 
> released.
> From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the 
> first fixed versions in their respectful 2.X branches, and versions in the 
> 2.6.X line and earlier remain vulnerable.
> Right now Spark master branch is on 2.6.5: 
> https://github.com/apache/spark/blob/master/pom.xml#L164
> and Hadoop branch-2.7 is on 2.2.3: 
> https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71
> and Hadoop branch-3.0.0-alpha2 is on 2.7.8: 
> https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74
> We should try to find to find a way to get on a patched version of 
> jackson-bind for the Spark 2.2.0 release.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20687) mllib.Matrices.fromBreeze may crash when converting breeze CSCMatrix

2017-05-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20687:
--
Priority: Minor  (was: Critical)

> mllib.Matrices.fromBreeze may crash when converting breeze CSCMatrix
> 
>
> Key: SPARK-20687
> URL: https://issues.apache.org/jira/browse/SPARK-20687
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Ignacio Bermudez Corrales
>Priority: Minor
>
> Conversion of Breeze sparse matrices to Matrix is broken when the matrices are the 
> product of certain operations. I think this problem is caused by the update 
> method in Breeze CSCMatrix, which adds provisional zeros to the data for 
> efficiency.
> This bug is serious and may affect at least BlockMatrix addition and 
> subtraction:
> http://stackoverflow.com/questions/33528555/error-thrown-when-using-blockmatrix-add/43883458#43883458
> The following code reproduces the bug (check test("breeze conversion bug")):
> https://github.com/ghoto/spark/blob/test-bug/CSCMatrixBreeze/mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala
> {code:title=MatricesSuite.scala|borderStyle=solid}
>   test("breeze conversion bug") {
> // (2, 0, 0)
> // (2, 0, 0)
> val mat1Brz = Matrices.sparse(2, 3, Array(0, 2, 2, 2), Array(0, 1), 
> Array(2, 2)).asBreeze
> // (2, 1E-15, 1E-15)
> // (2, 1E-15, 1E-15)
> val mat2Brz = Matrices.sparse(2, 3, Array(0, 2, 4, 6), Array(0, 0, 0, 1, 
> 1, 1), Array(2, 1E-15, 1E-15, 2, 1E-15, 1E-15)).asBreeze
> // The following shouldn't break
> val t01 = mat1Brz - mat1Brz
> val t02 = mat2Brz - mat2Brz
> val t02Brz = Matrices.fromBreeze(t02)
> val t01Brz = Matrices.fromBreeze(t01)
> val t1Brz = mat1Brz - mat2Brz
> val t2Brz = mat2Brz - mat1Brz
> // The following ones should break
> val t1 = Matrices.fromBreeze(t1Brz)
> val t2 = Matrices.fromBreeze(t2Brz)
>   }
> {code}
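
As a hedged illustration only (this is my own workaround sketch, not the patch that resolves the issue), one way to sidestep the provisional zeros is to rebuild the Spark matrix from the Breeze matrix's active entries instead of reusing its raw column-pointer and data arrays:

{code}
import breeze.linalg.{CSCMatrix => BreezeCSC}
import org.apache.spark.mllib.linalg.{Matrix, SparseMatrix}

// fromBreezeSafe is a hypothetical helper name, not part of the Spark API.
def fromBreezeSafe(m: BreezeCSC[Double]): Matrix = {
  // activeIterator yields ((row, col), value); dropping the explicit zeros
  // that Breeze's update path may have inserted keeps the rebuilt matrix
  // consistent with the values actually stored.
  val entries = m.activeIterator.collect {
    case ((i, j), v) if v != 0.0 => (i, j, v)
  }.toSeq
  SparseMatrix.fromCOO(m.rows, m.cols, entries)
}
{code}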



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20687) mllib.Matrices.fromBreeze may crash when converting breeze CSCMatrix

2017-05-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16004372#comment-16004372
 ] 

Sean Owen commented on SPARK-20687:
---

This doesn't say what the problem is. What goes wrong?


> mllib.Matrices.fromBreeze may crash when converting breeze CSCMatrix
> 
>
> Key: SPARK-20687
> URL: https://issues.apache.org/jira/browse/SPARK-20687
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Ignacio Bermudez Corrales
>Priority: Critical
>
> Conversion of Breeze sparse matrices to Matrix is broken when the matrices are the 
> product of certain operations. I think this problem is caused by the update 
> method in Breeze CSCMatrix, which adds provisional zeros to the data for 
> efficiency.
> This bug is serious and may affect at least BlockMatrix addition and 
> subtraction:
> http://stackoverflow.com/questions/33528555/error-thrown-when-using-blockmatrix-add/43883458#43883458
> The following code reproduces the bug (check test("breeze conversion bug")):
> https://github.com/ghoto/spark/blob/test-bug/CSCMatrixBreeze/mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala
> {code:title=MatricesSuite.scala|borderStyle=solid}
>   test("breeze conversion bug") {
> // (2, 0, 0)
> // (2, 0, 0)
> val mat1Brz = Matrices.sparse(2, 3, Array(0, 2, 2, 2), Array(0, 1), 
> Array(2, 2)).asBreeze
> // (2, 1E-15, 1E-15)
> // (2, 1E-15, 1E-15)
> val mat2Brz = Matrices.sparse(2, 3, Array(0, 2, 4, 6), Array(0, 0, 0, 1, 
> 1, 1), Array(2, 1E-15, 1E-15, 2, 1E-15, 1E-15)).asBreeze
> // The following shouldn't break
> val t01 = mat1Brz - mat1Brz
> val t02 = mat2Brz - mat2Brz
> val t02Brz = Matrices.fromBreeze(t02)
> val t01Brz = Matrices.fromBreeze(t01)
> val t1Brz = mat1Brz - mat2Brz
> val t2Brz = mat2Brz - mat1Brz
> // The following ones should break
> val t1 = Matrices.fromBreeze(t1Brz)
> val t2 = Matrices.fromBreeze(t2Brz)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20637) MappedRDD, FilteredRDD, etc. are still referenced in code comments

2017-05-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-20637:
-

Assignee: Michael Mior

> MappedRDD, FilteredRDD, etc. are still referenced in code comments
> --
>
> Key: SPARK-20637
> URL: https://issues.apache.org/jira/browse/SPARK-20637
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Michael Mior
>Assignee: Michael Mior
>Priority: Trivial
> Fix For: 2.2.1
>
>
> There are only a couple instances of this, but it would be helpful to have 
> things updated to current references.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20637) MappedRDD, FilteredRDD, etc. are still referenced in code comments

2017-05-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20637.
---
   Resolution: Fixed
Fix Version/s: 2.2.1

Issue resolved by pull request 17900
[https://github.com/apache/spark/pull/17900]

> MappedRDD, FilteredRDD, etc. are still referenced in code comments
> --
>
> Key: SPARK-20637
> URL: https://issues.apache.org/jira/browse/SPARK-20637
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Michael Mior
>Priority: Trivial
> Fix For: 2.2.1
>
>
> There are only a couple instances of this, but it would be helpful to have 
> things updated to current references.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20630) Thread Dump link available in Executors tab irrespective of spark.ui.threadDumpsEnabled

2017-05-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20630.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17904
[https://github.com/apache/spark/pull/17904]

> Thread Dump link available in Executors tab irrespective of 
> spark.ui.threadDumpsEnabled
> ---
>
> Key: SPARK-20630
> URL: https://issues.apache.org/jira/browse/SPARK-20630
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
> Fix For: 2.2.0
>
> Attachments: spark-webui-executors-threadDump.png
>
>
> Irrespective of the {{spark.ui.threadDumpsEnabled}} property, the web UI's Executors 
> page displays a *Thread Dump* column with an active link (that does nothing, 
> though).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20630) Thread Dump link available in Executors tab irrespective of spark.ui.threadDumpsEnabled

2017-05-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-20630:
-

 Assignee: Alex Bozarth
Affects Version/s: (was: 2.3.0)
   2.2.0
Fix Version/s: (was: 2.2.0)
   2.2.1

> Thread Dump link available in Executors tab irrespective of 
> spark.ui.threadDumpsEnabled
> ---
>
> Key: SPARK-20630
> URL: https://issues.apache.org/jira/browse/SPARK-20630
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Assignee: Alex Bozarth
>Priority: Minor
> Fix For: 2.2.1
>
> Attachments: spark-webui-executors-threadDump.png
>
>
> Irrespective of the {{spark.ui.threadDumpsEnabled}} property, the web UI's Executors 
> page displays a *Thread Dump* column with an active link (that does nothing, 
> though).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20631) LogisticRegression._checkThresholdConsistency should use values not Params

2017-05-10 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-20631:
---

Assignee: Maciej Szymkiewicz

> LogisticRegression._checkThresholdConsistency should use values not Params
> --
>
> Key: SPARK-20631
> URL: https://issues.apache.org/jira/browse/SPARK-20631
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
> Fix For: 2.0.3, 2.1.2, 2.2.0
>
>
> {{_checkThresholdConsistency}} incorrectly uses {{getParam}} in an attempt to 
> access the {{threshold}} and {{thresholds}} values. Furthermore, it calls it with a 
> {{Param}} instead of a {{str}}:
> {code}
> >>> from pyspark.ml.classification import LogisticRegression
> >>> lr = LogisticRegression(threshold=0.25, thresholds=[0.75, 0.25])
> Traceback (most recent call last):
> ...
> TypeError: getattr(): attribute name must be string
> {code}
> Finally, the exception message uses {{join}} without converting the values to {{str}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20631) LogisticRegression._checkThresholdConsistency should use values not Params

2017-05-10 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-20631.
-
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.2
   2.0.3

> LogisticRegression._checkThresholdConsistency should use values not Params
> --
>
> Key: SPARK-20631
> URL: https://issues.apache.org/jira/browse/SPARK-20631
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
> Fix For: 2.0.3, 2.1.2, 2.2.0
>
>
> {{_checkThresholdConsistency}} incorrectly uses {{getParam}} in an attempt to 
> access the {{threshold}} and {{thresholds}} values. Furthermore, it calls it with a 
> {{Param}} instead of a {{str}}:
> {code}
> >>> from pyspark.ml.classification import LogisticRegression
> >>> lr = LogisticRegression(threshold=0.25, thresholds=[0.75, 0.25])
> Traceback (most recent call last):
> ...
> TypeError: getattr(): attribute name must be string
> {code}
> Finally, the exception message uses {{join}} without converting the values to {{str}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20638) Optimize the CartesianRDD to reduce repeatedly data fetching

2017-05-10 Thread Teng Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16004342#comment-16004342
 ] 

Teng Jiang commented on SPARK-20638:


A further 88x improvement is shown in my PR comment:
https://github.com/apache/spark/pull/17898#issuecomment-299818394

> Optimize the CartesianRDD to reduce repeatedly data fetching
> 
>
> Key: SPARK-20638
> URL: https://issues.apache.org/jira/browse/SPARK-20638
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Teng Jiang
>
> In CartesianRDD, group each iterator into multiple groups. Thus, in the second 
> iteration, the data will be fetched (num of data)/groupSize times, rather 
> than (num of data) times.
> The test results are:
> Test Environment : 3 workers, each has 10 cores, 30G memory, 1 executor
> Test data : users : 480,189, each is a 10-dim vector, and items : 17770, each 
> is a 10-dim vector.
> With default CartesianRDD, cartesian time is 2420.7s.
> With this proposal, cartesian time is 45.3s, 50x faster than the original method.
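
To make the grouping idea concrete, here is a minimal sketch over two partition iterators. {{blockCartesian}} and {{makeRight}} are hypothetical names for this example, not the code in the pull request:

{code}
// Pair every left element with every right element, but read the left
// partition in blocks of `groupSize`, re-materializing the right-hand
// iterator once per block: the right-hand data is then fetched about
// (num of left elements) / groupSize times instead of once per element.
def blockCartesian[A, B](left: Iterator[A],
                         makeRight: () => Iterator[B],
                         groupSize: Int): Iterator[(A, B)] = {
  left.grouped(groupSize).flatMap { leftBlock =>
    makeRight().flatMap { b =>
      leftBlock.iterator.map(a => (a, b))
    }
  }
}
{code}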



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


