[jira] [Commented] (SPARK-10590) Spark with YARN build is broken

2016-10-12 Thread Nirman Narang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570963#comment-15570963
 ] 

Nirman Narang commented on SPARK-10590:
---

[~rxin], Should this ticket be reopened?

> Spark with YARN build is broken
> ---
>
> Key: SPARK-10590
> URL: https://issues.apache.org/jira/browse/SPARK-10590
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0
> Environment: CentOS 6.5
> Oracle JDK 1.7.0_75
> Maven 3.3.3
> Hadoop 2.6.0
> Spark 1.5.0
>Reporter: Kevin Tsai
>
> Hi, after upgrading to v1.5.0 and trying to build it, it shows:
> [ERROR] missing or invalid dependency detected while loading class file 
> 'WebUI.class'
> It was working on Spark 1.4.1.
> Build command: mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive 
> -Phive-thriftserver -Dscala-2.11 -DskipTests clean package
> Hope it helps.
> Regards,
> Kevin






[jira] [Updated] (SPARK-17867) Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name

2016-10-12 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17867:

Assignee: Liang-Chi Hsieh

> Dataset.dropDuplicates (i.e. distinct) should consider the columns with same 
> column name
> 
>
> Key: SPARK-17867
> URL: https://issues.apache.org/jira/browse/SPARK-17867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.1.0
>
>
> In Dataset.dropDuplicates we find and use the first resolved attribute from 
> the output with the given column name. When there is more than one column with 
> the same name, the other columns are put into the aggregation columns instead 
> of the grouping columns. We should fix this.
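
As an illustration (not taken from the ticket), two columns with the same name 
typically arise from a join on identically named keys; the hypothetical Scala 
sketch below assumes such a join and shows the call that should treat both 
same-named columns as grouping keys:

{code}
// Hypothetical reproduction sketch: `left` and `right` both expose a column
// named "id", so the joined DataFrame carries two "id" columns.
import spark.implicits._  // assumes an existing SparkSession named `spark`

val left  = Seq((1, "a"), (1, "b"), (2, "c")).toDF("id", "x")
val right = Seq((1, "p"), (2, "q")).toDF("id", "y")
val joined = left.join(right, left("id") === right("id"))

// Both "id" columns should act as grouping (deduplication) keys here,
// not be folded into the aggregation columns.
joined.dropDuplicates("id").show()
{code}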






[jira] [Resolved] (SPARK-17867) Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name

2016-10-12 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17867.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15427
[https://github.com/apache/spark/pull/15427]

> Dataset.dropDuplicates (i.e. distinct) should consider the columns with same 
> column name
> 
>
> Key: SPARK-17867
> URL: https://issues.apache.org/jira/browse/SPARK-17867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.1.0
>
>
> In Dataset.dropDuplicates we find and use the first resolved attribute from 
> the output with the given column name. When there is more than one column with 
> the same name, the other columns are put into the aggregation columns instead 
> of the grouping columns. We should fix this.






[jira] [Updated] (SPARK-17866) Dataset.dropDuplicates (i.e., distinct) should not change the output of child plan

2016-10-12 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17866:

Assignee: Liang-Chi Hsieh

> Dataset.dropDuplicates (i.e., distinct) should not change the output of child 
> plan
> --
>
> Key: SPARK-17866
> URL: https://issues.apache.org/jira/browse/SPARK-17866
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.1.0
>
>
> We currently create a new Alias with a new exprId in Dataset.dropDuplicates. 
> However, this causes a problem when we want to select the columns as follows:
> {code}
> val ds = Seq(("a", 1), ("a", 2), ("b", 1), ("a", 1)).toDS()
> // ds("_2") will cause analysis exception
> ds.dropDuplicates("_1").select(ds("_1").as[String], ds("_2").as[Int])
> {code}






[jira] [Resolved] (SPARK-17866) Dataset.dropDuplicates (i.e., distinct) should not change the output of child plan

2016-10-12 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17866.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15427
[https://github.com/apache/spark/pull/15427]

> Dataset.dropDuplicates (i.e., distinct) should not change the output of child 
> plan
> --
>
> Key: SPARK-17866
> URL: https://issues.apache.org/jira/browse/SPARK-17866
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.1.0
>
>
> We currently create a new Alias with a new exprId in Dataset.dropDuplicates. 
> However, this causes a problem when we want to select the columns as follows:
> {code}
> val ds = Seq(("a", 1), ("a", 2), ("b", 1), ("a", 1)).toDS()
> // ds("_2") will cause analysis exception
> ds.dropDuplicates("_1").select(ds("_1").as[String], ds("_2").as[Int])
> {code}






[jira] [Created] (SPARK-17902) collect() ignores stringsAsFactors

2016-10-12 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-17902:
--

 Summary: collect() ignores stringsAsFactors
 Key: SPARK-17902
 URL: https://issues.apache.org/jira/browse/SPARK-17902
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.0.1
Reporter: Hossein Falaki


The `collect()` function signature includes an optional flag named 
`stringsAsFactors`. It seems to be completely ignored.

{code}
str(collect(createDataFrame(iris), stringsAsFactors = TRUE))
{code}






[jira] [Created] (SPARK-17901) NettyRpcEndpointRef: Error sending message and Caused by: java.util.ConcurrentModificationException

2016-10-12 Thread Harish (JIRA)
Harish created SPARK-17901:
--

 Summary: NettyRpcEndpointRef: Error sending message and Caused by: 
java.util.ConcurrentModificationException
 Key: SPARK-17901
 URL: https://issues.apache.org/jira/browse/SPARK-17901
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.0.1
Reporter: Harish


I have two data frames, one with 10K rows and 10,000 columns and another with 
4M rows and 50 columns. I joined them and am trying to find the mean of the 
merged data set.

I calculated the mean in a lambda using the Python mean() function; I can't 
write it in PySpark due to the 64KB code-size limit issue.

After calculating the mean I did rdd.take(2), which works. But creating the DF 
from the RDD and calling DF.show has been in progress for more than 2 hours (I 
stopped the process) with the message below (102 GB, 6 cores per node -- 10 
nodes total + 1 master).



16/10/13 04:36:31 WARN NettyRpcEndpointRef: Error sending message [message = 
Heartbeat(9,[Lscala.Tuple2;@68eedafb,BlockManagerId(9, 10.63.136.103, 35729))] 
in 1 attempts
org.apache.spark.SparkException: Exception thrown in awaitResult
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at 
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1877)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.ConcurrentModificationException
at java.util.ArrayList.writeObject(ArrayList.java:766)
at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at 
java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
at 
java.util.Collections$SynchronizedCollection.writeObject(Collections.java:2081)
at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at 

[jira] [Comment Edited] (SPARK-16599) java.util.NoSuchElementException: None.get at at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)

2016-10-12 Thread Shivansh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570826#comment-15570826
 ] 

Shivansh edited comment on SPARK-16599 at 10/13/16 4:53 AM:


[~srowen], [~joshrosen]: Any updates on this issue? We are also facing the 
same issue here. Can you please let us know what the exact problem is? We are 
using Cassandra as a store.


was (Author: shiv4nsh):
[~srowen], [~joshrosen]: Any updates on this issue? We are also facing the 
same issue here. Can you please let us know what the exact problem is?

> java.util.NoSuchElementException: None.get  at at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
> --
>
> Key: SPARK-16599
> URL: https://issues.apache.org/jira/browse/SPARK-16599
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: centos 6.7   spark 2.0
>Reporter: binde
>
> Running a Spark job with Spark 2.0 gives this error message:
> Job aborted due to stage failure: Task 0 in stage 821.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 821.0 (TID 1480, e103): 
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
>   at 
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)






[jira] [Assigned] (SPARK-17899) add a debug mode to keep raw table properties in HiveExternalCatalog

2016-10-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17899:


Assignee: Wenchen Fan  (was: Apache Spark)

> add a debug mode to keep raw table properties in HiveExternalCatalog
> 
>
> Key: SPARK-17899
> URL: https://issues.apache.org/jira/browse/SPARK-17899
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Assigned] (SPARK-17899) add a debug mode to keep raw table properties in HiveExternalCatalog

2016-10-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17899:


Assignee: Apache Spark  (was: Wenchen Fan)

> add a debug mode to keep raw table properties in HiveExternalCatalog
> 
>
> Key: SPARK-17899
> URL: https://issues.apache.org/jira/browse/SPARK-17899
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>







[jira] [Commented] (SPARK-17899) add a debug mode to keep raw table properties in HiveExternalCatalog

2016-10-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570837#comment-15570837
 ] 

Apache Spark commented on SPARK-17899:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/15458

> add a debug mode to keep raw table properties in HiveExternalCatalog
> 
>
> Key: SPARK-17899
> URL: https://issues.apache.org/jira/browse/SPARK-17899
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Commented] (SPARK-16599) java.util.NoSuchElementException: None.get at at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)

2016-10-12 Thread Shivansh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570826#comment-15570826
 ] 

Shivansh commented on SPARK-16599:
--

[~srowen], [~joshrosen]: Any updates on this issue? We are also facing the 
same issue here. Can you please let us know what the exact problem is?

> java.util.NoSuchElementException: None.get  at at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
> --
>
> Key: SPARK-16599
> URL: https://issues.apache.org/jira/browse/SPARK-16599
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: centos 6.7   spark 2.0
>Reporter: binde
>
> Running a Spark job with Spark 2.0 gives this error message:
> Job aborted due to stage failure: Task 0 in stage 821.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 821.0 (TID 1480, e103): 
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
>   at 
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)






[jira] [Resolved] (SPARK-17876) Write StructuredStreaming WAL to a stream instead of materializing all at once

2016-10-12 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-17876.
--
   Resolution: Fixed
 Assignee: Burak Yavuz
Fix Version/s: 2.1.0
   2.0.2

> Write StructuredStreaming WAL to a stream instead of materializing all at once
> --
>
> Key: SPARK-17876
> URL: https://issues.apache.org/jira/browse/SPARK-17876
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.0.2, 2.1.0
>
>
> The CompactibleFileStreamLog materializes the whole metadata log in memory as 
> a String. This can cause issues when there are lots of files that are being 
> committed, especially during a compaction batch. 
> You may come across stacktraces that look like:
> {code}
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> at java.lang.StringCoding.encode(StringCoding.java:350)
> at java.lang.String.getBytes(String.java:941)
> at 
> org.apache.spark.sql.execution.streaming.FileStreamSinkLog.serialize(FileStreamSinkLog.scala:127)
> at 
> {code}
> The safer way is to write to an output stream so that we don't have to 
> materialize a huge string.
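
A minimal sketch of the idea (an assumption about the shape of the fix, not the 
actual CompactibleFileStreamLog code): serialize each log entry directly to the 
output stream instead of first building one large String:

{code}
import java.io.OutputStream
import java.nio.charset.StandardCharsets

// Hypothetical helper: write each serialized entry straight to the stream, so
// memory usage stays proportional to one entry rather than the whole batch.
def serializeTo(entries: Seq[String], out: OutputStream): Unit = {
  entries.foreach { entry =>
    out.write(entry.getBytes(StandardCharsets.UTF_8))
    out.write('\n')
  }
}
{code}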






[jira] [Updated] (SPARK-17900) Mark the following Spark SQL APIs as stable

2016-10-12 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17900:

Description: 
Mark the following stable:

Dataset/DataFrame
- functions, since 1.3
- ColumnName, since 1.3
- DataFrameNaFunctions, since 1.3.1
- DataFrameStatFunctions, since 1.4
- UserDefinedFunction, since 1.3
- UserDefinedAggregateFunction, since 1.5
- Window and WindowSpec, since 1.4

Data sources:
- DataSourceRegister, since 1.5
- RelationProvider, since 1.3
- SchemaRelationProvider, since 1.3
- CreatableRelationProvider, since 1.3
- BaseRelation, since 1.3
- TableScan, since 1.3
- PrunedScan, since 1.3
- PrunedFilteredScan, since 1.3
- InsertableRelation, since 1.3


Keep the following experimental / evolving:

Data sources:
- CatalystScan (tied to internal logical plans so it is not stable by 
definition)

Structured streaming:
- all classes (introduced new in 2.0 and will likely change)

Dataset typed operations (introduced in 1.6 and 2.0 and might change, although 
probability is low)
- all typed methods on Dataset
- KeyValueGroupedDataset
- o.a.s.sql.expressions.javalang.typed
- o.a.s.sql.expressions.scalalang.typed
- methods that return typed Dataset in SparkSession


  was:

Mark the following stable:

Dataset/DataFrame
- functions, since 1.3
- ColumnName, since 1.3
- DataFrameNaFunctions, since 1.3.1
- DataFrameStatFunctions, since 1.4
- UserDefinedFunction, since 1.3
- UserDefinedAggregateFunction, since 1.5
- Window and WindowSpec, since 1.4

Data sources:
- DataSourceRegister, since 1.5
- RelationProvider, since 1.3
- SchemaRelationProvider, since 1.3
- CreatableRelationProvider, since 1.3
- BaseRelation, since 1.3
- TableScan, since 1.3
- PrunedScan, since 1.3
- PrunedFilteredScan, since 1.3
- InsertableRelation, since 1.3


Keep the following experimental / evolving:

Data sources:
- CatalystScan

Structured streaming:
- all classes remain Experimental / Evolving

Dataset typed operations:
- all typed methods on Dataset
- KeyValueGroupedDataset
- o.a.s.sql.expressions.javalang.typed
- o.a.s.sql.expressions.scalalang.typed
- methods that return typed Dataset in SparkSession



> Mark the following Spark SQL APIs as stable
> ---
>
> Key: SPARK-17900
> URL: https://issues.apache.org/jira/browse/SPARK-17900
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Mark the following stable:
> Dataset/DataFrame
> - functions, since 1.3
> - ColumnName, since 1.3
> - DataFrameNaFunctions, since 1.3.1
> - DataFrameStatFunctions, since 1.4
> - UserDefinedFunction, since 1.3
> - UserDefinedAggregateFunction, since 1.5
> - Window and WindowSpec, since 1.4
> Data sources:
> - DataSourceRegister, since 1.5
> - RelationProvider, since 1.3
> - SchemaRelationProvider, since 1.3
> - CreatableRelationProvider, since 1.3
> - BaseRelation, since 1.3
> - TableScan, since 1.3
> - PrunedScan, since 1.3
> - PrunedFilteredScan, since 1.3
> - InsertableRelation, since 1.3
> Keep the following experimental / evolving:
> Data sources:
> - CatalystScan (tied to internal logical plans so it is not stable by 
> definition)
> Structured streaming:
> - all classes (introduced new in 2.0 and will likely change)
> Dataset typed operations (introduced in 1.6 and 2.0 and might change, 
> although probability is low)
> - all typed methods on Dataset
> - KeyValueGroupedDataset
> - o.a.s.sql.expressions.javalang.typed
> - o.a.s.sql.expressions.scalalang.typed
> - methods that return typed Dataset in SparkSession






[jira] [Created] (SPARK-17899) add a debug mode to keep raw table properties in HiveExternalCatalog

2016-10-12 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-17899:
---

 Summary: add a debug mode to keep raw table properties in 
HiveExternalCatalog
 Key: SPARK-17899
 URL: https://issues.apache.org/jira/browse/SPARK-17899
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan
Assignee: Wenchen Fan









[jira] [Created] (SPARK-17900) Mark the following Spark SQL APIs as stable

2016-10-12 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-17900:
---

 Summary: Mark the following Spark SQL APIs as stable
 Key: SPARK-17900
 URL: https://issues.apache.org/jira/browse/SPARK-17900
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin



Mark the following stable:

Dataset/DataFrame
- functions, since 1.3
- ColumnName, since 1.3
- DataFrameNaFunctions, since 1.3.1
- DataFrameStatFunctions, since 1.4
- UserDefinedFunction, since 1.3
- UserDefinedAggregateFunction, since 1.5
- Window and WindowSpec, since 1.4

Data sources:
- DataSourceRegister, since 1.5
- RelationProvider, since 1.3
- SchemaRelationProvider, since 1.3
- CreatableRelationProvider, since 1.3
- BaseRelation, since 1.3
- TableScan, since 1.3
- PrunedScan, since 1.3
- PrunedFilteredScan, since 1.3
- InsertableRelation, since 1.3


Keep the following experimental / evolving:

Data sources:
- CatalystScan

Structured streaming:
- all classes remain Experimental / Evolving

Dataset typed operations:
- all typed methods on Dataset
- KeyValueGroupedDataset
- o.a.s.sql.expressions.javalang.typed
- o.a.s.sql.expressions.scalalang.typed
- methods that return typed Dataset in SparkSession







[jira] [Commented] (SPARK-17830) Annotate Spark SQL public APIs with InterfaceStability

2016-10-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570726#comment-15570726
 ] 

Apache Spark commented on SPARK-17830:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/15457

> Annotate Spark SQL public APIs with InterfaceStability
> --
>
> Key: SPARK-17830
> URL: https://issues.apache.org/jira/browse/SPARK-17830
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.1.0
>
>
> I'm going to use this as an exercise to see how well the interface stability 
> annotation works.
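
For context, a minimal sketch of how such annotations are applied (an 
illustration only; the classes below are made up, and it assumes the 
org.apache.spark.annotation.InterfaceStability annotations introduced for this 
work):

{code}
import org.apache.spark.annotation.InterfaceStability

// Stable: not expected to break compatibility except at major releases.
@InterfaceStability.Stable
class MyStablePublicApi {
  def run(): Unit = ()
}

// Evolving: may change from one feature release to the next.
@InterfaceStability.Evolving
class MyEvolvingApi {
  def run(): Unit = ()
}
{code}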






[jira] [Updated] (SPARK-16827) Stop reporting spill metrics as shuffle metrics

2016-10-12 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16827:

Fix Version/s: 2.0.2

> Stop reporting spill metrics as shuffle metrics
> ---
>
> Key: SPARK-16827
> URL: https://issues.apache.org/jira/browse/SPARK-16827
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>Assignee: Brian Cho
>  Labels: performance
> Fix For: 2.0.2, 2.1.0
>
>
> One of our Hive jobs looks like this:
> {code}
>  SELECT  userid
>  FROM  table1 a
>  JOIN table2 b
>   ON   a.ds = '2016-07-15'
>   AND  b.ds = '2016-07-15'
>   AND  a.source_id = b.id
> {code}
> After upgrading to Spark 2.0 the job is significantly slower. Digging a little 
> into it, we found that one of the stages produces an excessive amount of 
> shuffle data. Please note that this is a regression from Spark 1.6: Stage 2 of 
> the job, which used to produce 32KB of shuffle data with 1.6, now produces more 
> than 400GB with Spark 2.0. We also tried turning off whole-stage code 
> generation, but that did not help.
> PS - Even if the intermediate shuffle data size is huge, the job still 
> produces accurate output.






[jira] [Commented] (SPARK-16827) Stop reporting spill metrics as shuffle metrics

2016-10-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570716#comment-15570716
 ] 

Reynold Xin commented on SPARK-16827:
-

See https://github.com/apache/spark/pull/15455/files 

The spill there is not tracked.

Also I don't think it is that useful to track spill time, because you can't 
actually measure that accurately due to pipelining.


> Stop reporting spill metrics as shuffle metrics
> ---
>
> Key: SPARK-16827
> URL: https://issues.apache.org/jira/browse/SPARK-16827
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>Assignee: Brian Cho
>  Labels: performance
> Fix For: 2.0.2, 2.1.0
>
>
> One of our Hive jobs looks like this:
> {code}
>  SELECT  userid
>  FROM  table1 a
>  JOIN table2 b
>   ON   a.ds = '2016-07-15'
>   AND  b.ds = '2016-07-15'
>   AND  a.source_id = b.id
> {code}
> After upgrading to Spark 2.0 the job is significantly slower. Digging a little 
> into it, we found that one of the stages produces an excessive amount of 
> shuffle data. Please note that this is a regression from Spark 1.6: Stage 2 of 
> the job, which used to produce 32KB of shuffle data with 1.6, now produces more 
> than 400GB with Spark 2.0. We also tried turning off whole-stage code 
> generation, but that did not help.
> PS - Even if the intermediate shuffle data size is huge, the job still 
> produces accurate output.






[jira] [Created] (SPARK-17898) --repositories needs username and password

2016-10-12 Thread lichenglin (JIRA)
lichenglin created SPARK-17898:
--

 Summary: --repositories  needs username and password
 Key: SPARK-17898
 URL: https://issues.apache.org/jira/browse/SPARK-17898
 Project: Spark
  Issue Type: Wish
Affects Versions: 2.0.1
Reporter: lichenglin


My private repositories need a username and password to access.

I can't find a way to declare the username and password when submitting a Spark 
application:
{code}
bin/spark-submit --repositories   
http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ --packages 
com.databricks:spark-csv_2.10:1.2.0   --class org.apache.spark.examples.SparkPi 
  --master local[8]   examples/jars/spark-examples_2.11-2.0.1.jar   100
{code}

The repository http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ needs a 
username and password.






[jira] [Resolved] (SPARK-17835) Optimize NaiveBayes mllib wrapper to eliminate extra pass on data

2016-10-12 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-17835.
-
   Resolution: Fixed
 Assignee: Yanbo Liang
Fix Version/s: 2.1.0

> Optimize NaiveBayes mllib wrapper to eliminate extra pass on data
> -
>
> Key: SPARK-17835
> URL: https://issues.apache.org/jira/browse/SPARK-17835
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
> Fix For: 2.1.0
>
>
> SPARK-14077 copied the {{NaiveBayes}} implementation from mllib to ml and 
> left mllib as a wrapper. However, there are some differences between how mllib 
> and ml handle {{labels}}:
> * mllib allows input labels such as {-1, +1}; however, ml assumes the input 
> labels are in the range [0, numClasses).
> * mllib {{NaiveBayesModel}} exposes {{labels}}, but ml does not, due to the 
> assumption mentioned above.
> During the copy in SPARK-14077, we use {{val labels = 
> data.map(_.label).distinct().collect().sorted}} to get the distinct labels 
> first, and then encode the labels for training. This involves an extra Spark 
> job compared with the original implementation. Since {{NaiveBayes}} only does 
> one pass of aggregation during training, adding another one seems less 
> efficient. We can get the labels in a single pass along with the 
> {{NaiveBayes}} training and send them to the MLlib side.
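
One possible shape of the single-pass idea (purely an illustrative assumption, 
not the PR's code): if the aggregation is keyed by label, the distinct labels 
fall out of the same pass, with no separate distinct() job over the full data.

{code}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical helper: one pass keyed by label; the keys of the (small) result
// are the distinct labels, so no extra distinct()/collect() over the full RDD.
def labelsInOnePass(data: RDD[LabeledPoint]): Array[Double] = {
  val countsByLabel = data.map(p => (p.label, 1L)).reduceByKey(_ + _)
  countsByLabel.keys.collect().sorted
}
{code}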






[jira] [Resolved] (SPARK-17745) Update Python API for NB to support weighted instances

2016-10-12 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-17745.
-
   Resolution: Fixed
 Assignee: Weichen Xu
Fix Version/s: 2.1.0

> Update Python API for NB to support weighted instances
> --
>
> Key: SPARK-17745
> URL: https://issues.apache.org/jira/browse/SPARK-17745
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: zhengruifeng
>Assignee: Weichen Xu
>Priority: Minor
> Fix For: 2.1.0
>
>
> Update python wrapper of NB to support weighted instances






[jira] [Updated] (SPARK-17888) Memory leak in streaming driver when use SparkSQL in Streaming

2016-10-12 Thread weilin.chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

weilin.chen updated SPARK-17888:

Summary: Memory leak in streaming driver when use SparkSQL in Streaming  
(was: Mseory leak in streaming driver when use SparkSQL in Streaming)

> Memory leak in streaming driver when use SparkSQL in Streaming
> --
>
> Key: SPARK-17888
> URL: https://issues.apache.org/jira/browse/SPARK-17888
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.2
> Environment: scala 2.10.4
> java 1.7.0_71
>Reporter: weilin.chen
>  Labels: leak, memory
>
> Hi,
>   I have a small Spark 1.5 program that receives data from a publisher via 
> Spark Streaming and processes the received data with Spark SQL. As time went 
> by I found a memory leak in the driver, so I updated to Spark 1.6.2, but the 
> situation did not change.
> here is the code:
> {quote}
>  val lines = ssc.receiverStream(new RReceiver("10.0.200.15", 6380, 
> "subresult"))
> val jsonf = 
> lines.map(JSON.parseFull(_)).map(_.get.asInstanceOf[scala.collection.immutable.Map[String,
>  Any]])
> val logs = jsonf.map(data => LogStashV1(data("message").toString, 
> data("path").toString, data("host").toString, 
> data("lineno").toString.toDouble, data("timestamp").toString))
> logs.foreachRDD( rdd => { 
>  import sqc.implicits._
>  rdd.toDF.registerTempTable("logstash")
>  val sqlreport0 = sqc.sql("SELECT message, COUNT(message) AS host_c, 
> SUM(lineno) AS line_a FROM logstash WHERE path = '/var/log/system.log' AND 
> lineno > 70 GROUP BY message ORDER BY host_c DESC LIMIT 100")
>  sqlreport0.map(t => AlertMsg(t(0).toString, t(1).toString.toInt, 
> t(2).toString.toDouble)).collect().foreach(println)
> sqlreport0.map(t => AlertMsg(t(0).toString, t(1).toString.toInt, 
> t(2).toString.toDouble)).collect().foreach(println)
>  {quote}
> jmap information:
>  {quote}
>  num #instances #bytes  class name
> --
>1: 34819   72711952  [B
>2:   2297557   66010656  [C
>3:   2296294   55111056  java.lang.String
>4:   1063491   42539640  org.apache.spark.scheduler.AccumulableInfo
>5:   1251001   40032032  
> scala.collection.immutable.HashMap$HashMap1
>6:   1394364   33464736  java.lang.Long
>7:   1102516   26460384  scala.collection.immutable.$colon$colon
>8:   1058202   25396848  
> org.apache.spark.sql.execution.metric.LongSQLMetricValue
>9:   1266499   20263984  scala.Some
>   10:124052   15889104  
>   11:124052   15269568  
>   12: 11350   12082432  
>   13: 11350   11692880  
>   14: 96682   10828384  org.apache.spark.executor.TaskMetrics
>   15:2334819505896  [Lscala.collection.immutable.HashMap;
>   16: 966826961104  org.apache.spark.scheduler.TaskInfo
>   17:  95896433312  
>   18:2330005592000  
> scala.collection.immutable.HashMap$HashTrieMap
>   19: 962005387200  
> org.apache.spark.executor.ShuffleReadMetrics
>   20:1133813628192  scala.collection.mutable.ListBuffer
>   21:  72522891792  
>   22:1170732809752  scala.collection.mutable.DefaultEntry
>  {quote}






[jira] [Created] (SPARK-17897) not isnotnull is converted to the always false condition isnotnull && not isnotnull

2016-10-12 Thread Jordan Halterman (JIRA)
Jordan Halterman created SPARK-17897:


 Summary: not isnotnull is converted to the always false condition 
isnotnull && not isnotnull
 Key: SPARK-17897
 URL: https://issues.apache.org/jira/browse/SPARK-17897
 Project: Spark
  Issue Type: Bug
  Components: Optimizer
Affects Versions: 2.0.1, 2.0.0
Reporter: Jordan Halterman


When a logical plan is built containing the following somewhat nonsensical 
filter:
{{Filter (NOT isnotnull($f0#212))}}

During optimization the filter is converted into a condition that will always 
fail:
{{Filter (isnotnull($f0#212) && NOT isnotnull($f0#212))}}

This appears to be caused by the following check for {{NullIntolerant}}:

https://github.com/apache/spark/commit/df68beb85de59bb6d35b2a8a3b85dbc447798bf5#diff-203ac90583cebe29a92c1d812c07f102R63

That check recurses through the expression and extracts nested {{IsNotNull}} 
calls, converting them to {{IsNotNull}} calls on the attribute at the root level:

https://github.com/apache/spark/commit/df68beb85de59bb6d35b2a8a3b85dbc447798bf5#diff-203ac90583cebe29a92c1d812c07f102R49

This results in the nonsensical condition above.
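
For reference, a sketch of how such a filter can be expressed through the 
DataFrame API (illustration only; whether this exact snippet reproduces the 
reported plan is an assumption, since the ticket's {{$f0#212}} attribute 
suggests the plan was built programmatically):

{code}
import org.apache.spark.sql.functions.not
import spark.implicits._  // assumes an existing SparkSession named `spark`

val df = Seq(Some(1), Option.empty[Int]).toDF("f0")

// Builds Filter(NOT isnotnull(f0)); per the report, the optimizer's
// null-intolerant inference then adds an isnotnull(f0) conjunct, producing an
// always-false predicate.
df.filter(not($"f0".isNotNull)).explain(true)
{code}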






[jira] [Assigned] (SPARK-17686) Propose to print Scala version in "spark-submit --version" command

2016-10-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17686:


Assignee: Apache Spark

> Propose to print Scala version in "spark-submit --version" command
> --
>
> Key: SPARK-17686
> URL: https://issues.apache.org/jira/browse/SPARK-17686
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Reporter: Saisai Shao
>Assignee: Apache Spark
>Priority: Minor
>
> Currently we have a use case that needs to upload different jars to Spark 
> according to the Scala version. For now, only after launching a Spark 
> application can we know which version of Scala it depends on. This makes it 
> hard for services that need to support different Scala + Spark versions to 
> pick the right jars. 
> So here I propose printing out the Scala version along with the Spark version 
> in "spark-submit --version", so that users can leverage this output to make 
> the choice without needing to launch an application.
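
As a side note (an illustrative sketch, not the pull request's implementation): 
the Scala version a Spark build runs against is available at runtime, so a 
--version handler could print it next to the Spark version, e.g.:

{code}
// Both values exist at runtime in a Spark build; the exact wording below is
// made up, but this is the kind of output the proposal asks for.
println(s"Spark ${org.apache.spark.SPARK_VERSION} " +
  s"(built with Scala ${scala.util.Properties.versionNumberString})")
{code}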






[jira] [Assigned] (SPARK-17686) Propose to print Scala version in "spark-submit --version" command

2016-10-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17686:


Assignee: (was: Apache Spark)

> Propose to print Scala version in "spark-submit --version" command
> --
>
> Key: SPARK-17686
> URL: https://issues.apache.org/jira/browse/SPARK-17686
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Reporter: Saisai Shao
>Priority: Minor
>
> Currently we have a use case that needs to upload different jars to Spark 
> according to the Scala version. For now, only after launching a Spark 
> application can we know which version of Scala it depends on. This makes it 
> hard for services that need to support different Scala + Spark versions to 
> pick the right jars. 
> So here I propose printing out the Scala version along with the Spark version 
> in "spark-submit --version", so that users can leverage this output to make 
> the choice without needing to launch an application.






[jira] [Commented] (SPARK-17686) Propose to print Scala version in "spark-submit --version" command

2016-10-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570572#comment-15570572
 ] 

Apache Spark commented on SPARK-17686:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/15456

> Propose to print Scala version in "spark-submit --version" command
> --
>
> Key: SPARK-17686
> URL: https://issues.apache.org/jira/browse/SPARK-17686
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Reporter: Saisai Shao
>Priority: Minor
>
> Currently we have a use case that needs to upload different jars to Spark 
> according to the Scala version. For now, only after launching a Spark 
> application can we know which version of Scala it depends on. This makes it 
> hard for services that need to support different Scala + Spark versions to 
> pick the right jars. 
> So here I propose printing out the Scala version along with the Spark version 
> in "spark-submit --version", so that users can leverage this output to make 
> the choice without needing to launch an application.






[jira] [Comment Edited] (SPARK-16827) Stop reporting spill metrics as shuffle metrics

2016-10-12 Thread Gaoxiang Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570446#comment-15570446
 ] 

Gaoxiang Liu edited comment on SPARK-16827 at 10/13/16 1:16 AM:


[~rxin], for this one, I think spill bytes (both memory and disk) and shuffle 
bytes are already logged and reported, right?
Also, if I want to add spill time metrics, do you suggest I create a parent 
class DiskWriteMetrics, have ShuffleWriteMetrics and my new class (e.g. 
SpillWriteMetrics) inherit from it, and then pass the parent class 
(DiskWriteMetrics) to UnsafeSorterSpillWriter 
https://github.com/facebook/FB-Spark/blob/fb-2.0/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L209
?

Or do you suggest renaming the ShuffleWriteMetrics class to something like 
WriteMetrics?


was (Author: dreamworks007):
[~rxin], for this one, if I want to add spill time metrics, do you suggest I 
create a parent class DiskWriteMetrics, have ShuffleWriteMetrics and my new 
class (e.g. SpillWriteMetrics) inherit from it, and then pass the parent class 
(DiskWriteMetrics) to UnsafeSorterSpillWriter 
https://github.com/facebook/FB-Spark/blob/fb-2.0/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L209
?

Or do you suggest renaming the ShuffleWriteMetrics class to something like 
WriteMetrics?
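
To make the question above concrete, a rough sketch of the hierarchy being 
asked about (class names and fields are the commenter's proposal / made up for 
illustration, not an existing Spark API):

{code}
// Hypothetical shared parent for anything that writes to disk; shuffle writes
// and spill writes would then report through the same fields.
abstract class DiskWriteMetrics {
  private var _bytesWritten: Long = 0L
  private var _writeTime: Long = 0L
  def incBytesWritten(v: Long): Unit = _bytesWritten += v
  def incWriteTime(v: Long): Unit = _writeTime += v
  def bytesWritten: Long = _bytesWritten
  def writeTime: Long = _writeTime
}

class ShuffleWriteMetrics extends DiskWriteMetrics
class SpillWriteMetrics extends DiskWriteMetrics
{code}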

> Stop reporting spill metrics as shuffle metrics
> ---
>
> Key: SPARK-16827
> URL: https://issues.apache.org/jira/browse/SPARK-16827
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>Assignee: Brian Cho
>  Labels: performance
> Fix For: 2.1.0
>
>
> One of our Hive jobs looks like this:
> {code}
>  SELECT  userid
>  FROM  table1 a
>  JOIN table2 b
>   ON   a.ds = '2016-07-15'
>   AND  b.ds = '2016-07-15'
>   AND  a.source_id = b.id
> {code}
> After upgrading to Spark 2.0 the job is significantly slower. Digging a little 
> into it, we found that one of the stages produces an excessive amount of 
> shuffle data. Please note that this is a regression from Spark 1.6: Stage 2 of 
> the job, which used to produce 32KB of shuffle data with 1.6, now produces more 
> than 400GB with Spark 2.0. We also tried turning off whole-stage code 
> generation, but that did not help.
> PS - Even if the intermediate shuffle data size is huge, the job still 
> produces accurate output.






[jira] [Comment Edited] (SPARK-16827) Stop reporting spill metrics as shuffle metrics

2016-10-12 Thread Gaoxiang Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570446#comment-15570446
 ] 

Gaoxiang Liu edited comment on SPARK-16827 at 10/13/16 1:14 AM:


[~rxin], for this one, if I want to add spill time metrics, do you suggest I 
create a parent class DiskWriteMetrics, have ShuffleWriteMetrics and my new 
class (e.g. SpillWriteMetrics) inherit from it, and then pass the parent class 
(DiskWriteMetrics) to UnsafeSorterSpillWriter 
https://github.com/facebook/FB-Spark/blob/fb-2.0/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L209
?

Or do you suggest renaming the ShuffleWriteMetrics class to something like 
WriteMetrics?


was (Author: dreamworks007):
[~rxin], for this one, if I want to add spill metrics, do you suggest I create 
a parent class DiskWriteMetrics, have ShuffleWriteMetrics and my new class 
(e.g. SpillWriteMetrics) inherit from it, and then pass the parent class 
(DiskWriteMetrics) to UnsafeSorterSpillWriter 
https://github.com/facebook/FB-Spark/blob/fb-2.0/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L209
?

Or do you suggest renaming the ShuffleWriteMetrics class to something like 
WriteMetrics?

> Stop reporting spill metrics as shuffle metrics
> ---
>
> Key: SPARK-16827
> URL: https://issues.apache.org/jira/browse/SPARK-16827
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>Assignee: Brian Cho
>  Labels: performance
> Fix For: 2.1.0
>
>
> One of our Hive jobs looks like this:
> {code}
>  SELECT  userid
>  FROM  table1 a
>  JOIN table2 b
>   ON   a.ds = '2016-07-15'
>   AND  b.ds = '2016-07-15'
>   AND  a.source_id = b.id
> {code}
> After upgrading to Spark 2.0 the job is significantly slower. Digging a little 
> into it, we found that one of the stages produces an excessive amount of 
> shuffle data. Please note that this is a regression from Spark 1.6: Stage 2 of 
> the job, which used to produce 32KB of shuffle data with 1.6, now produces more 
> than 400GB with Spark 2.0. We also tried turning off whole-stage code 
> generation, but that did not help.
> PS - Even if the intermediate shuffle data size is huge, the job still 
> produces accurate output.






[jira] [Commented] (SPARK-16827) Stop reporting spill metrics as shuffle metrics

2016-10-12 Thread Gaoxiang Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570446#comment-15570446
 ] 

Gaoxiang Liu commented on SPARK-16827:
--

[~rxin], for this one, if I want to add spill metrics, do you suggest I create 
a parent class DiskWriteMetrics, have ShuffleWriteMetrics and my new class 
(e.g. SpillWriteMetrics) inherit from it, and then pass the parent class 
(DiskWriteMetrics) to UnsafeSorterSpillWriter 
https://github.com/facebook/FB-Spark/blob/fb-2.0/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L209
?

Or do you suggest renaming the ShuffleWriteMetrics class to something like 
WriteMetrics?

> Stop reporting spill metrics as shuffle metrics
> ---
>
> Key: SPARK-16827
> URL: https://issues.apache.org/jira/browse/SPARK-16827
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>Assignee: Brian Cho
>  Labels: performance
> Fix For: 2.1.0
>
>
> One of our Hive jobs looks like this:
> {code}
>  SELECT  userid
>  FROM  table1 a
>  JOIN table2 b
>   ON   a.ds = '2016-07-15'
>   AND  b.ds = '2016-07-15'
>   AND  a.source_id = b.id
> {code}
> After upgrading to Spark 2.0 the job is significantly slower. Digging a little 
> into it, we found that one of the stages produces an excessive amount of 
> shuffle data. Please note that this is a regression from Spark 1.6: Stage 2 of 
> the job, which used to produce 32KB of shuffle data with 1.6, now produces more 
> than 400GB with Spark 2.0. We also tried turning off whole-stage code 
> generation, but that did not help.
> PS - Even if the intermediate shuffle data size is huge, the job still 
> produces accurate output.






[jira] [Comment Edited] (SPARK-17074) generate histogram information for column

2016-10-12 Thread Zhenhua Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570366#comment-15570366
 ] 

Zhenhua Wang edited comment on SPARK-17074 at 10/13/16 12:55 AM:
-

Well, I've got stuck here for a few days. I went through the QuantileSummaries 
paper and our code in Spark, and I still don't have any clue how to implement 
the second method and get its bounds.
So I've decided to adopt the first method for now, so that it won't block our 
progress on the CBO work. We can implement the other one in the future.
A PR for a new agg function for string histograms (equi-width) has already been 
sent. I'll start working on this one today and send a PR in the following days. 
Thanks!


was (Author: zenwzh):
Well, I've got stuck here for a few days. I went through the QuantileSummaries 
paper and our code in Spark, and I still don't have any clue how to implement 
the second method and get its bounds.
So I've decided to adopt the first method for now, so that it won't block our 
progress on the CBO work. We can implement the other one in the future.
A PR for a new agg function for string histograms has already been sent. I'll 
start working on this one today and send a PR in the following days. Thanks!

> generate histogram information for column
> -
>
> Key: SPARK-17074
> URL: https://issues.apache.org/jira/browse/SPARK-17074
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>
> We support two kinds of histograms: 
> - Equi-width histogram: We have a fixed width for each column interval in 
> the histogram.  The height of a histogram represents the frequency for those 
> column values in a specific interval.  For this kind of histogram, its height 
> varies for different column intervals. We use the equi-width histogram when 
> the number of distinct values is less than 254.
> - Equi-height histogram: For this histogram, the width of column interval 
> varies.  The heights of all column intervals are the same.  The equi-height 
> histogram is effective in handling skewed data distribution. We use the equi- 
> height histogram when the number of distinct values is equal to or greater 
> than 254.  
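
As a quick illustration (example data made up, not from the ticket): the sketch 
below computes an equi-width histogram, where bucket boundaries are fixed and 
counts vary; an equi-height histogram would instead pick boundaries at 
quantiles so every bucket holds roughly the same number of rows.

{code}
// Equi-width sketch over a small numeric column with 4 buckets.
val values = Seq(1.0, 2.0, 2.0, 3.0, 7.0, 8.0, 9.0, 9.0)
val numBuckets = 4
val (lo, hi) = (values.min, values.max)
val width = (hi - lo) / numBuckets

// Bucket index for each value; the last bucket is closed on the right.
val counts = values
  .map(v => math.min(((v - lo) / width).toInt, numBuckets - 1))
  .groupBy(identity)
  .mapValues(_.size)
// counts: Map(0 -> 3, 1 -> 1, 3 -> 4)  (equal widths, varying heights)
{code}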






[jira] [Comment Edited] (SPARK-17074) generate histogram information for column

2016-10-12 Thread Zhenhua Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570366#comment-15570366
 ] 

Zhenhua Wang edited comment on SPARK-17074 at 10/13/16 12:29 AM:
-

Well, I've got stuck here for a few days. I went through the QuantileSummaries 
paper and our code in Spark, and I still don't have any clue how to implement 
the second method and get its bounds.
So I've decided to adopt the first method for now, so that it won't block our 
progress on the CBO work. We can implement the other one in the future.
A PR for a new agg function for string histograms has already been sent. I'll 
start working on this one today and send a PR in the following days. Thanks!


was (Author: zenwzh):
Well, I've got stuck here for a few days. I went through the QuantileSummaries 
paper and its code in Spark, and I still don't have any clue how to implement 
the second method and get its bounds.
So I've decided to adopt the first method for now, so that it won't block our 
progress on the CBO work. We can implement the other one in the future.
A PR for a new agg function for string histograms has already been sent. I'll 
start working on this one today and send a PR in the following days. Thanks!

> generate histogram information for column
> -
>
> Key: SPARK-17074
> URL: https://issues.apache.org/jira/browse/SPARK-17074
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>
> We support two kinds of histograms: 
> - Equi-width histogram: We have a fixed width for each column interval in 
> the histogram.  The height of a histogram represents the frequency for those 
> column values in a specific interval.  For this kind of histogram, its height 
> varies for different column intervals. We use the equi-width histogram when 
> the number of distinct values is less than 254.
> - Equi-height histogram: For this histogram, the width of column interval 
> varies.  The heights of all column intervals are the same.  The equi-height 
> histogram is effective in handling skewed data distribution. We use the equi- 
> height histogram when the number of distinct values is equal to or greater 
> than 254.  






[jira] [Commented] (SPARK-17074) generate histogram information for column

2016-10-12 Thread Zhenhua Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570366#comment-15570366
 ] 

Zhenhua Wang commented on SPARK-17074:
--

Well, I've got stuck here for a few days. I went through the QuantileSummaries 
paper and its code in Spark, and I still don't have any clue how to implement 
the second method and get its bounds.
So I've decided to adopt the first method for now, so that it won't block our 
progress on the CBO work. We can implement the other one in the future.
A PR for a new agg function for string histograms has already been sent. I'll 
start working on this one today and send a PR in the following days. Thanks!

> generate histogram information for column
> -
>
> Key: SPARK-17074
> URL: https://issues.apache.org/jira/browse/SPARK-17074
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>
> We support two kinds of histograms: 
> - Equi-width histogram: We have a fixed width for each column interval in 
> the histogram.  The height of a histogram represents the frequency for those 
> column values in a specific interval.  For this kind of histogram, its height 
> varies for different column intervals. We use the equi-width histogram when 
> the number of distinct values is less than 254.
> - Equi-height histogram: For this histogram, the width of column interval 
> varies.  The heights of all column intervals are the same.  The equi-height 
> histogram is effective in handling skewed data distribution. We use the equi- 
> height histogram when the number of distinct values is equal to or greater 
> than 254.  






[jira] [Updated] (SPARK-17845) Improve window function frame boundary API in DataFrame

2016-10-12 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-17845:
---
Fix Version/s: 2.1.0

> Improve window function frame boundary API in DataFrame
> ---
>
> Key: SPARK-17845
> URL: https://issues.apache.org/jira/browse/SPARK-17845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.1.0
>
>
> ANSI SQL uses the following to specify the frame boundaries for window 
> functions:
> {code}
> ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> {code}
> In Spark's DataFrame API, we use integer values to indicate relative position:
> - 0 means "CURRENT ROW"
> - -1 means "1 PRECEDING"
> - Long.MinValue means "UNBOUNDED PRECEDING"
> - Long.MaxValue means "UNBOUNDED FOLLOWING"
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetween(Long.MinValue, -3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetween(Long.MinValue, 0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsBetween(0, Long.MaxValue)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> Window.rowsBetween(Long.MinValue, Long.MaxValue)
> {code}
> I think using numeric values to indicate relative positions is actually a 
> good idea, but the reliance on Long.MinValue and Long.MaxValue to indicate 
> unbounded ends is pretty confusing:
> 1. The API is not self-evident. There is no way for a new user to figure out 
> how to indicate an unbounded frame by looking at just the API. The user has 
> to read the doc to figure this out.
> 2. It is weird that Long.MinValue or Long.MaxValue has some special meaning.
> 3. Different languages have different min/max values, e.g. in Python we use 
> -sys.maxsize and +sys.maxsize.
> To make this API less confusing, we have a few options:
> Option 1. Add the following (additional) methods:
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)  // this one exists already
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetweenUnboundedPrecedingAnd(-3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetweenUnboundedPrecedingAndCurrentRow()
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsBetweenCurrentRowAndUnboundedFollowing()
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> Window.rowsBetweenUnboundedPrecedingAndUnboundedFollowing()
> {code}
> This is obviously very verbose, but is very similar to how these functions 
> are done in SQL, and is perhaps the most obvious to end users, especially if 
> they come from a SQL background.
> Option 2. Decouple the specification for frame begin and frame end into two 
> functions. Assume the boundary is unbounded unless specified.
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsFrom(-3).rowsTo(3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsTo(-3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsToCurrent() or Window.rowsTo(0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsFromCurrent() or Window.rowsFrom(0)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> // no need to specify
> {code}
> If we go with Option 2, we should throw exceptions if users specify multiple 
> from's or to's. A variant of Option 2 is to require explicit specification 
> of begin/end even in the case of an unbounded boundary, e.g.:
> {code}
> Window.rowsFromBeginning().rowsTo(-3)
> or
> Window.rowsFromUnboundedPreceding().rowsTo(-3)
> {code}
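A minimal sketch of how the Option 2 builder could behave; the {{RowsFrame}}, 
{{rowsFrom}} and {{rowsTo}} names below are purely illustrative, not an existing API:
{code}
// Unspecified ends default to unbounded; respecifying an end fails fast.
case class RowsFrame(start: Option[Long] = None, end: Option[Long] = None) {
  def rowsFrom(n: Long): RowsFrame = {
    require(start.isEmpty, "frame start specified more than once")
    copy(start = Some(n))
  }
  def rowsTo(n: Long): RowsFrame = {
    require(end.isEmpty, "frame end specified more than once")
    copy(end = Some(n))
  }
}

RowsFrame().rowsFrom(-3).rowsTo(3)  // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
RowsFrame().rowsTo(-3)              // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
{code}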



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17845) Improve window function frame boundary API in DataFrame

2016-10-12 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-17845.

Resolution: Fixed

> Improve window function frame boundary API in DataFrame
> ---
>
> Key: SPARK-17845
> URL: https://issues.apache.org/jira/browse/SPARK-17845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> ANSI SQL uses the following to specify the frame boundaries for window 
> functions:
> {code}
> ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> {code}
> In Spark's DataFrame API, we use integer values to indicate relative position:
> - 0 means "CURRENT ROW"
> - -1 means "1 PRECEDING"
> - Long.MinValue means "UNBOUNDED PRECEDING"
> - Long.MaxValue means "UNBOUNDED FOLLOWING"
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetween(Long.MinValue, -3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetween(Long.MinValue, 0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsBetween(0, Long.MaxValue)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> Window.rowsBetween(Long.MinValue, Long.MaxValue)
> {code}
> I think using numeric values to indicate relative positions is actually a 
> good idea, but the reliance on Long.MinValue and Long.MaxValue to indicate 
> unbounded ends is pretty confusing:
> 1. The API is not self-evident. There is no way for a new user to figure out 
> how to indicate an unbounded frame by looking at just the API. The user has 
> to read the doc to figure this out.
> 2. It is weird that Long.MinValue or Long.MaxValue has some special meaning.
> 3. Different languages have different min/max values, e.g. in Python we use 
> -sys.maxsize and +sys.maxsize.
> To make this API less confusing, we have a few options:
> Option 1. Add the following (additional) methods:
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)  // this one exists already
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetweenUnboundedPrecedingAnd(-3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetweenUnboundedPrecedingAndCurrentRow()
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsBetweenCurrentRowAndUnboundedFollowing()
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> Window.rowsBetweenUnboundedPrecedingAndUnboundedFollowing()
> {code}
> This is obviously very verbose, but is very similar to how these functions 
> are done in SQL, and is perhaps the most obvious to end users, especially if 
> they come from a SQL background.
> Option 2. Decouple the specification for frame begin and frame end into two 
> functions. Assume the boundary is unbounded unless specified.
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsFrom(-3).rowsTo(3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsTo(-3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsToCurrent() or Window.rowsTo(0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsFromCurrent() or Window.rowsFrom(0)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> // no need to specify
> {code}
> If we go with Option 2, we should throw exceptions if users specify multiple 
> from's or to's. A variant of Option 2 is to require explicit specification 
> of begin/end even in the case of an unbounded boundary, e.g.:
> {code}
> Window.rowsFromBeginning().rowsTo(-3)
> or
> Window.rowsFromUnboundedPreceding().rowsTo(-3)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-15408) Spark streaming app crashes with NotLeaderForPartitionException

2016-10-12 Thread Cody Koeninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger closed SPARK-15408.
--
Resolution: Cannot Reproduce

> Spark streaming app crashes with NotLeaderForPartitionException 
> 
>
> Key: SPARK-15408
> URL: https://issues.apache.org/jira/browse/SPARK-15408
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: Ubuntu 64 bit
>Reporter: Johny Mathew
>Priority: Critical
>
> We have a spark streaming application reading from kafka (with Kafka Direct 
> API) and it crashed with the exception shown in the next paragraph. We have a 
> 5 node kafka cluster with 19 partitions (replication factor 3). Even though 
> the spark application crashed, the other kafka consumer apps were running 
> fine. Only one of the 5 kafka nodes was not working correctly (it did not go 
> down).
> /opt/hadoop/bin/yarn application -status application_1463151451543_0007
> 16/05/13 20:09:56 INFO client.RMProxy: Connecting to ResourceManager at 
> /172.16.130.189:8050
> Application Report :
>   Application-Id : application_1463151451543_0007
>   Application-Name : com.ibm.alchemy.eventgen.EventGenMetrics
>   Application-Type : SPARK
>   User : stack
>   Queue : default
>   Start-Time : 1463155034571
>   Finish-Time : 1463155310520
>   Progress : 100%
>   State : FINISHED
>   Final-State : FAILED
>   Tracking-URL : N/A
>   RPC Port : 0
>   AM Host : 172.16.130.188
>   Aggregate Resource Allocation : 9562329 MB-seconds, 2393 vcore-seconds
>   Diagnostics : User class threw exception: 
> org.apache.spark.SparkException: 
> ArrayBuffer(kafka.common.NotLeaderForPartitionException, 
> kafka.common.NotLeaderForPartitionException, 
> kafka.common.NotLeaderForPartitionException, 
> kafka.common.NotLeaderForPartitionException, 
> kafka.common.NotLeaderForPartitionException, 
> kafka.common.NotLeaderForPartitionException, 
> kafka.common.NotLeaderForPartitionException, 
> kafka.common.NotLeaderForPartitionException, org.apache.spark.SparkException: 
> Couldn't find leader offsets for Set([alchemy-metrics,17], 
> [alchemy-metrics,10], [alchemy-metrics,3], [alchemy-metrics,4], 
> [alchemy-metrics,9], [alchemy-metrics,15], [alchemy-metrics,18], 
> [alchemy-metrics,5]))
> We cleared the checkpoint and restarted the application, but it crashed again. In 
> the end we found the misbehaving kafka node and restarted it, which 
> fixed the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15272) DirectKafkaInputDStream doesn't work with window operation

2016-10-12 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570221#comment-15570221
 ] 

Cody Koeninger commented on SPARK-15272:


Checking to see if the 0.10 consumer's handling of preferred locations 
http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#locationstrategies
 addresses this.
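For reference, a minimal sketch of the 0.10 direct stream with an explicit location 
strategy; the broker address, group id and topic name are placeholders, and an 
existing StreamingContext {{ssc}} is assumed:
{code}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group")

// PreferConsistent spreads partitions evenly across executors instead of
// pinning tasks to the (non-worker) broker hosts.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("topicA"), kafkaParams))
{code}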

> DirectKafkaInputDStream doesn't work with window operation
> --
>
> Key: SPARK-15272
> URL: https://issues.apache.org/jira/browse/SPARK-15272
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.2
>Reporter: Lubomir Nerad
>
> Using Kafka direct {{DStream}} with simple window operation like:
> {code:java}
> kafkaDStream.window(Durations.milliseconds(1),
> Durations.milliseconds(1000))
> .print();
> {code}
> with 1s batch duration either freezes after several seconds or lags terribly 
> (depending on cluster mode).
> This happens when Kafka brokers are not part of the Spark cluster (they are 
> on different nodes). The {{KafkaRDD}} still reports them as preferred 
> locations. This doesn't seem to be a problem in non-window scenarios but with 
> window it conflicts with delay scheduling algorithm implemented in 
> {{TaskSetManager}}. It either significantly delays (Yarn mode) or completely 
> drains (Spark mode) resource offers with {{TaskLocality.ANY}} which are 
> needed to process tasks with these Kafka broker aligned preferred locations. 
> When the delay scheduling algorithm is switched off ({{spark.locality.wait=0}}), 
> the example works correctly.
> I think that the {{KafkaRDD}} shouldn't report preferred locations if the 
> brokers don't correspond to worker nodes or allow the reporting of preferred 
> locations to be switched off. Also it would be good if the delay scheduling 
> algorithm didn't drain / delay offers in the case where the tasks have unmatched 
> preferred locations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15272) DirectKafkaInputDStream doesn't work with window operation

2016-10-12 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570221#comment-15570221
 ] 

Cody Koeninger edited comment on SPARK-15272 at 10/12/16 11:33 PM:
---

Does the 0.10 consumer's handling of preferred locations 
http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#locationstrategies
 address this for you?


was (Author: c...@koeninger.org):
Checking to see if the 0.10 consumer's handling of preferred locations 
http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#locationstrategies
 addresses this.

> DirectKafkaInputDStream doesn't work with window operation
> --
>
> Key: SPARK-15272
> URL: https://issues.apache.org/jira/browse/SPARK-15272
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.2
>Reporter: Lubomir Nerad
>
> Using Kafka direct {{DStream}} with simple window operation like:
> {code:java}
> kafkaDStream.window(Durations.milliseconds(1),
> Durations.milliseconds(1000))
> .print();
> {code}
> with 1s batch duration either freezes after several seconds or lags terribly 
> (depending on cluster mode).
> This happens when Kafka brokers are not part of the Spark cluster (they are 
> on different nodes). The {{KafkaRDD}} still reports them as preferred 
> locations. This doesn't seem to be a problem in non-window scenarios but with 
> window it conflicts with delay scheduling algorithm implemented in 
> {{TaskSetManager}}. It either significantly delays (Yarn mode) or completely 
> drains (Spark mode) resource offers with {{TaskLocality.ANY}} which are 
> needed to process tasks with these Kafka broker aligned preferred locations. 
> When the delay scheduling algorithm is switched off ({{spark.locality.wait=0}}), 
> the example works correctly.
> I think that the {{KafkaRDD}} shouldn't report preferred locations if the 
> brokers don't correspond to worker nodes or allow the reporting of preferred 
> locations to be switched off. Also it would be good if the delay scheduling 
> algorithm didn't drain / delay offers in the case where the tasks have unmatched 
> preferred locations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11698) Add option to ignore kafka messages that are out of limit rate

2016-10-12 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570208#comment-15570208
 ] 

Cody Koeninger commented on SPARK-11698:


Would a custom ConsumerStrategy for the new consumer added in SPARK-12177 allow 
you to address this issue?  You could supply a Consumer implementation that 
overrides poll.

> Add option to ignore kafka messages that are out of limit rate
> --
>
> Key: SPARK-11698
> URL: https://issues.apache.org/jira/browse/SPARK-11698
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Liang-Chi Hsieh
>
> With spark.streaming.kafka.maxRatePerPartition, we can control the max rate 
> limit. However, we cannot ignore the messages that exceed the limit; those messages 
> will be consumed in the next iteration. We have a use case where we need to ignore 
> these messages and process only the latest messages in the next iteration.
> In other words, we simply want to consume part of the messages in each iteration 
> and ignore the remaining messages that are not consumed.
> We add an option for this purpose.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17850) HadoopRDD should not swallow EOFException

2016-10-12 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-17850:
-
Fix Version/s: 2.1.0
   2.0.2

> HadoopRDD should not swallow EOFException
> -
>
> Key: SPARK-17850
> URL: https://issues.apache.org/jira/browse/SPARK-17850
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1, 1.5.2, 1.6.2, 2.0.1
>Reporter: Shixiong Zhu
>  Labels: correctness
> Fix For: 2.0.2, 2.1.0
>
>
> The code in 
> https://github.com/apache/spark/blob/2bcd5d5ce3eaf0eb1600a12a2b55ddb40927533b/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L256
>  catches EOFException and marks the RecordReader as finished. However, in some cases, 
> RecordReader will throw EOFException to indicate the stream is corrupted. See 
> the following stack trace as an example:
> {code}
> Caused by: java.io.EOFException: Unexpected end of input stream
>   at 
> org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:145)
>   at 
> org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
>   at java.io.InputStream.read(InputStream.java:101)
>   at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:211)
>   at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
>   at 
> org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:164)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.hasNext(HadoopFileLinesReader.scala:50)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:97)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:134)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:97)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Then HadoopRDD doesn't fail the job when files are corrupted (e.g., corrupted 
> gzip files).
> Note: NewHadoopRDD doesn't have this issue.
> This is reported by Bilal Aslam.
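A simplified, self-contained sketch of the pattern described above (not the actual 
HadoopRDD code); {{readNext}} stands in for {{reader.next(key, value)}}:
{code}
import java.io.EOFException

def hasNext(readNext: () => Boolean): Boolean = {
  try {
    readNext()   // false on a normal end of input
  } catch {
    case _: EOFException =>
      // A truncated or corrupted compressed stream also throws EOFException,
      // so the corruption is silently treated as a normal end of input here.
      false
  }
}
{code}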



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10320) Kafka Support new topic subscriptions without requiring restart of the streaming context

2016-10-12 Thread Cody Koeninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger resolved SPARK-10320.

   Resolution: Fixed
Fix Version/s: 2.0.0

SPARK-12177 added the new consumer, which supports SubscribePattern.
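For reference, a minimal sketch of pattern-based subscription with the 0.10 
integration; {{ssc}} and {{kafkaParams}} are assumed to be set up as in the 
integration guide, and the topic pattern is a placeholder:
{code}
import java.util.regex.Pattern
import org.apache.spark.streaming.kafka010._

// Topics matching the pattern are picked up without restarting the streaming context.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.SubscribePattern[String, String](Pattern.compile("topic.*"), kafkaParams))
{code}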

> Kafka Support new topic subscriptions without requiring restart of the 
> streaming context
> 
>
> Key: SPARK-10320
> URL: https://issues.apache.org/jira/browse/SPARK-10320
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Sudarshan Kadambi
> Fix For: 2.0.0
>
>
> Spark Streaming lacks the ability to subscribe to newer topics or unsubscribe 
> from current ones once the streaming context has been started. Restarting the 
> streaming context increases the latency of update handling.
> Consider a streaming application subscribed to n topics. Let's say 1 of the 
> topics is no longer needed in streaming analytics and hence should be 
> dropped. We could do this by stopping the streaming context, removing that 
> topic from the topic list and restarting the streaming context. Since with 
> some DStreams such as DirectKafkaStream, the per-partition offsets are 
> maintained by Spark, we should be able to resume uninterrupted (I think?) 
> from where we left off with a minor delay. However, in instances where 
> expensive state initialization (from an external datastore) may be needed for 
> datasets published to all topics, before streaming updates can be applied to 
> it, it is more convenient to only subscribe or unsubscribe to the incremental 
> changes to the topic list. Without such a feature, updates go unprocessed for 
> longer than they need to be, thus affecting QoS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-9947) Separate Metadata and State Checkpoint Data

2016-10-12 Thread Cody Koeninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger closed SPARK-9947.
-
Resolution: Won't Fix

The direct DStream API already gives access to offsets, and it seems clear that 
most future work on streaming checkpointing is going to be focused on 
structured streaming (SPARK-15406).

> Separate Metadata and State Checkpoint Data
> ---
>
> Key: SPARK-9947
> URL: https://issues.apache.org/jira/browse/SPARK-9947
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.4.1
>Reporter: Dan Dutrow
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Problem: When updating an application that has checkpointing enabled to 
> support the updateStateByKey and 24/7 operation functionality, you encounter 
> the problem where you might like to maintain state data between restarts but 
> delete the metadata containing execution state. 
> If checkpoint data exists between code redeployment, the program may not 
> execute properly or at all. My current workaround for this issue is to wrap 
> updateStateByKey with my own function that persists the state after every 
> update to my own separate directory. (That allows me to delete the checkpoint 
> with its metadata before redeploying) Then, when I restart the application, I 
> initialize the state with this persisted data. This incurs additional 
> overhead due to persisting the same data twice: once in the checkpoint and 
> once in my persisted data folder. 
> If Kafka Direct API offsets could be stored in another separate checkpoint 
> directory, that would help address the problem of having to blow that away 
> between code redeployments as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16827) Stop reporting spill metrics as shuffle metrics

2016-10-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570193#comment-15570193
 ] 

Apache Spark commented on SPARK-16827:
--

User 'dafrista' has created a pull request for this issue:
https://github.com/apache/spark/pull/15455

> Stop reporting spill metrics as shuffle metrics
> ---
>
> Key: SPARK-16827
> URL: https://issues.apache.org/jira/browse/SPARK-16827
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>Assignee: Brian Cho
>  Labels: performance
> Fix For: 2.1.0
>
>
> One of our Hive jobs looks like this -
> {code}
>  SELECT  userid
>  FROM  table1 a
>  JOIN table2 b
>   ONa.ds = '2016-07-15'
>   AND  b.ds = '2016-07-15'
>   AND  a.source_id = b.id
> {code}
> After upgrading to Spark 2.0, the job is significantly slower.  Digging a little 
> into it, we found out that one of the stages produces an excessive amount of 
> shuffle data.  Please note that this is a regression from Spark 1.6. Stage 2 
> of the job, which used to produce 32KB of shuffle data with 1.6, now produces 
> more than 400GB with Spark 2.0. We also tried turning off whole-stage code 
> generation, but that did not help. 
> PS - Even if the intermediate shuffle data size is huge, the job still 
> produces accurate output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14516) Clustering evaluator

2016-10-12 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570185#comment-15570185
 ] 

Saikat Kanjilal commented on SPARK-14516:
-

Hello,
I'm new to Spark and am interested in helping build a general-purpose 
clustering evaluator. Is the goal of this to use the metrics to evaluate overall 
clustering quality? [~akamal] [~josephkb], let me know how I can help.



> Clustering evaluator
> 
>
> Key: SPARK-14516
> URL: https://issues.apache.org/jira/browse/SPARK-14516
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> MLlib does not have any general-purpose clustering metrics with a ground 
> truth.
> In 
> [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics),
>  there are several kinds of metrics for this.
> It may be meaningful to add some clustering metrics into MLlib.
> This should be added as a {{ClusteringEvaluator}} class of extending 
> {{Evaluator}} in spark.ml.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8337) KafkaUtils.createDirectStream for python is lacking API/feature parity with the Scala/Java version

2016-10-12 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570186#comment-15570186
 ] 

Cody Koeninger commented on SPARK-8337:
---

Can this be closed, given that the subtasks are resolved and any future 
discussion of Python DStream Kafka support seems to be in SPARK-16534?

> KafkaUtils.createDirectStream for python is lacking API/feature parity with 
> the Scala/Java version
> --
>
> Key: SPARK-8337
> URL: https://issues.apache.org/jira/browse/SPARK-8337
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Streaming
>Affects Versions: 1.4.0
>Reporter: Amit Ramesh
>Priority: Critical
>
> See the following thread for context.
> http://apache-spark-developers-list.1001551.n3.nabble.com/Re-Spark-1-4-Python-API-for-getting-Kafka-offsets-in-direct-mode-tt12714.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5505) ConsumerRebalanceFailedException from Kafka consumer

2016-10-12 Thread Cody Koeninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger closed SPARK-5505.
-
Resolution: Won't Fix

The old kafka High Level Consumer has been abandoned at this point.  
SPARK-12177 and SPARK-15406 use the new consumer api.

> ConsumerRebalanceFailedException from Kafka consumer
> 
>
> Key: SPARK-5505
> URL: https://issues.apache.org/jira/browse/SPARK-5505
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.2.0
> Environment: CentOS6 / Linux 2.6.32-358.2.1.el6.x86_64
> java version "1.7.0_21"
> Scala compiler version 2.9.3
> 2 cores Intel(R) Xeon(R) CPU E5620  @ 2.40GHz / 16G RAM
> VMWare VM.
>Reporter: Greg Temchenko
>Priority: Critical
>
> From time to time Spark streaming produces a ConsumerRebalanceFailedException 
> and stops receiving messages. After that, all subsequent RDDs are empty.
> {code}
> 15/01/30 18:18:36 ERROR consumer.ZookeeperConsumerConnector: 
> [terran_vmname-1422670149779-243b4e10], error during syncedRebalance
> kafka.common.ConsumerRebalanceFailedException: 
> terran_vmname-1422670149779-243b4e10 can't rebalance after 4 retries
>   at 
> kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(ZookeeperConsumerConnector.scala:432)
>   at 
> kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener$$anon$1.run(ZookeeperConsumerConnector.scala:355)
> {code}
> The problem is also described in the mailing list: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Error-when-Spark-streaming-consumes-from-Kafka-td19570.html
> As I understand it, this is a critical blocker for kafka-spark streaming production 
> use.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5718) Add native offset management for ReliableKafkaReceiver

2016-10-12 Thread Cody Koeninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger resolved SPARK-5718.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

SPARK-12177 added support for the native Kafka offset commit API.
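For reference, a minimal sketch of committing offsets back to Kafka with the 0.10 
direct stream; {{stream}} is assumed to be a DStream created via 
KafkaUtils.createDirectStream:
{code}
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch ...
  // Commit to Kafka's own offset storage once the batch has been handled.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
{code}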

> Add native offset management for ReliableKafkaReceiver
> --
>
> Key: SPARK-5718
> URL: https://issues.apache.org/jira/browse/SPARK-5718
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: Saisai Shao
> Fix For: 2.0.0
>
>
> Kafka 0.8.2 supports native offset management instead of ZK, which gives 
> better performance. For now, ReliableKafkaReceiver relies on ZK to manage 
> the offsets, which can become a bottleneck if the injection rate is 
> high (once per 200ms by default). So, in order to get better performance 
> as well as stay consistent with Kafka, add native offset management to 
> ReliableKafkaReceiver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17850) HadoopRDD should not swallow EOFException

2016-10-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570154#comment-15570154
 ] 

Apache Spark commented on SPARK-17850:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/15454

> HadoopRDD should not swallow EOFException
> -
>
> Key: SPARK-17850
> URL: https://issues.apache.org/jira/browse/SPARK-17850
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1, 1.5.2, 1.6.2, 2.0.1
>Reporter: Shixiong Zhu
>  Labels: correctness
>
> The code in 
> https://github.com/apache/spark/blob/2bcd5d5ce3eaf0eb1600a12a2b55ddb40927533b/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L256
>  catches EOFException and marks the RecordReader as finished. However, in some cases, 
> RecordReader will throw EOFException to indicate the stream is corrupted. See 
> the following stack trace as an example:
> {code}
> Caused by: java.io.EOFException: Unexpected end of input stream
>   at 
> org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:145)
>   at 
> org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
>   at java.io.InputStream.read(InputStream.java:101)
>   at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:211)
>   at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
>   at 
> org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:164)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.hasNext(HadoopFileLinesReader.scala:50)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:97)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:134)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:97)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Then HadoopRDD doesn't fail the job when files are corrupted (e.g., corrupted 
> gzip files).
> Note: NewHadoopRDD doesn't have this issue.
> This is reported by Bilal Aslam.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12372) Document limitations of MLlib local linear algebra

2016-10-12 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570152#comment-15570152
 ] 

Saikat Kanjilal commented on SPARK-12372:
-

@josephkb, I'm new to contributing to Spark; is this something I can help with?

> Document limitations of MLlib local linear algebra
> --
>
> Key: SPARK-12372
> URL: https://issues.apache.org/jira/browse/SPARK-12372
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Affects Versions: 1.5.2
>Reporter: Christos Iraklis Tsatsoulis
>
> This JIRA is now for documenting limitations of MLlib's local linear algebra 
> types.  Basically, we should make it clear in the user guide that they 
> provide simple functionality but are not a full-fledged local linear algebra library. 
>  We should also recommend libraries for users to use in the meantime: 
> probably Breeze for Scala (and Java?) and numpy/scipy for Python.
> *Original JIRA title*: Unary operator "-" fails for MLlib vectors
> *Original JIRA text, as an example of the need for better docs*:
> Consider the following snippet in pyspark 1.5.2:
> {code:none}
> >>> from pyspark.mllib.linalg import Vectors
> >>> x = Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0])
> >>> x
> DenseVector([0.0, 1.0, 0.0, 7.0, 0.0])
> >>> -x
> Traceback (most recent call last):
>   File "", line 1, in 
> TypeError: func() takes exactly 2 arguments (1 given)
> >>> y = Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0])
> >>> y
> DenseVector([2.0, 0.0, 3.0, 4.0, 5.0])
> >>> x-y
> DenseVector([-2.0, 1.0, -3.0, 3.0, -5.0])
> >>> -y+x
> Traceback (most recent call last):
>   File "", line 1, in 
> TypeError: func() takes exactly 2 arguments (1 given)
> >>> -1*x
> DenseVector([-0.0, -1.0, -0.0, -7.0, -0.0])
> {code}
> Clearly, the unary operator {{-}} (minus) for vectors fails, giving errors 
> for expressions like {{-x}} and {{-y+x}}, despite the fact that {{x-y}} 
> behaves as expected.
> The last operation, {{-1*x}}, although mathematically "correct", includes 
> minus signs for the zero entries, which again is normally not expected.
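In the meantime, a small sketch of the suggested Breeze workaround for Scala; it 
assumes Breeze is on the classpath (it is a transitive dependency of MLlib):
{code}
import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.mllib.linalg.Vectors

val x = Vectors.dense(0.0, 1.0, 0.0, 7.0, 0.0)

// Convert to Breeze (which supports unary minus), negate, and convert back.
val negated = Vectors.dense((-BDV(x.toArray)).toArray)
{code}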



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16827) Stop reporting spill metrics as shuffle metrics

2016-10-12 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16827.
-
   Resolution: Fixed
 Assignee: Brian Cho
Fix Version/s: 2.1.0

> Stop reporting spill metrics as shuffle metrics
> ---
>
> Key: SPARK-16827
> URL: https://issues.apache.org/jira/browse/SPARK-16827
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>Assignee: Brian Cho
>  Labels: performance
> Fix For: 2.1.0
>
>
> One of our Hive jobs looks like this -
> {code}
>  SELECT  userid
>  FROM  table1 a
>  JOIN table2 b
>   ONa.ds = '2016-07-15'
>   AND  b.ds = '2016-07-15'
>   AND  a.source_id = b.id
> {code}
> After upgrading to Spark 2.0, the job is significantly slower.  Digging a little 
> into it, we found out that one of the stages produces an excessive amount of 
> shuffle data.  Please note that this is a regression from Spark 1.6. Stage 2 
> of the job, which used to produce 32KB of shuffle data with 1.6, now produces 
> more than 400GB with Spark 2.0. We also tried turning off whole-stage code 
> generation, but that did not help. 
> PS - Even if the intermediate shuffle data size is huge, the job still 
> produces accurate output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-12 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570122#comment-15570122
 ] 

Saikat Kanjilal commented on SPARK-9487:


Hello All,
Can I help with this in any way?
Thanks

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., a random number 
> generator, this will lead to different results in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.
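A minimal sketch of what aligning on one value could look like on the Scala side; 
{{local[4]}} here simply mirrors the Python tests and is illustrative only:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Use the same parallelism as the Python tests so that partition-dependent
// operations (e.g. per-partition random number generators) agree.
val conf = new SparkConf().setMaster("local[4]").setAppName("unit-test")
val sc = new SparkContext(conf)
{code}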



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10815) API design: data sources and sinks

2016-10-12 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570139#comment-15570139
 ] 

Cody Koeninger commented on SPARK-10815:


Another unfortunate thing about the Sink API is that it only exposes batch IDs, 
with no way that I'm aware of to get at (e.g. Kafka) offsets.

Access to offsets for sinks that can take advantage of it would be preferable, 
as it's better for disaster recovery and doesn't lock you into a particular 
streaming engine.

> API design: data sources and sinks
> --
>
> Key: SPARK-10815
> URL: https://issues.apache.org/jira/browse/SPARK-10815
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Reynold Xin
>
> The existing (in 2.0) source/sink interface for structured streaming depends 
> on RDDs. This dependency has two issues:
> 1. The RDD interface is wide and difficult to stabilize across versions. This 
> is similar to point 1 in https://issues.apache.org/jira/browse/SPARK-15689. 
> Ideally, a source/sink implementation created for Spark 2.x should work in 
> Spark 10.x, assuming the JVM is still around.
> 2. It is difficult to swap in/out a different execution engine.
> The purpose of this ticket is to create a stable interface that addresses the 
> above two.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14212) Add configuration element for --packages option

2016-10-12 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570125#comment-15570125
 ] 

Saikat Kanjilal edited comment on SPARK-14212 at 10/12/16 10:53 PM:


holdenk@ can I help out with this?


was (Author: kanjilal):
heldenk@ can I help out with this?

> Add configuration element for --packages option
> ---
>
> Key: SPARK-14212
> URL: https://issues.apache.org/jira/browse/SPARK-14212
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, PySpark
>Affects Versions: 1.6.1
>Reporter: Russell Jurney
>Priority: Trivial
>  Labels: config, starter
>
> I use PySpark with the --packages option, for instance to load support for 
> CSV: 
> pyspark --packages com.databricks:spark-csv_2.10:1.4.0
> I would like to not have to set this every time at the command line, so a 
> corresponding element for --packages in the configuration file 
> spark-defaults.conf would be good to have.
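For illustration, a sketch of what such an entry could look like in 
spark-defaults.conf; the key name spark.jars.packages is how later Spark versions 
expose this, so treat it as an assumption for 1.6.1:
{code}
# spark-defaults.conf (sketch)
spark.jars.packages  com.databricks:spark-csv_2.10:1.4.0
{code}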



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14212) Add configuration element for --packages option

2016-10-12 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570125#comment-15570125
 ] 

Saikat Kanjilal commented on SPARK-14212:
-

heldenk@ can I help out with this?

> Add configuration element for --packages option
> ---
>
> Key: SPARK-14212
> URL: https://issues.apache.org/jira/browse/SPARK-14212
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, PySpark
>Affects Versions: 1.6.1
>Reporter: Russell Jurney
>Priority: Trivial
>  Labels: config, starter
>
> I use PySpark with the --packages option, for instance to load support for 
> CSV: 
> pyspark --packages com.databricks:spark-csv_2.10:1.4.0
> I would like to not have to set this every time at the command line, so a 
> corresponding element for --packages in the configuration file 
> spark-defaults.conf would be good to have.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14212) Add configuration element for --packages option

2016-10-12 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570125#comment-15570125
 ] 

Saikat Kanjilal edited comment on SPARK-14212 at 10/12/16 10:54 PM:


@holdenk can I help out with this?


was (Author: kanjilal):
holdenk@ can I help out with this?

> Add configuration element for --packages option
> ---
>
> Key: SPARK-14212
> URL: https://issues.apache.org/jira/browse/SPARK-14212
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, PySpark
>Affects Versions: 1.6.1
>Reporter: Russell Jurney
>Priority: Trivial
>  Labels: config, starter
>
> I use PySpark with the --packages option, for instance to load support for 
> CSV: 
> pyspark --packages com.databricks:spark-csv_2.10:1.4.0
> I would like to not have to set this every time at the command line, so a 
> corresponding element for --packages in the configuration file 
> spark-defaults.conf would be good to have.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17896) Dataset groupByKey + reduceGroups fails with codegen-related exception

2016-10-12 Thread Adam Breindel (JIRA)
Adam Breindel created SPARK-17896:
-

 Summary: Dataset groupByKey + reduceGroups fails with 
codegen-related exception
 Key: SPARK-17896
 URL: https://issues.apache.org/jira/browse/SPARK-17896
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1
 Environment: Databricks, MacOS
Reporter: Adam Breindel


Possible regression: works on 2.0, fails on 2.0.1.
The following code raises an exception related to whole-stage codegen:

case class Zip(city: String, zip: String, state: String)

val z1 = Zip("New York", "1", "NY")
val z2 = Zip("New York", "10001", "NY")
val z3 = Zip("Chicago", "60606", "IL")

val zips = sc.parallelize(Seq(z1, z2, z3)).toDS

zips.groupByKey(_.state).reduceGroups((z1, z2) => Zip("*", z1.zip + " " + 
z2.zip, z1.state)).show



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17782) Kafka 010 test is flaky

2016-10-12 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-17782.
--
Resolution: Fixed
  Assignee: Cody Koeninger

Resolved by https://github.com/apache/spark/pull/15401

> Kafka 010 test is flaky
> ---
>
> Key: SPARK-17782
> URL: https://issues.apache.org/jira/browse/SPARK-17782
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Herman van Hovell
>Assignee: Cody Koeninger
> Fix For: 2.0.2, 2.1.0
>
>
> The Kafka 010 DirectKafkaStreamSuite {{pattern based subscription}} is flaky. 
> We should disable it, and figure out how we can improve it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17782) Kafka 010 test is flaky

2016-10-12 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-17782:
-
Fix Version/s: 2.1.0
   2.0.2

> Kafka 010 test is flaky
> ---
>
> Key: SPARK-17782
> URL: https://issues.apache.org/jira/browse/SPARK-17782
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Herman van Hovell
> Fix For: 2.0.2, 2.1.0
>
>
> The Kafka 010 DirectKafkaStreamSuite {{pattern based subscription}} is flaky. 
> We should disable it, and figure out how we can improve it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17782) Kafka 010 test is flaky

2016-10-12 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-17782:
-
Affects Version/s: 2.0.0
   2.0.1

> Kafka 010 test is flaky
> ---
>
> Key: SPARK-17782
> URL: https://issues.apache.org/jira/browse/SPARK-17782
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Herman van Hovell
> Fix For: 2.0.2, 2.1.0
>
>
> The Kafka 010 DirectKafkaStreamSuite {{pattern based subscription}} is flaky. 
> We should disable it, and figure out how we can improve it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17892) Query in CTAS is Optimized Twice (branch-2.0)

2016-10-12 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570042#comment-15570042
 ] 

Xiao Li commented on SPARK-17892:
-

Will do it! : )

> Query in CTAS is Optimized Twice (branch-2.0)
> -
>
> Key: SPARK-17892
> URL: https://issues.apache.org/jira/browse/SPARK-17892
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Yin Huai
>Assignee: Xiao Li
>Priority: Blocker
>
> This tracks the work that fixes the problem shown in  
> https://issues.apache.org/jira/browse/SPARK-17409 to branch 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17894) Uniqueness of TaskSetManager name

2016-10-12 Thread Eren Avsarogullari (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eren Avsarogullari updated SPARK-17894:
---
Description: 
TaskSetManager should have a unique name to avoid adding duplicate ones to *Pool* 
via *SchedulableBuilder*. This problem surfaced with 
https://issues.apache.org/jira/browse/SPARK-17759 and please find discussion: 
https://github.com/apache/spark/pull/15326

There is a 1:1 relationship between Stage Attempt Id and TaskSetManager, so 
taskSet.Id, which covers both stageId and stageAttemptId, looks suitable for the 
TaskSetManager name as well. 

*Current TaskSetManager Name* : 
{code:java} var name = "TaskSet_" + taskSet.stageId.toString{code}
*Sample*: TaskSet_0

*Proposed TaskSetManager Name* : 
{code:java} var name = "TaskSet_" + taskSet.Id (stageId + "." + stageAttemptId) 
{code}
*Sample* : TaskSet_0.0

cc [~kayousterhout] [~markhamstra]

  was:
TaskSetManager should have unique name to avoid adding duplicate ones to *Pool* 
via *SchedulableBuilder*. This problem surfaced with 
https://issues.apache.org/jira/browse/SPARK-17759 and please find discussion: 
https://github.com/apache/spark/pull/15326

There is 1x1 relationship between Stage Attempt Id and TaskSetManager so 
taskSet.Id covering both stageId and stageAttemptId looks to be used for 
TaskSetManager as well. 

What do you think about proposed TaskSetManager Name?

*Current TaskSetManager Name* : 
{code:java} var name = "TaskSet_" + taskSet.stageId.toString{code}
*Sample*: TaskSet_0

*Proposed TaskSetManager Name* : 
{code:java} var name = "TaskSet_" + taskSet.Id (stageId + "." + stageAttemptId) 
{code}
*Sample* : TaskSet_0.0


> Uniqueness of TaskSetManager name
> -
>
> Key: SPARK-17894
> URL: https://issues.apache.org/jira/browse/SPARK-17894
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Eren Avsarogullari
>
> TaskSetManager should have a unique name to avoid adding duplicate ones to 
> *Pool* via *SchedulableBuilder*. This problem surfaced with 
> https://issues.apache.org/jira/browse/SPARK-17759 and please find discussion: 
> https://github.com/apache/spark/pull/15326
> There is a 1:1 relationship between Stage Attempt Id and TaskSetManager, so 
> taskSet.Id, which covers both stageId and stageAttemptId, looks suitable for the 
> TaskSetManager name as well. 
> *Current TaskSetManager Name* : 
> {code:java} var name = "TaskSet_" + taskSet.stageId.toString{code}
> *Sample*: TaskSet_0
> *Proposed TaskSetManager Name* : 
> {code:java} var name = "TaskSet_" + taskSet.Id (stageId + "." + 
> stageAttemptId) {code}
> *Sample* : TaskSet_0.0
> cc [~kayousterhout] [~markhamstra]
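A minimal, self-contained sketch of the proposed naming; the case class below is 
only an illustrative stand-in for the scheduler's TaskSet:
{code}
case class TaskSetId(stageId: Int, stageAttemptId: Int) {
  val id: String = stageId + "." + stageAttemptId
}

val taskSet = TaskSetId(stageId = 0, stageAttemptId = 0)
val name = "TaskSet_" + taskSet.id   // "TaskSet_0.0" instead of "TaskSet_0"
{code}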



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17895) Improve documentation of "rowsBetween" and "rangeBetween"

2016-10-12 Thread Weiluo Ren (JIRA)
Weiluo Ren created SPARK-17895:
--

 Summary: Improve documentation of "rowsBetween" and "rangeBetween"
 Key: SPARK-17895
 URL: https://issues.apache.org/jira/browse/SPARK-17895
 Project: Spark
  Issue Type: Documentation
  Components: PySpark, SparkR, SQL
Reporter: Weiluo Ren
Priority: Minor


This is an issue found by [~junyangq] when he was fixing SparkR docs.

In WindowSpec we have two methods "rangeBetween" and "rowsBetween" (See 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/WindowSpec.scala#L82]).
 However, the description of "rangeBetween" does not clearly differentiate it 
from "rowsBetween". Even though in 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L109]
 we have a pretty nice description for "RangeFrame" and "RowFrame", which are used 
in "rangeBetween" and "rowsBetween", I cannot find them in the online Spark 
Scala API. 

We could add small examples to the description of "rangeBetween" and 
"rowsBetween" like
{code}
val df = Seq(1,1,2).toDF("id")
df.withColumn("sum", sum('id) over Window.orderBy('id).rangeBetween(0,1)).show
/**
 * It shows
 * +---+---+
 * | id|sum|
 * +---+---+
 * |  1|  4|
 * |  1|  4|
 * |  2|  2|
 * +---+---+
*/

df.withColumn("sum", sum('id) over Window.orderBy('id).rowsBetween(0,1)).show
/**
 * It shows
 * +---+---+
 * | id|sum|
 * +---+---+
 * |  1|  2|
 * |  1|  3|
 * |  2|  2|
 * +---+---+
*/
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17894) Uniqueness of TaskSetManager name

2016-10-12 Thread Eren Avsarogullari (JIRA)
Eren Avsarogullari created SPARK-17894:
--

 Summary: Uniqueness of TaskSetManager name
 Key: SPARK-17894
 URL: https://issues.apache.org/jira/browse/SPARK-17894
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 2.1.0
Reporter: Eren Avsarogullari


TaskSetManager should have a unique name to avoid adding duplicate ones to *Pool* 
via *SchedulableBuilder*. This problem surfaced with 
https://issues.apache.org/jira/browse/SPARK-17759; please see the discussion at 
https://github.com/apache/spark/pull/15326

There is a 1-to-1 relationship between a stage attempt and a TaskSetManager, so 
taskSet.id, which covers both stageId and stageAttemptId, looks suitable as the 
TaskSetManager name as well. 

What do you think about the proposed TaskSetManager name?

*Current TaskSetManager Name* : 
{code:java} var name = "TaskSet_" + taskSet.stageId.toString{code}
*Sample*: TaskSet_0

*Proposed TaskSetManager Name* : 
{code:java} var name = "TaskSet_" + taskSet.id // taskSet.id is stageId + "." + stageAttemptId
{code}
*Sample* : TaskSet_0.0
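
To make the uniqueness argument concrete, here is a small self-contained sketch 
(FakeTaskSet below is a hypothetical stand-in for org.apache.spark.scheduler.TaskSet, 
used only for illustration):

{code}
// Two attempts of the same stage collide under the current naming scheme,
// but stay distinct under the proposed one.
case class FakeTaskSet(stageId: Int, stageAttemptId: Int) {
  def id: String = stageId + "." + stageAttemptId  // mirrors TaskSet.id
}

val attempt0 = FakeTaskSet(stageId = 0, stageAttemptId = 0)
val attempt1 = FakeTaskSet(stageId = 0, stageAttemptId = 1)  // a retry of stage 0

val currentNames  = Seq(attempt0, attempt1).map(ts => "TaskSet_" + ts.stageId)
val proposedNames = Seq(attempt0, attempt1).map(ts => "TaskSet_" + ts.id)

println(currentNames)   // List(TaskSet_0, TaskSet_0)     -> duplicate names in the Pool
println(proposedNames)  // List(TaskSet_0.0, TaskSet_0.1) -> unique names
{code}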



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17675) Add Blacklisting of Executors & Nodes within one TaskSet

2016-10-12 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-17675.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15249
[https://github.com/apache/spark/pull/15249]

> Add Blacklisting of Executors & Nodes within one TaskSet
> 
>
> Key: SPARK-17675
> URL: https://issues.apache.org/jira/browse/SPARK-17675
> Project: Spark
>  Issue Type: Task
>  Components: Scheduler
>Affects Versions: 2.0.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
> Fix For: 2.1.0
>
>
> This is a step along the way to SPARK-8425 -- see the design doc on that jira 
> for a complete discussion of blacklisting.
> To enable incremental review, the first step proposed here is to expand the 
> blacklisting within tasksets.  In particular, this will enable blacklisting 
> for
> * (task, executor) pairs (this already exists via an undocumented config)
> * (task, node)
> * (taskset, executor)
> * (taskset, node)
> In particular, adding (task, node) is critical to making spark fault-tolerant 
> of one-bad disk in a cluster, without requiring careful tuning of 
> "spark.task.maxFailures".  The other additions are also important to avoid 
> many misleading task failures and long scheduling delays when there is one 
> bad node on a large cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12787) Dataset to support custom encoder

2016-10-12 Thread Aleksander Eskilson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15569928#comment-15569928
 ] 

Aleksander Eskilson edited comment on SPARK-12787 at 10/12/16 9:37 PM:
---

[~Zariel], I've put together an implementation of an Avro encoder that I'm 
currently submitting to the folks at spark-avro [1]. You can read the broader 
story in a thread [2] on their project github. Additionally, the writing of 
encoders for additional types of Java objects might be made easier after 
resolving SPARK-17770.

[1] - https://github.com/databricks/spark-avro
[2] - https://github.com/databricks/spark-avro/issues/169


was (Author: aeskilson):
[~Zariel], I've put together an implementation of an Avro encoder that I'm 
currently submitting to the folks at spark-avro [1]. You can read the broader 
story here in a thread [2] on their project github. Additionally, the writing 
of encoders for additional types of Java objects might be made easier after 
resolving SPARK-17770.

[1] - https://github.com/databricks/spark-avro
[2] - https://github.com/databricks/spark-avro/issues/169

> Dataset to support custom encoder
> -
>
> Key: SPARK-12787
> URL: https://issues.apache.org/jira/browse/SPARK-12787
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Muthu Jayakumar
>
> The current Dataset API allows a Dataset to be loaded using a case class, which 
> requires the attribute names and types to match up precisely.
> It would be nicer if a partial function could be provided as a parameter to 
> transform the DataFrame-like schema into a Dataset. 
> Something like...
> test_dataframe.as[TestCaseClass](partial_function)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12787) Dataset to support custom encoder

2016-10-12 Thread Aleksander Eskilson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15569928#comment-15569928
 ] 

Aleksander Eskilson commented on SPARK-12787:
-

[~Zariel], I've put together an implementation of an Avro encoder that I'm 
currently submitting to the folks at spark-avro [1]. You can read the broader 
story here in a thread [2] on their project github. Additionally, the writing 
of encoders for additional types of Java objects might be made easier after 
resolving SPARK-17770.

[1] - https://github.com/databricks/spark-avro
[2] - https://github.com/databricks/spark-avro/issues/169

> Dataset to support custom encoder
> -
>
> Key: SPARK-12787
> URL: https://issues.apache.org/jira/browse/SPARK-12787
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Muthu Jayakumar
>
> The current Dataset API allows a Dataset to be loaded using a case class, which 
> requires the attribute names and types to match up precisely.
> It would be nicer if a partial function could be provided as a parameter to 
> transform the DataFrame-like schema into a Dataset. 
> Something like...
> test_dataframe.as[TestCaseClass](partial_function)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16827) Stop reporting spill metrics as shuffle metrics

2016-10-12 Thread Sital Kedia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sital Kedia updated SPARK-16827:

Summary: Stop reporting spill metrics as shuffle metrics  (was: Query with 
Join produces excessive amount of shuffle data)

> Stop reporting spill metrics as shuffle metrics
> ---
>
> Key: SPARK-16827
> URL: https://issues.apache.org/jira/browse/SPARK-16827
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>  Labels: performance
>
> One of our hive job which looks like this -
> {code}
>  SELECT  userid
>  FROM  table1 a
>  JOIN table2 b
>   ONa.ds = '2016-07-15'
>   AND  b.ds = '2016-07-15'
>   AND  a.source_id = b.id
> {code}
> After upgrade to Spark 2.0 the job is significantly slow.  Digging a little 
> into it, we found out that one of the stages produces excessive amount of 
> shuffle data.  Please note that this is a regression from Spark 1.6. Stage 2 
> of the job which used to produce 32KB shuffle data with 1.6, now produces 
> more than 400GB with Spark 2.0. We also tried turning off whole stage code 
> generation but that did not help. 
> PS - Even if the intermediate shuffle data size is huge, the job still 
> produces accurate output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17834) Fetch the earliest offsets manually in KafkaSource instead of counting on KafkaConsumer

2016-10-12 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-17834:
-
Target Version/s: 2.0.2, 2.1.0

> Fetch the earliest offsets manually in KafkaSource instead of counting on 
> KafkaConsumer
> ---
>
> Key: SPARK-17834
> URL: https://issues.apache.org/jira/browse/SPARK-17834
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17770) Make ObjectType SQL Type Public

2016-10-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17770:


Assignee: (was: Apache Spark)

> Make ObjectType SQL Type Public
> ---
>
> Key: SPARK-17770
> URL: https://issues.apache.org/jira/browse/SPARK-17770
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Aleksander Eskilson
>
> Currently Catalyst supports encoding custom classes represented as Java Beans 
> (among others). This Java Bean implementation depends internally on 
> Catalyst’s ObjectType extension of DataType. Right now, this class is private 
> to the sql package [1]. However, its private scope makes it more difficult to 
> write full custom encoders for other classes, themselves perhaps composed of 
> additional objects.
> Opening this class as public will facilitate the writing of custom encoders.
> [1] -- 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/ObjectType.scala#L39



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17770) Make ObjectType SQL Type Public

2016-10-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17770:


Assignee: Apache Spark

> Make ObjectType SQL Type Public
> ---
>
> Key: SPARK-17770
> URL: https://issues.apache.org/jira/browse/SPARK-17770
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Aleksander Eskilson
>Assignee: Apache Spark
>
> Currently Catalyst supports encoding custom classes represented as Java Beans 
> (among others). This Java Bean implementation depends internally on 
> Catalyst’s ObjectType extension of DataType. Right now, this class is private 
> to the sql package [1]. However, its private scope makes it more difficult to 
> write full custom encoders for other classes, themselves perhaps composed of 
> additional objects.
> Opening this class as public will facilitate the writing of custom encoders.
> [1] -- 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/ObjectType.scala#L39



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17770) Make ObjectType SQL Type Public

2016-10-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15569840#comment-15569840
 ] 

Apache Spark commented on SPARK-17770:
--

User 'bdrillard' has created a pull request for this issue:
https://github.com/apache/spark/pull/15453

> Make ObjectType SQL Type Public
> ---
>
> Key: SPARK-17770
> URL: https://issues.apache.org/jira/browse/SPARK-17770
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Aleksander Eskilson
>
> Currently Catalyst supports encoding custom classes represented as Java Beans 
> (among others). This Java Bean implementation depends internally on 
> Catalyst’s ObjectType extension of DataType. Right now, this class is private 
> to the sql package [1]. However, its private scope makes it more difficult to 
> write full custom encoders for other classes, themselves perhaps composed of 
> additional objects.
> Opening this class as public will facilitate the writing of custom encoders.
> [1] -- 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/ObjectType.scala#L39



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17770) Make ObjectType SQL Type Public

2016-10-12 Thread Aleksander Eskilson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksander Eskilson updated SPARK-17770:

Affects Version/s: 2.0.0

> Make ObjectType SQL Type Public
> ---
>
> Key: SPARK-17770
> URL: https://issues.apache.org/jira/browse/SPARK-17770
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Aleksander Eskilson
>
> Currently Catalyst supports encoding custom classes represented as Java Beans 
> (among others). This Java Bean implementation depends internally on 
> Catalyst’s ObjectType extension of DataType. Right now, this class is private 
> to the sql package [1]. However, its private scope makes it more difficult to 
> write full custom encoders for other classes, themselves perhaps composed of 
> additional objects.
> Opening this class as public will facilitate the writing of custom encoders.
> [1] -- 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/ObjectType.scala#L39



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17770) Make ObjectType SQL Type Public

2016-10-12 Thread Aleksander Eskilson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksander Eskilson updated SPARK-17770:

Target Version/s:   (was: 2.0.2)

> Make ObjectType SQL Type Public
> ---
>
> Key: SPARK-17770
> URL: https://issues.apache.org/jira/browse/SPARK-17770
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Aleksander Eskilson
>
> Currently Catalyst supports encoding custom classes represented as Java Beans 
> (among others). This Java Bean implementation depends internally on 
> Catalyst’s ObjectType extension of DataType. Right now, this class is private 
> to the sql package [1]. However, its private scope makes it more difficult to 
> write full custom encoders for other classes, themselves perhaps composed of 
> additional objects.
> Opening this class as public will facilitate the writing of custom encoders.
> [1] -- 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/ObjectType.scala#L39



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17770) Make ObjectType SQL Type Public

2016-10-12 Thread Aleksander Eskilson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksander Eskilson updated SPARK-17770:

Priority: Major  (was: Critical)

> Make ObjectType SQL Type Public
> ---
>
> Key: SPARK-17770
> URL: https://issues.apache.org/jira/browse/SPARK-17770
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Reporter: Aleksander Eskilson
>
> Currently Catalyst supports encoding custom classes represented as Java Beans 
> (among others). This Java Bean implementation depends internally on 
> Catalyst’s ObjectType extension of DataType. Right now, this class is private 
> to the sql package [1]. However, its private scope makes it more difficult to 
> write full custom encoders for other classes, themselves perhaps composed of 
> additional objects.
> Opening this class as public will facilitate the writing of custom encoders.
> [1] -- 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/ObjectType.scala#L39



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17770) Make ObjectType SQL Type Public

2016-10-12 Thread Aleksander Eskilson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksander Eskilson updated SPARK-17770:

Priority: Critical  (was: Minor)

> Make ObjectType SQL Type Public
> ---
>
> Key: SPARK-17770
> URL: https://issues.apache.org/jira/browse/SPARK-17770
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Reporter: Aleksander Eskilson
>Priority: Critical
>
> Currently Catalyst supports encoding custom classes represented as Java Beans 
> (among others). This Java Bean implementation depends internally on 
> Catalyst’s ObjectType extension of DataType. Right now, this class is private 
> to the sql package [1]. However, its private scope makes it more difficult to 
> write full custom encoders for other classes, themselves perhaps composed of 
> additional objects.
> Opening this class as public will facilitate the writing of custom encoders.
> [1] -- 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/ObjectType.scala#L39



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12664) Expose raw prediction scores in MultilayerPerceptronClassificationModel

2016-10-12 Thread Guo-Xun Yuan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15569792#comment-15569792
 ] 

Guo-Xun Yuan commented on SPARK-12664:
--

I would also vote for this as an important feature. Also, is there any reason why 
MultilayerPerceptronClassifier is derived from `Predictor` and 
MultilayerPerceptronClassificationModel is derived from `PredictionModel` 
rather than `Classifier` and `ClassificationModel`?

Btw, the same thing also happens to GBTClassifier and GBTClassificationModel.

> Expose raw prediction scores in MultilayerPerceptronClassificationModel
> ---
>
> Key: SPARK-12664
> URL: https://issues.apache.org/jira/browse/SPARK-12664
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Robert Dodier
>
> In 
> org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel, 
> there isn't any way to get raw prediction scores; only an integer output 
> (from 0 to #classes - 1) is available via the `predict` method. 
> `mlpModel.predict` is called within the class to get the raw score, but 
> `mlpModel` is private so that isn't available to outside callers.
> The raw score is useful when the user wants to interpret the classifier 
> output as a probability. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12664) Expose raw prediction scores in MultilayerPerceptronClassificationModel

2016-10-12 Thread Guo-Xun Yuan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15569792#comment-15569792
 ] 

Guo-Xun Yuan edited comment on SPARK-12664 at 10/12/16 8:31 PM:


I would also vote for this as an important feature. Also, is there any reason why 
MultilayerPerceptronClassifier is derived from `Predictor` and 
MultilayerPerceptronClassificationModel is derived from `PredictionModel` 
rather than `Classifier` and `ClassificationModel`?

Btw, the same thing also happens to GBTClassifier and GBTClassificationModel.


was (Author: eaudex):
I would also vote this as an important features. Also, Is there any reason why 
MultilayerPerceptronClassifier is derived from `Predictor` and 
MultilayerPerceptronClassificationModel is derived from `PredictionModel` 
rather than `Classifier` and `ClassificationModel` ??

Btw, the same things also happen to GBTClassifier and GBTClassificationModel.

> Expose raw prediction scores in MultilayerPerceptronClassificationModel
> ---
>
> Key: SPARK-12664
> URL: https://issues.apache.org/jira/browse/SPARK-12664
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Robert Dodier
>
> In 
> org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel, 
> there isn't any way to get raw prediction scores; only an integer output 
> (from 0 to #classes - 1) is available via the `predict` method. 
> `mlpModel.predict` is called within the class to get the raw score, but 
> `mlpModel` is private so that isn't available to outside callers.
> The raw score is useful when the user wants to interpret the classifier 
> output as a probability. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17893) Window functions should also allow looking back in time

2016-10-12 Thread Raviteja Lokineni (JIRA)
Raviteja Lokineni created SPARK-17893:
-

 Summary: Window functions should also allow looking back in time
 Key: SPARK-17893
 URL: https://issues.apache.org/jira/browse/SPARK-17893
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.0.1
Reporter: Raviteja Lokineni


This function should allow looking back. The current window(timestamp, 
duration) seems to be for looking forward in time.

Example:
{code}dataFrame.groupBy(window("date", "7 days ago")).agg(min("col1"), 
max("col1")){code}

For example, if the date is 2013-01-07, then the window should be 2013-01-01 - 
2013-01-07.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17892) Query in CTAS is Optimized Twice (branch-2.0)

2016-10-12 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-17892:
-
Assignee: Xiao Li

> Query in CTAS is Optimized Twice (branch-2.0)
> -
>
> Key: SPARK-17892
> URL: https://issues.apache.org/jira/browse/SPARK-17892
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Yin Huai
>Assignee: Xiao Li
>Priority: Blocker
>
> This tracks the work that fixes the problem shown in  
> https://issues.apache.org/jira/browse/SPARK-17409 to branch 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17892) Query in CTAS is Optimized Twice (branch-2.0)

2016-10-12 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15569587#comment-15569587
 ] 

Yin Huai commented on SPARK-17892:
--

cc [~smilegator]

> Query in CTAS is Optimized Twice (branch-2.0)
> -
>
> Key: SPARK-17892
> URL: https://issues.apache.org/jira/browse/SPARK-17892
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Yin Huai
>Priority: Blocker
>
> This tracks the work that fixes the problem shown in  
> https://issues.apache.org/jira/browse/SPARK-17409 to branch 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17892) Query in CTAS is Optimized Twice (branch-2.0)

2016-10-12 Thread Yin Huai (JIRA)
Yin Huai created SPARK-17892:


 Summary: Query in CTAS is Optimized Twice (branch-2.0)
 Key: SPARK-17892
 URL: https://issues.apache.org/jira/browse/SPARK-17892
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1
Reporter: Yin Huai
Priority: Blocker


This tracks the work that fixes the problem shown in  
https://issues.apache.org/jira/browse/SPARK-17409 to branch 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17863) SELECT distinct does not work if there is a order by clause

2016-10-12 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-17863:
-
Priority: Blocker  (was: Critical)

> SELECT distinct does not work if there is a order by clause
> ---
>
> Key: SPARK-17863
> URL: https://issues.apache.org/jira/browse/SPARK-17863
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
>  Labels: correctness
>
> {code}
> select distinct struct.a, struct.b
> from (
>   select named_struct('a', 1, 'b', 2, 'c', 3) as struct
>   union all
>   select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp
> order by struct.a, struct.b
> {code}
> This query generates
> {code}
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  2|
> |  1|  2|
> +---+---+
> {code}
> The plan is wrong because the analyzer somehow added {{struct#21805}} to the 
> project list, which changes the semantic of the distinct (basically, the 
> query is changed to {{select distinct struct.a, struct.b, struct}} from 
> {{select distinct struct.a, struct.b}}).
> {code}
> == Parsed Logical Plan ==
> 'Sort ['struct.a ASC, 'struct.b ASC], true
> +- 'Distinct
>+- 'Project ['struct.a, 'struct.b]
>   +- 'SubqueryAlias tmp
>  +- 'Union
> :- 'Project ['named_struct(a, 1, b, 2, c, 3) AS struct#21805]
> :  +- OneRowRelation$
> +- 'Project ['named_struct(a, 1, b, 2, c, 4) AS struct#21806]
>+- OneRowRelation$
> == Analyzed Logical Plan ==
> a: int, b: int
> Project [a#21819, b#21820]
> +- Sort [struct#21805.a ASC, struct#21805.b ASC], true
>+- Distinct
>   +- Project [struct#21805.a AS a#21819, struct#21805.b AS b#21820, 
> struct#21805]
>  +- SubqueryAlias tmp
> +- Union
>:- Project [named_struct(a, 1, b, 2, c, 3) AS struct#21805]
>:  +- OneRowRelation$
>+- Project [named_struct(a, 1, b, 2, c, 4) AS struct#21806]
>   +- OneRowRelation$
> == Optimized Logical Plan ==
> Project [a#21819, b#21820]
> +- Sort [struct#21805.a ASC, struct#21805.b ASC], true
>+- Aggregate [a#21819, b#21820, struct#21805], [a#21819, b#21820, 
> struct#21805]
>   +- Union
>  :- Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS struct#21805]
>  :  +- OneRowRelation$
>  +- Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS struct#21806]
> +- OneRowRelation$
> == Physical Plan ==
> *Project [a#21819, b#21820]
> +- *Sort [struct#21805.a ASC, struct#21805.b ASC], true, 0
>+- Exchange rangepartitioning(struct#21805.a ASC, struct#21805.b ASC, 200)
>   +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], functions=[], 
> output=[a#21819, b#21820, struct#21805])
>  +- Exchange hashpartitioning(a#21819, b#21820, struct#21805, 200)
> +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], 
> functions=[], output=[a#21819, b#21820, struct#21805])
>+- Union
>   :- *Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS 
> struct#21805]
>   :  +- Scan OneRowRelation[]
>   +- *Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS 
> struct#21806]
>  +- Scan OneRowRelation[]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17891) SQL-based three column join loses first column

2016-10-12 Thread Eli Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Miller updated SPARK-17891:
---
Attachment: test.tgz

A simple maven project that contains TripleJoin.java and sample data files 
which illustrate the problem. Requires Java 1.8.

> SQL-based three column join loses first column
> --
>
> Key: SPARK-17891
> URL: https://issues.apache.org/jira/browse/SPARK-17891
> Project: Spark
>  Issue Type: Question
>Affects Versions: 2.0.1
>Reporter: Eli Miller
> Attachments: test.tgz
>
>
> Hi all,
> I hope that this is not a known issue (I haven't had any luck finding 
> anything similar in Jira or the mailing lists but I could be searching with 
> the wrong terms). I just started to experiment with Spark SQL and am seeing 
> what appears to be a bug. When using Spark SQL to join two tables with a 
> three-column inner join, the first join column is ignored. The example code 
> that I have starts with two tables *T1*:
> {noformat}
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> |  1|  2|  3|  4|
> +---+---+---+---+
> {noformat}
> and *T2*:
> {noformat}
> +---+---+---+---+
> |  b|  c|  d|  e|
> +---+---+---+---+
> |  2|  3|  4|  5|
> | -2|  3|  4|  6|
> |  2| -3|  4|  7|
> +---+---+---+---+
> {noformat}
> Joining *T1* to *T2* on *b*, *c* and *d* (in that order):
> {code:sql}
> SELECT t1.a, t1.b, t2.b, t1.c,t2.c, t1.d, t2.d, t2.e
> FROM t1, t2
> WHERE t1.b = t2.b AND t1.c = t2.c AND t1.d = t2.d
> {code}
> results in the following (note that *T1.b* != *T2.b* in the first row):
> {noformat}
> +---+---+---+---+---+---+---+---+
> |  a|  b|  b|  c|  c|  d|  d|  e|
> +---+---+---+---+---+---+---+---+
> |  1|  2| -2|  3|  3|  4|  4|  6|
> |  1|  2|  2|  3|  3|  4|  4|  5|
> +---+---+---+---+---+---+---+---+
> {noformat}
> Switching the predicate order to *c*, *b* and *d*:
> {code:sql}
> SELECT t1.a, t1.b, t2.b, t1.c,t2.c, t1.d, t2.d, t2.e
> FROM t1, t2
> WHERE t1.c = t2.c AND t1.b = t2.b AND t1.d = t2.d
> {code}
> yields different results (now *T1.c* != *T2.c* in the first row):
> {noformat}
> +---+---+---+---+---+---+---+---+
> |  a|  b|  b|  c|  c|  d|  d|  e|
> +---+---+---+---+---+---+---+---+
> |  1|  2|  2|  3| -3|  4|  4|  7|
> |  1|  2|  2|  3|  3|  4|  4|  5|
> +---+---+---+---+---+---+---+---+
> {noformat}
> Is this expected?
> I started to research this a bit and one thing that jumped out at me was the 
> ordering of the HashedRelationBroadcastMode concatenation in the plan (this 
> is from the *b*, *c*, *d* ordering):
> {noformat}
> ...
> *Project [a#0, b#1, b#9, c#2, c#10, d#3, d#11, e#12]
> +- *BroadcastHashJoin [b#1, c#2, d#3], [b#9, c#10, d#11], Inner, BuildRight
>:- *Project [a#0, b#1, c#2, d#3]
>:  +- *Filter ((isnotnull(b#1) && isnotnull(c#2)) && isnotnull(d#3))
>: +- *Scan csv [a#0,b#1,c#2,d#3] Format: CSV, InputPaths: 
> file:/home/eli/git/IENG/what/target/classes/t1.csv, PartitionFilters: [], 
> PushedFilters: [IsNotNull(b), IsNotNull(c), IsNotNull(d)], ReadSchema: 
> struct
>+- BroadcastExchange 
> HashedRelationBroadcastMode(List((shiftleft((shiftleft(cast(input[0, int, 
> true] as bigint), 32) | (cast(input[1, int, true] as bigint) & 4294967295)), 
> 32) | (cast(input[2, int, true] as bigint) & 4294967295
>   +- *Project [b#9, c#10, d#11, e#12]
>  +- *Filter ((isnotnull(c#10) && isnotnull(b#9)) && isnotnull(d#11))
> +- *Scan csv [b#9,c#10,d#11,e#12] Format: CSV, InputPaths: 
> file:/home/eli/git/IENG/what/target/classes/t2.csv, PartitionFilters: [], 
> PushedFilters: [IsNotNull(c), IsNotNull(b), IsNotNull(d)], ReadSchema: 
> struct]
> {noformat}
> If this concatenated byte array is ever truncated to 64 bits in a comparison, 
> the leading column will be lost and could result in this behavior.
> I will attach my example code and data. Please let me know if I can provide 
> any other details.
> Best regards,
> Eli



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17891) SQL-based three column join loses first column

2016-10-12 Thread Eli Miller (JIRA)
Eli Miller created SPARK-17891:
--

 Summary: SQL-based three column join loses first column
 Key: SPARK-17891
 URL: https://issues.apache.org/jira/browse/SPARK-17891
 Project: Spark
  Issue Type: Question
Affects Versions: 2.0.1
Reporter: Eli Miller


Hi all,

I hope that this is not a known issue (I haven't had any luck finding anything 
similar in Jira or the mailing lists but I could be searching with the wrong 
terms). I just started to experiment with Spark SQL and am seeing what appears 
to be a bug. When using Spark SQL to join two tables with a three-column inner 
join, the first join column is ignored. The example code that I have starts 
with two tables *T1*:

{noformat}
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  1|  2|  3|  4|
+---+---+---+---+
{noformat}

and *T2*:

{noformat}
+---+---+---+---+
|  b|  c|  d|  e|
+---+---+---+---+
|  2|  3|  4|  5|
| -2|  3|  4|  6|
|  2| -3|  4|  7|
+---+---+---+---+
{noformat}

Joining *T1* to *T2* on *b*, *c* and *d* (in that order):

{code:sql}
SELECT t1.a, t1.b, t2.b, t1.c,t2.c, t1.d, t2.d, t2.e
FROM t1, t2
WHERE t1.b = t2.b AND t1.c = t2.c AND t1.d = t2.d
{code}

results in the following (note that *T1.b* != *T2.b* in the first row):

{noformat}
+---+---+---+---+---+---+---+---+
|  a|  b|  b|  c|  c|  d|  d|  e|
+---+---+---+---+---+---+---+---+
|  1|  2| -2|  3|  3|  4|  4|  6|
|  1|  2|  2|  3|  3|  4|  4|  5|
+---+---+---+---+---+---+---+---+
{noformat}

Switching the predicate order to *c*, *b* and *d*:

{code:sql}
SELECT t1.a, t1.b, t2.b, t1.c,t2.c, t1.d, t2.d, t2.e
FROM t1, t2
WHERE t1.c = t2.c AND t1.b = t2.b AND t1.d = t2.d
{code}

yields different results (now *T1.c* != *T2.c* in the first row):

{noformat}
+---+---+---+---+---+---+---+---+
|  a|  b|  b|  c|  c|  d|  d|  e|
+---+---+---+---+---+---+---+---+
|  1|  2|  2|  3| -3|  4|  4|  7|
|  1|  2|  2|  3|  3|  4|  4|  5|
+---+---+---+---+---+---+---+---+
{noformat}

Is this expected?

I started to research this a bit and one thing that jumped out at me was the 
ordering of the HashedRelationBroadcastMode concatenation in the plan (this is 
from the *b*, *c*, *d* ordering):

{noformat}
...
*Project [a#0, b#1, b#9, c#2, c#10, d#3, d#11, e#12]
+- *BroadcastHashJoin [b#1, c#2, d#3], [b#9, c#10, d#11], Inner, BuildRight
   :- *Project [a#0, b#1, c#2, d#3]
   :  +- *Filter ((isnotnull(b#1) && isnotnull(c#2)) && isnotnull(d#3))
   : +- *Scan csv [a#0,b#1,c#2,d#3] Format: CSV, InputPaths: 
file:/home/eli/git/IENG/what/target/classes/t1.csv, PartitionFilters: [], 
PushedFilters: [IsNotNull(b), IsNotNull(c), IsNotNull(d)], ReadSchema: 
struct
   +- BroadcastExchange 
HashedRelationBroadcastMode(List((shiftleft((shiftleft(cast(input[0, int, true] 
as bigint), 32) | (cast(input[1, int, true] as bigint) & 4294967295)), 32) | 
(cast(input[2, int, true] as bigint) & 4294967295
  +- *Project [b#9, c#10, d#11, e#12]
 +- *Filter ((isnotnull(c#10) && isnotnull(b#9)) && isnotnull(d#11))
+- *Scan csv [b#9,c#10,d#11,e#12] Format: CSV, InputPaths: 
file:/home/eli/git/IENG/what/target/classes/t2.csv, PartitionFilters: [], 
PushedFilters: [IsNotNull(c), IsNotNull(b), IsNotNull(d)], ReadSchema: 
struct]
{noformat}

If this concatenated byte array is ever truncated to 64 bits in a comparison, 
the leading column will be lost and could result in this behavior.
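
To illustrate that hypothesis (this is only an illustration of the suspected truncation, 
not a confirmed root cause; the pack helper below is my own, not Spark code), packing 
three 32-bit keys with the two shift-left-by-32/OR steps shown in the plan cannot fit 
into a single 64-bit long, so the leading key stops influencing the result:

{code}
// t1 row:     b = 2,  c = 3, d = 4
// t2 bad row: b = -2, c = 3, d = 4   (the unexpected match in the first result set)
def pack(b: Long, c: Long, d: Long): Long =
  (((b << 32) | (c & 0xFFFFFFFFL)) << 32) | (d & 0xFFFFFFFFL)  // mirrors the plan's shiftleft/OR chain

val keyT1 = pack(2L, 3L, 4L)
val keyT2 = pack(-2L, 3L, 4L)

println(keyT1 == keyT2)  // true: after truncation to 64 bits, column b no longer affects the key
{code}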

I will attach my example code and data. Please let me know if I can provide any 
other details.

Best regards,
Eli



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-12 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15569493#comment-15569493
 ] 

Kazuaki Ishizaki commented on SPARK-16845:
--

Thank you for preparing the case. I noticed that the following small code can 
reproduce the same exception:

{code}
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering

val sortOrder = Literal("abc").asc
GenerateOrdering.generate(Array.fill(450)(sortOrder))
{code}

It fails with:
{code}
Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
"compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
 of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
grows beyond 64 KB
{code}

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table (400 columns); when I try fitting the training data on all 
> columns, the fatal error below occurs. 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-12 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-16845:
-
Comment: was deleted

(was: Thank you for preparing the case. I noticed that the following small code 
can reproduce the same exception.
{code}
Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
"compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
 of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
grows beyond 64 KB
{code}

{code}
val sortOrder = Literal("abc").asc
GenerateOrdering.generate(Array.fill(450)(sortOrder))
{code})

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table (400 columns); when I try fitting the training data on all 
> columns, the fatal error below occurs. 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-12 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15569489#comment-15569489
 ] 

Kazuaki Ishizaki commented on SPARK-16845:
--

Thank you for preparing the case. I noticed that the following small code can 
reproduce the same exception:

{code}
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering

val sortOrder = Literal("abc").asc
GenerateOrdering.generate(Array.fill(450)(sortOrder))
{code}

It fails with:
{code}
Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
"compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
 of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
grows beyond 64 KB
{code}

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table (400 columns); when I try fitting the training data on all 
> columns, the fatal error below occurs. 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17840) Add some pointers for wiki/CONTRIBUTING.md in README.md and some warnings in PULL_REQUEST_TEMPLATE

2016-10-12 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17840.
-
   Resolution: Fixed
 Assignee: Sean Owen
Fix Version/s: 2.1.0

> Add some pointers for wiki/CONTRIBUTING.md in README.md and some warnings in 
> PULL_REQUEST_TEMPLATE
> --
>
> Key: SPARK-17840
> URL: https://issues.apache.org/jira/browse/SPARK-17840
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Hyukjin Kwon
>Assignee: Sean Owen
>Priority: Trivial
> Fix For: 2.1.0
>
>
> It seems it'd be better to give our best effort to describing the guides for 
> creating JIRAs and PRs, as there have been several discussions about this.
> Maybe,
>  - We could add some pointers to {{CONTRIBUTING.md}} or the wiki[1] in README.md 
>  - Add an explicit warning in PULL_REQUEST_TEMPLATE[2], in particular, for 
> some PRs trying to merge a branch to another branch, for example as below:
> {quote}
> Please double check if your pull request is from a branch to a branch. It is 
> probably wrong if the pull request title is "Branch 2.0" or "Merge ..." or if 
> your pull request includes many commit logs unrelated to your changes to 
> propose. 
> {quote}
> Please refer the mailing list[3].
> [1] https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
> [2] https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE
> [3] 
> http://mail-archives.apache.org/mod_mbox/spark-dev/201610.mbox/%3CCAMAsSdKYEvvU%2BPdZ0oFDsvwod6wRRavqRyYmjc7qYhEp7%2B_3eg%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17790) Support for parallelizing R data.frame larger than 2GB

2016-10-12 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-17790.
--
   Resolution: Fixed
 Assignee: Hossein Falaki
Fix Version/s: 2.1.0
   2.0.2

> Support for parallelizing R data.frame larger than 2GB
> --
>
> Key: SPARK-17790
> URL: https://issues.apache.org/jira/browse/SPARK-17790
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>Assignee: Hossein Falaki
> Fix For: 2.0.2, 2.1.0
>
>
> This issue is a more specific version of SPARK-17762. 
> Supporting larger than 2GB arguments is more general and arguably harder to 
> do because the limit exists both in R and JVM (because we receive data as a 
> ByteArray). However, to support parallelizing R data.frames that are larger 
> than 2GB we can do what PySpark does.
> PySpark uses files to transfer bulk data between Python and JVM. It has 
> worked well for the large community of Spark Python users. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17814) spark submit arguments are truncated in yarn-cluster mode

2016-10-12 Thread shreyas subramanya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15569349#comment-15569349
 ] 

shreyas subramanya commented on SPARK-17814:


yarn version is 2.6.4

> spark submit arguments are truncated in yarn-cluster mode
> -
>
> Key: SPARK-17814
> URL: https://issues.apache.org/jira/browse/SPARK-17814
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, YARN
>Affects Versions: 1.6.1
>Reporter: shreyas subramanya
>Priority: Minor
>
> {noformat}
> One of our customers is trying to pass in json through spark-submit as 
> follows:
> spark-submit --verbose --class SimpleClass --master yarn-cluster ./simple.jar 
> "{\"mode\":\"wf\", \"arrays\":{\"array\":[1]}}"
> The application reports the passed arguments as: {"mode":"wf", 
> "arrays":{"array":[1]
> If the same application is submitted in yarn-client mode, as follows:
> spark-submit --verbose --class SimpleClass --master yarn-client ./simple.jar 
> "{\"mode\":\"wf\", \"arrays\":{\"array\":[1]}}"
> The application reports the passed args as: {"mode":"wf", 
> "arrays":{"array":[1]}}
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17781) datetime is serialized as double inside dapply()

2016-10-12 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15569325#comment-15569325
 ] 

Felix Cheung commented on SPARK-17781:
--

Thanks for the investigation. This seems like it might be an R thing; I'll try to 
reproduce this in R. 

> datetime is serialized as double inside dapply()
> 
>
> Key: SPARK-17781
> URL: https://issues.apache.org/jira/browse/SPARK-17781
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> When we ship a SparkDataFrame to workers for the dapply family of functions, 
> DateTime objects are serialized as doubles inside the worker.
> To reproduce:
> {code}
> df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
> dapplyCollect(df, function(x) { return(x$date) })
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17884) In the cast expression, casting from empty string to interval type throws NullPointerException

2016-10-12 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17884.
-
   Resolution: Fixed
 Assignee: Priyanka Garg
Fix Version/s: 2.1.0
   2.0.2

> In the cast expression, casting from empty string to interval type throws 
> NullPointerException
> --
>
> Key: SPARK-17884
> URL: https://issues.apache.org/jira/browse/SPARK-17884
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Priyanka Garg
>Assignee: Priyanka Garg
> Fix For: 2.0.2, 2.1.0
>
>
> When the cast expression is applied to an empty string "" to cast it to interval 
> type, it throws a NullPointerException.
> I get the same exception when reproducing it through the test case
> checkEvaluation(Cast(Literal(""), CalendarIntervalType), null)
> Exception i am getting is:
> java.lang.NullPointerException was thrown.
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:254)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper$class.checkEvalutionWithUnsafeProjection(ExpressionEvalHelper.scala:181)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastSuite.checkEvalutionWithUnsafeProjection(CastSuite.scala:33)
>   at 
> org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper$class.checkEvaluation(ExpressionEvalHelper.scala:64)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastSuite.checkEvaluation(CastSuite.scala:33)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastSuite$$anonfun$22.apply$mcV$sp(CastSuite.scala:770)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastSuite$$anonfun$22.apply(CastSuite.scala:767)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastSuite$$anonfun$22.apply(CastSuite.scala:767)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:57)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:29)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
>   at 

[jira] [Resolved] (SPARK-14761) PySpark DataFrame.join should reject invalid join methods even when join columns are not specified

2016-10-12 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14761.
-
   Resolution: Fixed
 Assignee: Bijay Kumar Pathak
Fix Version/s: 2.1.0

> PySpark DataFrame.join should reject invalid join methods even when join 
> columns are not specified
> --
>
> Key: SPARK-14761
> URL: https://issues.apache.org/jira/browse/SPARK-14761
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Josh Rosen
>Assignee: Bijay Kumar Pathak
>Priority: Minor
>  Labels: starter
> Fix For: 2.1.0
>
>
> In PySpark, the following invalid DataFrame join will not result in an error:
> {code}
> df1.join(df2, how='not-a-valid-join-type')
> {code}
> The signature for `join` is
> {code}
> def join(self, other, on=None, how=None):
> {code}
> and its code ends up completely skipping handling of the `how` parameter when 
> `on` is `None`:
> {code}
>  if on is not None and not isinstance(on, list):
> on = [on]
> if on is None or len(on) == 0:
> jdf = self._jdf.join(other._jdf)
> elif isinstance(on[0], basestring):
> if how is None:
> jdf = self._jdf.join(other._jdf, self._jseq(on), "inner")
> else:
> assert isinstance(how, basestring), "how should be basestring"
> jdf = self._jdf.join(other._jdf, self._jseq(on), how)
> else:
> {code}
> Given that this behavior can mask user errors (as in the above example), I 
> think that we should refactor this to first process all arguments and then 
> call the three-argument {{_jdf.join}}. This would handle the above invalid 
> example by passing all arguments to the JVM DataFrame for analysis.
> I'm not planning to work on this myself, so this bugfix (+ regression test!) 
> is up for grabs in case someone else wants to do it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-12 Thread K (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15569265#comment-15569265
 ] 

K commented on SPARK-16845:
---

Code and data are also available here: 
https://issues.apache.org/jira/browse/SPARK-17223
Thanks for looking into it!

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table (400 columns); when I try fitting the training data on all 
> columns, the fatal error below occurs. 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17890) scala.ScalaReflectionException

2016-10-12 Thread Khalid Reid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Khalid Reid updated SPARK-17890:

Description: 
Hello,

I am seeing an error message in spark-shell when I map a DataFrame to a 
Seq\[Foo\].  However, things work fine when I use flatMap.  

{noformat}
scala> case class Foo(value:String)
defined class Foo

scala> val df = sc.parallelize(List(1,2,3)).toDF
df: org.apache.spark.sql.DataFrame = [value: int]

scala> df.map{x => Seq.empty[Foo]}
scala.ScalaReflectionException: object $line14.$read not found.
  at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:162)
  at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:22)
  at $typecreator1$1.apply(<console>:29)
  at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
  at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
  at 
org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
  at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
  at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
  at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
  at 
org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125)
  ... 48 elided

scala> df.flatMap{_ => Seq.empty[Foo]} //flatMap works
res2: org.apache.spark.sql.Dataset[Foo] = [value: string]

{noformat}

I am seeing the same error reported 
[here|https://issues.apache.org/jira/browse/SPARK-8465?jql=text%20~%20%22scala.ScalaReflectionException%22]
 when I use spark-submit.

I am new to Spark but I don't expect this to throw an exception.

Thanks.





  was:
Hello,

I am seeing a very cryptic error message in spark-shell when I map a DataFrame 
to a Seq\[Foo\].  However, things work fine when I use flatMap.  

{noformat}
scala> case class Foo(value:String)
defined class Foo

scala> val df = sc.parallelize(List(1,2,3)).toDF
df: org.apache.spark.sql.DataFrame = [value: int]

scala> df.map{x => Seq.empty[Foo]}
scala.ScalaReflectionException: object $line14.$read not found.
  at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:162)
  at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:22)
  at $typecreator1$1.apply(<console>:29)
  at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
  at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
  at 
org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
  at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
  at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
  at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
  at 
org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125)
  ... 48 elided

scala> df.flatMap{_ => Seq.empty[Foo]} //flatMap works
res2: org.apache.spark.sql.Dataset[Foo] = [value: string]

{noformat}

I am seeing the same error reported 
[here|https://issues.apache.org/jira/browse/SPARK-8465?jql=text%20~%20%22scala.ScalaReflectionException%22]
 when I use spark-submit.

I am new to Spark but I don't expect this to throw an exception.

Thanks.






> scala.ScalaReflectionException
> --
>
> Key: SPARK-17890
> URL: https://issues.apache.org/jira/browse/SPARK-17890
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: x86_64 GNU/Linux
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
>Reporter: Khalid Reid
>Priority: Minor
>  Labels: newbie
>
> Hello,
> I am seeing an error message in spark-shell when I map a DataFrame to a 
> Seq\[Foo\].  However, things work fine when I use flatMap.  
> {noformat}
> scala> case class Foo(value:String)
> defined class Foo
> scala> val df = sc.parallelize(List(1,2,3)).toDF
> df: org.apache.spark.sql.DataFrame = [value: int]
> scala> df.map{x => Seq.empty[Foo]}
> scala.ScalaReflectionException: object $line14.$read not found.
>   at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:162)
>   at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:22)
>   at $typecreator1$1.apply(<console>:29)
>   at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
>   at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
>   at 
> org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
>   at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
>   at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
>   at 
> 

[jira] [Updated] (SPARK-17890) scala.ScalaReflectionException

2016-10-12 Thread Khalid Reid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Khalid Reid updated SPARK-17890:

Description: 
Hello,

I am seeing a very cryptic error message in spark-shell when I map a DataFrame 
to a Seq\[Foo\].  However, things work fine when I use flatMap.  

{noformat}
scala> case class Foo(value:String)
defined class Foo

scala> val df = sc.parallelize(List(1,2,3)).toDF
df: org.apache.spark.sql.DataFrame = [value: int]

scala> df.map{x => Seq.empty[Foo]}
scala.ScalaReflectionException: object $line14.$read not found.
  at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:162)
  at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:22)
  at $typecreator1$1.apply(<console>:29)
  at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
  at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
  at 
org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
  at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
  at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
  at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
  at 
org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125)
  ... 48 elided

scala> df.flatMap{_ => Seq.empty[Foo]} //flatMap works
res2: org.apache.spark.sql.Dataset[Foo] = [value: string]

{noformat}

I am seeing the same error reported 
[here|https://issues.apache.org/jira/browse/SPARK-8465?jql=text%20~%20%22scala.ScalaReflectionException%22]
 when I use spark-submit.

I am new to Spark but I don't expect this to throw an exception.

Thanks.





  was:
Hello,

I am seeing a very cryptic error message in spark-shell when I map a DataFrame 
to a Seq\[Foo\].  However, things work fine when I use flatMap.  

{noformat}
scala> case class Foo(value:String)
defined class Foo

scala> val df = sc.parallelize(List(1,2,3)).toDF
df: org.apache.spark.sql.DataFrame = [value: int]

scala> df.map{x => Seq.empty[Foo]}
scala.ScalaReflectionException: object $line14.$read not found.
  at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:162)
  at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:22)
  at $typecreator1$1.apply(<console>:29)
  at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
  at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
  at 
org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
  at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
  at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
  at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
  at 
org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125)
  ... 48 elided

scala> df.flatMap{_ => Seq.empty[Foo]} //flatMap works
res2: org.apache.spark.sql.Dataset[Foo] = [value: string]

{noformat}

I am seeing the same error reported 
[here|https://issues.apache.org/jira/browse/SPARK-8465?jql=text%20~%20%22scala.ScalaReflectionException%22]
 when I use spark-submit.

I am new to Spark, so if this is invalid code (though I don't see why it would 
be) then I'd expect to see a better error message.






> scala.ScalaReflectionException
> --
>
> Key: SPARK-17890
> URL: https://issues.apache.org/jira/browse/SPARK-17890
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: x86_64 GNU/Linux
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
>Reporter: Khalid Reid
>Priority: Minor
>  Labels: newbie
>
> Hello,
> I am seeing a very cryptic error message in spark-shell when I map a 
> DataFrame to a Seq\[Foo\].  However, things work fine when I use flatMap.  
> {noformat}
> scala> case class Foo(value:String)
> defined class Foo
> scala> val df = sc.parallelize(List(1,2,3)).toDF
> df: org.apache.spark.sql.DataFrame = [value: int]
> scala> df.map{x => Seq.empty[Foo]}
> scala.ScalaReflectionException: object $line14.$read not found.
>   at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:162)
>   at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:22)
>   at $typecreator1$1.apply(<console>:29)
>   at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
>   at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
>   at 
> org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
>   at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
>   at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
>   at 
> 

[jira] [Comment Edited] (SPARK-17883) Possible typo in comments of Row.scala

2016-10-12 Thread Weiluo Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15569203#comment-15569203
 ] 

Weiluo Ren edited comment on SPARK-17883 at 10/12/16 5:01 PM:
--

Sure, I will create a JIRA for this typo and others related (if any).

Update: I scanned through class Row and its subclasses and found no related doc 
issues, so I will just create a tiny PR for this single fix.


was (Author: weiluo_ren123):
Sure, I will create a JIRA for this typo and others related (if any).

> Possible typo in comments of Row.scala
> --
>
> Key: SPARK-17883
> URL: https://issues.apache.org/jira/browse/SPARK-17883
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Reporter: Weiluo Ren
>Priority: Trivial
>
> The description of the private method
> {code}private def getAnyValAs[T <: AnyVal](i: Int): T{code}
> says "Returns the value of a given fieldName." on 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala#L465
> It should be "Returns the value at position i." instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17890) scala.ScalaReflectionException

2016-10-12 Thread Khalid Reid (JIRA)
Khalid Reid created SPARK-17890:
---

 Summary: scala.ScalaReflectionException
 Key: SPARK-17890
 URL: https://issues.apache.org/jira/browse/SPARK-17890
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.1
 Environment: x86_64 GNU/Linux
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Reporter: Khalid Reid
Priority: Minor


Hello,

I am seeing a very cryptic error message in spark-shell when I map a DataFrame 
to a Seq\[Foo\].  However, things work fine when I use flatMap.  

{noformat}
scala> case class Foo(value:String)
defined class Foo

scala> val df = sc.parallelize(List(1,2,3)).toDF
df: org.apache.spark.sql.DataFrame = [value: int]

scala> df.map{x => Seq.empty[Foo]}
scala.ScalaReflectionException: object $line14.$read not found.
  at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:162)
  at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:22)
  at $typecreator1$1.apply(<console>:29)
  at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
  at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
  at 
org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
  at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
  at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
  at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
  at 
org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125)
  ... 48 elided

scala> df.flatMap{_ => Seq.empty[Foo]} //flatMap works
res2: org.apache.spark.sql.Dataset[Foo] = [value: string]

{noformat}

I am seeing the same error reported 
[here|https://issues.apache.org/jira/browse/SPARK-8465?jql=text%20~%20%22scala.ScalaReflectionException%22]
 when I use spark-submit.

I am new to Spark, so if this is invalid code (though I don't see why it would 
be) then I'd expect to see a better error message.
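
Not part of the report, but the stack above fails while deriving an implicit Encoder[Seq[Foo]] through newProductSeqEncoder, whereas flatMap only needs an Encoder[Foo]. One untested way to sidestep the failing derivation in such a session is to pass an encoder for the map result explicitly (Encoders.kryo stores the values in binary form, so this is a diagnostic workaround, not a fix):

{code}
import org.apache.spark.sql.{Encoder, Encoders}

case class Foo(value: String)
val df = sc.parallelize(List(1, 2, 3)).toDF

// Supply the encoder explicitly instead of relying on the implicit
// newProductSeqEncoder derivation that the exception above comes from.
val seqFooEncoder: Encoder[Seq[Foo]] = Encoders.kryo[Seq[Foo]]
val ds = df.map(_ => Seq.empty[Foo])(seqFooEncoder)
{code}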







--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17883) Possible typo in comments of Row.scala

2016-10-12 Thread Weiluo Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15569203#comment-15569203
 ] 

Weiluo Ren commented on SPARK-17883:


Sure, I will create a JIRA for this typo and others related (if any).

> Possible typo in comments of Row.scala
> --
>
> Key: SPARK-17883
> URL: https://issues.apache.org/jira/browse/SPARK-17883
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Reporter: Weiluo Ren
>Priority: Trivial
>
> The description of the private method
> {code}private def getAnyValAs[T <: AnyVal](i: Int): T{code}
> says "Returns the value of a given fieldName." on 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala#L465
> It should be "Returns the value at position i." instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17827) StatisticsColumnSuite failures on big endian platforms

2016-10-12 Thread Pete Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15569048#comment-15569048
 ] 

Pete Robbins commented on SPARK-17827:
--

So it looks like the max field is written into the UnsafeRow as an Int but is 
later read back as a Long. The call stack to the write is below, and a sketch of 
the suspected mismatch follows it:

java.lang.Thread.dumpStack(Thread.java:462)
at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:149)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$1.apply(AggregationIterator.scala:232)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$1.apply(AggregationIterator.scala:221)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:392)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:79)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator.foreach(AggregationIterator.scala:35)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator.to(AggregationIterator.scala:35)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator.toBuffer(AggregationIterator.scala:35)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator.toArray(AggregationIterator.scala:35)
at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:912)
at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:912)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1927)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1927)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.lang.Thread.run(Thread.java:785)
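
A sketch of the suspected mismatch, with hypothetical variable names and ordinal (nothing here is taken from the generated code):

{code}
// Writer side (generated projection): the max statistic is written with the
// column's narrow type, so only 4 of the field's 8 bytes carry the value.
unsafeRowWriter.write(ordinal, maxAsInt)     // UnsafeRowWriter.write(int, int)

// Reader side: the statistics code reads the same field back as a Long.
val max = unsafeRow.getLong(ordinal)         // reads the full 8-byte slot

// On little-endian platforms the Int occupies the low-order bytes of the slot
// and the zeroed padding reads back as high-order zeros, so the value happens
// to survive. On big-endian platforms the same bytes read back shifted into
// the high-order half, which would explain failures that only show up there.
{code}

If that is what is happening for the max/min statistics, making the write and the read agree on the field width (either both Int or both Long) should make these tests endian-neutral.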

> StatisticsColumnSuite failures on big endian platforms
> --
>
> Key: SPARK-17827
> URL: https://issues.apache.org/jira/browse/SPARK-17827
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: big endian
>Reporter: Pete Robbins
>  Labels: big-endian
>
> https://issues.apache.org/jira/browse/SPARK-17073 
> introduces new tests and functionality that fail on big endian platforms
> Failing tests:
>  org.apache.spark.sql.StatisticsColumnSuite.column-level statistics for 
> string column
>  org.apache.spark.sql.StatisticsColumnSuite.column-level statistics for 
> binary column
>  org.apache.spark.sql.StatisticsColumnSuite.column-level statistics for 
> columns with different types
>  org.apache.spark.sql.hive.StatisticsSuite.generate column-level statistics 
> and load them from hive metastore
> all fail in checkColStat, e.g.: 
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:156)
>   at 
> org.apache.spark.sql.StatisticsTest$.checkColStat(StatisticsTest.scala:92)
>   at 
> org.apache.spark.sql.StatisticsTest$$anonfun$checkColStats$1$$anonfun$apply$mcV$sp$1.apply(StatisticsTest.scala:43)
>   at 
> org.apache.spark.sql.StatisticsTest$$anonfun$checkColStats$1$$anonfun$apply$mcV$sp$1.apply(StatisticsTest.scala:40)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.StatisticsTest$$anonfun$checkColStats$1.apply$mcV$sp(StatisticsTest.scala:40)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTable(SQLTestUtils.scala:168)
>   at 
> org.apache.spark.sql.StatisticsColumnSuite.withTable(StatisticsColumnSuite.scala:30)
>   at 
> org.apache.spark.sql.StatisticsTest$class.checkColStats(StatisticsTest.scala:33)
>   at 
> 
