[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-12 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095588#comment-15095588
 ] 

Sun Rui commented on SPARK-6817:


Attached is the first draft of the design doc. Please review and give comments.

> DataFrame UDFs in R
> ---
>
> Key: SPARK-6817
> URL: https://issues.apache.org/jira/browse/SPARK-6817
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR, SQL
>Reporter: Shivaram Venkataraman
> Attachments: SparkR UDF Design Documentation v1.pdf
>
>
> This depends on some internal interface of Spark SQL, should be done after 
> merging into Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-12 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095590#comment-15095590
 ] 

Sun Rui commented on SPARK-6817:


[~mpollock], this PR will support row-based UDFs. UDFs operating on columns may 
be supported once R UDAF support is added.

> DataFrame UDFs in R
> ---
>
> Key: SPARK-6817
> URL: https://issues.apache.org/jira/browse/SPARK-6817
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR, SQL
>Reporter: Shivaram Venkataraman
> Attachments: SparkR UDF Design Documentation v1.pdf
>
>
> This depends on some internal interface of Spark SQL, should be done after 
> merging into Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12373) Type coercion rule of dividing two decimal values may choose an intermediate precision that does not have enough number of digits at the left of decimal point

2016-01-12 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-12373:
-
Target Version/s: 2.0.0  (was: 1.6.1, 2.0.0)

> Type coercion rule of dividing two decimal values may choose an intermediate 
> precision that does not have enough number of digits at the left of decimal 
> point 
> ---
>
> Key: SPARK-12373
> URL: https://issues.apache.org/jira/browse/SPARK-12373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> Looks like the {{widerDecimalType}} at 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala#L432
>  can produce something like {{(38, 38)}} when we have the two operand types 
> {{Decimal(38, 0)}} and {{Decimal(38, 38)}}. We should look into whether there 
> is a more reasonable way to handle precision/scale.
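
For illustration, a minimal, self-contained Scala sketch of the usual widening 
arithmetic (this is not the actual HiveTypeCoercion code) shows how the two 
operand types above can collapse to an intermediate type with no digits left of 
the decimal point once precision is capped at 38:

{code}
object DecimalWideningSketch {
  // Assumed system maximum, mirroring Spark SQL's 38-digit decimal limit.
  val MaxPrecision = 38

  // Widen two decimal types: keep enough digits on both sides of the point,
  // then cap the total precision at the maximum.
  def widerDecimal(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) = {
    val intDigits = math.max(p1 - s1, p2 - s2) // digits left of the point
    val scale = math.max(s1, s2)               // digits right of the point
    val precision = math.min(intDigits + scale, MaxPrecision)
    (precision, scale)                         // naive cap: the scale survives intact
  }

  def main(args: Array[String]): Unit = {
    // Decimal(38, 0) vs Decimal(38, 38): 38 integral digits + 38 fractional
    // digits is capped to precision 38, yielding (38, 38), i.e. zero integral digits.
    println(widerDecimal(38, 0, 38, 38)) // prints (38,38)
  }
}
{code}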



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10538) java.lang.NegativeArraySizeException during join

2016-01-12 Thread mayxine (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mayxine updated SPARK-10538:

Attachment: java.lang.NegativeArraySizeException.png

> java.lang.NegativeArraySizeException during join
> 
>
> Key: SPARK-10538
> URL: https://issues.apache.org/jira/browse/SPARK-10538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
>Assignee: Davies Liu
> Attachments: java.lang.NegativeArraySizeException.png, 
> screenshot-1.png
>
>
> Hi,
> I've got a problem when joining tables in PySpark (in my example, 20 of them).
> I can observe that during the calculation of the first partition (on one of 
> the consecutive joins) there is a big shuffle read size (294.7 MB / 146 
> records) versus the other partitions (approx. 272.5 KB / 113 records).
> I can also observe that just before the crash the Python process grows to a 
> few GB of RAM.
> After some time there is an exception:
> {code}
> java.lang.NegativeArraySizeException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:90)
>   at 
> org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:88)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:119)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> I'm running this on a 2-node cluster (12 cores, 64 GB RAM).
> Config:
> {code}
> spark.driver.memory  10g
> spark.executor.extraJavaOptions -XX:-UseGCOverheadLimit -XX:+UseParallelGC 
> -Dfile.encoding=UTF8
> spark.executor.memory   60g
> spark.storage.memoryFraction0.05
> spark.shuffle.memoryFraction0.75
> spark.driver.maxResultSize  10g  
> spark.cores.max 24
> spark.kryoserializer.buffer.max 1g
> spark.default.parallelism   200
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-12 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095600#comment-15095600
 ] 

Sun Rui commented on SPARK-6817:


[~shivaram] I will first focus on the row-based UDF functionality. For high-level 
APIs like dapply(), I think that needs UDAF support, which is not included in 
this PR yet. I can create a new JIRA for supporting R UDAF. Any comments?

> DataFrame UDFs in R
> ---
>
> Key: SPARK-6817
> URL: https://issues.apache.org/jira/browse/SPARK-6817
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR, SQL
>Reporter: Shivaram Venkataraman
> Attachments: SparkR UDF Design Documentation v1.pdf
>
>
> This depends on some internal interface of Spark SQL, should be done after 
> merging into Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12558) AnalysisException when multiple functions applied in GROUP BY clause

2016-01-12 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-12558:
-
Assignee: Dilip Biswal

> AnalysisException when multiple functions applied in GROUP BY clause
> 
>
> Key: SPARK-12558
> URL: https://issues.apache.org/jira/browse/SPARK-12558
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>Assignee: Dilip Biswal
>
> Hi,
> I have the following issue when trying to use functions in the GROUP BY clause. 
> Example:
> {code}
> sqlCtx = HiveContext(sc)
> rdd = sc.parallelize([{'test_date': 1451400761}])
> df = sqlCtx.createDataFrame(rdd)
> df.registerTempTable("df")
> {code}
> When I use a single function, it's OK.
> {code}
> sqlCtx.sql("select cast(test_date as timestamp) from df group by 
> cast(test_date as timestamp)").collect()
> [Row(test_date=datetime.datetime(2015, 12, 29, 15, 52, 41))]
> {code}
> When I use more than one function, I get an AnalysisException:
> {code}
> sqlCtx.sql("select date(cast(test_date as timestamp)) from df group by 
> date(cast(test_date as timestamp))").collect()
> Py4JJavaError: An error occurred while calling o38.sql.
> : org.apache.spark.sql.AnalysisException: expression 'test_date' is neither 
> present in the group by, nor is it an aggregate function. Add to group by or 
> wrap in first() (or first_value) if you don't care which value you get.;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12796) initial prototype: projection/filter/range

2016-01-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12796:


Assignee: Apache Spark  (was: Davies Liu)

> initial prototype: projection/filter/range
> --
>
> Key: SPARK-12796
> URL: https://issues.apache.org/jira/browse/SPARK-12796
> Project: Spark
>  Issue Type: New Feature
>Reporter: Davies Liu
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12796) initial prototype: projection/filter/range

2016-01-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095710#comment-15095710
 ] 

Apache Spark commented on SPARK-12796:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/10735

> initial prototype: projection/filter/range
> --
>
> Key: SPARK-12796
> URL: https://issues.apache.org/jira/browse/SPARK-12796
> Project: Spark
>  Issue Type: New Feature
>Reporter: Davies Liu
>Assignee: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12796) initial prototype: projection/filter/range

2016-01-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12796:


Assignee: Davies Liu  (was: Apache Spark)

> initial prototype: projection/filter/range
> --
>
> Key: SPARK-12796
> URL: https://issues.apache.org/jira/browse/SPARK-12796
> Project: Spark
>  Issue Type: New Feature
>Reporter: Davies Liu
>Assignee: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6817) DataFrame UDFs in R

2016-01-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095721#comment-15095721
 ] 

Reynold Xin edited comment on SPARK-6817 at 1/13/16 6:57 AM:
-

[~sunrui]

Why are you focusing on a row-based API? I think a block oriented API in the 
original Google Docs makes more sense. I also don't think the UDF should depend 
on RRDD, because we are going to remove RRDD from Spark once the UDFs are 
implemented.




was (Author: rxin):
[~sunrui]

Why are you focusing on a row-based API? I think a block oriented API in the 
original Google Docs makes more sense. I also don't want the UDF to depend on 
RRDD, because we are going to remove RRDD from Spark once the UDFs are 
implemented.



> DataFrame UDFs in R
> ---
>
> Key: SPARK-6817
> URL: https://issues.apache.org/jira/browse/SPARK-6817
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR, SQL
>Reporter: Shivaram Venkataraman
> Attachments: SparkR UDF Design Documentation v1.pdf
>
>
> This depends on some internal interface of Spark SQL, should be done after 
> merging into Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6817) DataFrame UDFs in R

2016-01-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095721#comment-15095721
 ] 

Reynold Xin edited comment on SPARK-6817 at 1/13/16 6:58 AM:
-

[~sunrui]

Why are you focusing on a row-based API? I think a block oriented API in the 
original Google Docs makes more sense. I also don't think the UDF should depend 
on RRDD, because we are going to remove RRDD from Spark once the UDFs are 
implemented.

In order to support the row-oriented API efficiently, we'd need to replicate 
all the infrastructure built for Python. I don't think that is maintainable in 
the long run.



was (Author: rxin):
[~sunrui]

Why are you focusing on a row-based API? I think a block oriented API in the 
original Google Docs makes more sense. I also don't think the UDF should depend 
on RRDD, because we are going to remove RRDD from Spark once the UDFs are 
implemented.



> DataFrame UDFs in R
> ---
>
> Key: SPARK-6817
> URL: https://issues.apache.org/jira/browse/SPARK-6817
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR, SQL
>Reporter: Shivaram Venkataraman
> Attachments: SparkR UDF Design Documentation v1.pdf
>
>
> This depends on some internal interface of Spark SQL, should be done after 
> merging into Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12792) Refactor RRDD to support R UDF

2016-01-12 Thread Sun Rui (JIRA)
Sun Rui created SPARK-12792:
---

 Summary: Refactor RRDD to support R UDF
 Key: SPARK-12792
 URL: https://issues.apache.org/jira/browse/SPARK-12792
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Affects Versions: 1.6.0
Reporter: Sun Rui


Extract the logic in compute() into a new class named RRunner, similar to 
PythonRunner used by PythonRDD. It can then be used to run R UDFs.
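
For illustration only, a rough sketch of what such an extraction could look 
like; the names and signatures below are assumptions, not the actual Spark 
internals:

{code}
import java.io.{DataInputStream, DataOutputStream}

// Hypothetical RRunner: the process management and data piping that today
// live in RRDD.compute(), pulled out so that both RRDD and R UDF evaluation
// can reuse them (mirroring how PythonRunner is shared on the Python side).
class RRunner[IN, OUT](
    func: Array[Byte],                        // serialized R closure
    writeOne: (IN, DataOutputStream) => Unit, // ship one input record to the R worker
    readOne: DataInputStream => OUT) {        // read one result record back

  // Launch the R worker, stream the partition through it, and return an
  // iterator over the results. Daemon reuse, environment setup and error
  // handling are omitted in this sketch.
  def compute(input: Iterator[IN], partitionIndex: Int): Iterator[OUT] = {
    // ... start the R process, write `func` and `input` with `writeOne`,
    // ... then read results back with `readOne`.
    Iterator.empty // placeholder
  }
}
{code}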



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-12 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095594#comment-15095594
 ] 

Sun Rui commented on SPARK-6817:


[~piccolbo] I am not sure if I understand your meaning. This is to support UDFs 
written in R code. Spark already supports Scala/Python UDFs.

> DataFrame UDFs in R
> ---
>
> Key: SPARK-6817
> URL: https://issues.apache.org/jira/browse/SPARK-6817
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR, SQL
>Reporter: Shivaram Venkataraman
> Attachments: SparkR UDF Design Documentation v1.pdf
>
>
> This depends on some internal interface of Spark SQL, should be done after 
> merging into Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12797) Aggregation without grouping keys

2016-01-12 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12797:
--

 Summary: Aggregation without grouping keys
 Key: SPARK-12797
 URL: https://issues.apache.org/jira/browse/SPARK-12797
 Project: Spark
  Issue Type: New Feature
Reporter: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12796) initial prototype: projection/filter/range

2016-01-12 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12796:
--

 Summary: initial prototype: projection/filter/range
 Key: SPARK-12796
 URL: https://issues.apache.org/jira/browse/SPARK-12796
 Project: Spark
  Issue Type: New Feature
Reporter: Davies Liu
Assignee: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6817) DataFrame UDFs in R

2016-01-12 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095734#comment-15095734
 ] 

Jeff Zhang edited comment on SPARK-6817 at 1/13/16 7:09 AM:


+1 on a block-based API. UDFs would usually call other R packages, and most R 
packages are block based (R's data.frame), which leads to a performance gain.


was (Author: zjffdu):
+1 on block based API, UDF would usually call other R packages and most of R 
packages are for block based (R's dataframe), and this lead performance gain.

> DataFrame UDFs in R
> ---
>
> Key: SPARK-6817
> URL: https://issues.apache.org/jira/browse/SPARK-6817
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR, SQL
>Reporter: Shivaram Venkataraman
> Attachments: SparkR UDF Design Documentation v1.pdf
>
>
> This depends on some internal interface of Spark SQL, should be done after 
> merging into Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-12 Thread Weiqiang Zhuang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095756#comment-15095756
 ] 

Weiqiang Zhuang commented on SPARK-6817:


We did see both apply use cases, but the block/group/column-oriented apply is 
more important, so it would help to have it earlier.

> DataFrame UDFs in R
> ---
>
> Key: SPARK-6817
> URL: https://issues.apache.org/jira/browse/SPARK-6817
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR, SQL
>Reporter: Shivaram Venkataraman
> Attachments: SparkR UDF Design Documentation v1.pdf
>
>
> This depends on some internal interface of Spark SQL, should be done after 
> merging into Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6817) DataFrame UDFs in R

2016-01-12 Thread Sun Rui (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sun Rui updated SPARK-6817:
---
Attachment: SparkR UDF Design Documentation v1.pdf

> DataFrame UDFs in R
> ---
>
> Key: SPARK-6817
> URL: https://issues.apache.org/jira/browse/SPARK-6817
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR, SQL
>Reporter: Shivaram Venkataraman
> Attachments: SparkR UDF Design Documentation v1.pdf
>
>
> This depends on some internal interface of Spark SQL, should be done after 
> merging into Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12558) AnalysisException when multiple functions applied in GROUP BY clause

2016-01-12 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-12558.
--
   Resolution: Fixed
Fix Version/s: 1.6.1
   2.0.0

Issue resolved by pull request 10520
[https://github.com/apache/spark/pull/10520]

> AnalysisException when multiple functions applied in GROUP BY clause
> 
>
> Key: SPARK-12558
> URL: https://issues.apache.org/jira/browse/SPARK-12558
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>Assignee: Dilip Biswal
> Fix For: 2.0.0, 1.6.1
>
>
> Hi,
> I have the following issue when trying to use functions in the GROUP BY clause. 
> Example:
> {code}
> sqlCtx = HiveContext(sc)
> rdd = sc.parallelize([{'test_date': 1451400761}])
> df = sqlCtx.createDataFrame(rdd)
> df.registerTempTable("df")
> {code}
> When I use a single function, it's OK.
> {code}
> sqlCtx.sql("select cast(test_date as timestamp) from df group by 
> cast(test_date as timestamp)").collect()
> [Row(test_date=datetime.datetime(2015, 12, 29, 15, 52, 41))]
> {code}
> When I use more than one function, I get an AnalysisException:
> {code}
> sqlCtx.sql("select date(cast(test_date as timestamp)) from df group by 
> date(cast(test_date as timestamp))").collect()
> Py4JJavaError: An error occurred while calling o38.sql.
> : org.apache.spark.sql.AnalysisException: expression 'test_date' is neither 
> present in the group by, nor is it an aggregate function. Add to group by or 
> wrap in first() (or first_value) if you don't care which value you get.;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12798) Broadcast hash join

2016-01-12 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12798:
--

 Summary: Broadcast hash join
 Key: SPARK-12798
 URL: https://issues.apache.org/jira/browse/SPARK-12798
 Project: Spark
  Issue Type: New Feature
Reporter: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12728) Integrate SQL generation feature with native view

2016-01-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12728:


Assignee: (was: Apache Spark)

> Integrate SQL generation feature with native view
> -
>
> Key: SPARK-12728
> URL: https://issues.apache.org/jira/browse/SPARK-12728
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12785) Implement columnar in memory representation

2016-01-12 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12785.
-
   Resolution: Fixed
 Assignee: Nong Li
Fix Version/s: 2.0.0

> Implement columnar in memory representation
> ---
>
> Key: SPARK-12785
> URL: https://issues.apache.org/jira/browse/SPARK-12785
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Nong Li
>Assignee: Nong Li
> Fix For: 2.0.0
>
>
> Tungsten can benefit from having a columnar in memory representation which 
> can provide a few benefits:
>  - Enables vectorized execution
>  - Improves memory efficiency (memory is more tightly packed)
>  - Enables cheap serialization/zero-copy transfer with third party components 
> (e.g. numpy)
>  
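
As a toy illustration of why a columnar layout helps (this is not Spark's 
actual column batch API): the values of a column sit in one contiguous 
primitive array, so execution can run tight, allocation-free loops over them 
and the buffer can be handed to other columnar consumers cheaply:

{code}
// Minimal sketch of a column vector: one primitive array per column.
final class IntColumnVector(capacity: Int) {
  private val values = new Array[Int](capacity)
  def put(i: Int, v: Int): Unit = values(i) = v
  def get(i: Int): Int = values(i)
}

object ColumnarSketch {
  def main(args: Array[String]): Unit = {
    val n = 1 << 20
    val col = new IntColumnVector(n)
    var i = 0
    while (i < n) { col.put(i, i); i += 1 }

    // Vectorized-style scan: no per-row object allocation, cache friendly.
    var sum = 0L
    i = 0
    while (i < n) { sum += col.get(i); i += 1 }
    println(sum)
  }
}
{code}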



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12790) Remove HistoryServer old multiple files format

2016-01-12 Thread Andrew Or (JIRA)
Andrew Or created SPARK-12790:
-

 Summary: Remove HistoryServer old multiple files format
 Key: SPARK-12790
 URL: https://issues.apache.org/jira/browse/SPARK-12790
 Project: Spark
  Issue Type: Sub-task
  Components: Deploy
Reporter: Andrew Or


HistoryServer has two formats. The old one creates a directory and puts multiple 
files in there (APPLICATION_COMPLETE, EVENT_LOG1, etc.). The new one has just one 
file, called something like local_2593759238651.log.

It's been a nightmare to maintain both code paths. We should just remove the 
old legacy format (which has been out of use for many versions now) while we 
still have the chance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12791) Simplify CaseWhen by breaking "branches" into "conditions" and "values"

2016-01-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095537#comment-15095537
 ] 

Apache Spark commented on SPARK-12791:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/10734

> Simplify CaseWhen by breaking "branches" into "conditions" and "values"
> ---
>
> Key: SPARK-12791
> URL: https://issues.apache.org/jira/browse/SPARK-12791
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12791) Simplify CaseWhen by breaking "branches" into "conditions" and "values"

2016-01-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12791:


Assignee: Apache Spark  (was: Reynold Xin)

> Simplify CaseWhen by breaking "branches" into "conditions" and "values"
> ---
>
> Key: SPARK-12791
> URL: https://issues.apache.org/jira/browse/SPARK-12791
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12172) Consider removing SparkR internal RDD APIs

2016-01-12 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095605#comment-15095605
 ] 

Sun Rui commented on SPARK-12172:
-

As Spark is migrating from the RDD API to the Dataset API, we can remove the 
RDD API after the Dataset API is supported in SparkR.

> Consider removing SparkR internal RDD APIs
> --
>
> Key: SPARK-12172
> URL: https://issues.apache.org/jira/browse/SPARK-12172
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Felix Cheung
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12172) Consider removing SparkR internal RDD APIs

2016-01-12 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095605#comment-15095605
 ] 

Sun Rui edited comment on SPARK-12172 at 1/13/16 4:50 AM:
--

As Spark is migrating from the RDD API to the Dataset API, we can remove the 
RDD API after the Dataset API is supported in SparkR. But I am not sure if the 
Dataset API is mature enough in 2.0.


was (Author: sunrui):
As Spark is migrating from RDD API to Dataset API, after Dataset API is 
supported in SparkR, we can remove RDD API

> Consider removing SparkR internal RDD APIs
> --
>
> Key: SPARK-12172
> URL: https://issues.apache.org/jira/browse/SPARK-12172
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Felix Cheung
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-12 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095734#comment-15095734
 ] 

Jeff Zhang commented on SPARK-6817:
---

+1 on a block-based API. UDFs would usually call other R packages, and most R 
packages are block based (R's data.frame), which leads to a performance gain.

> DataFrame UDFs in R
> ---
>
> Key: SPARK-6817
> URL: https://issues.apache.org/jira/browse/SPARK-6817
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR, SQL
>Reporter: Shivaram Venkataraman
> Attachments: SparkR UDF Design Documentation v1.pdf
>
>
> This depends on some internal interface of Spark SQL, should be done after 
> merging into Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12800) Subtle bug on Spark Yarn Client under Kerberos Security Mode

2016-01-12 Thread Chester (JIRA)
Chester created SPARK-12800:
---

 Summary: Subtle bug on Spark Yarn Client under Kerberos Security 
Mode
 Key: SPARK-12800
 URL: https://issues.apache.org/jira/browse/SPARK-12800
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.2, 1.5.1
Reporter: Chester


Version used: Spark 1.5.1 (1.5.2-SNAPSHOT) 
Deployment Mode: Yarn-Cluster
Problem observed: 
  When running a Spark job directly from YarnClient (without using spark-submit; 
I did not verify whether spark-submit has the same issue) with Kerberos security 
enabled, the first run of the Spark job always fails. The failure is due to 
Hadoop considering the job to be in SIMPLE mode rather than Kerberos mode. But 
without shutting down the JVM, running the same job again passes. If one 
restarts the JVM, the Spark job will fail again on the first run. 

The cause: 
  Tracking down the source of the issue, I found that the problem seems to lie 
in Spark's YARN Client.scala. In the Client's prepareLocalResources() method 
(around line 266 of Client.scala), the following line is called:

 YarnSparkHadoopUtil.get.obtainTokensForNamenodes(nns, hadoopConf, credentials)

YarnSparkHadoopUtil.get is in turn initialized via reflection:


object SparkHadoopUtil {

  private val hadoop = {
    val yarnMode = java.lang.Boolean.valueOf(
      System.getProperty("SPARK_YARN_MODE", System.getenv("SPARK_YARN_MODE")))
    if (yarnMode) {
      try {
        Utils.classForName("org.apache.spark.deploy.yarn.YarnSparkHadoopUtil")
          .newInstance()
          .asInstanceOf[SparkHadoopUtil]
      } catch {
        case e: Exception => throw new SparkException("Unable to load YARN support", e)
      }
    } else {
      new SparkHadoopUtil
    }
  }

  def get: SparkHadoopUtil = {
    hadoop
  }
}

class SparkHadoopUtil extends Logging {
  private val sparkConf = new SparkConf()
  val conf: Configuration = newConfiguration(sparkConf)
  UserGroupInformation.setConfiguration(conf)

  // ... rest of the class ...
}

Here SparkHadoopUtil creates an empty SparkConf, builds a Hadoop Configuration 
from it, and sets that on UserGroupInformation:

  UserGroupInformation.setConfiguration(conf)

As UserGroupInformation.authenticationMethod is static, the call above wipes 
out the security settings: UserGroupInformation.isSecurityEnabled() changes 
from true to false. Thus the subsequent call will fail. 

Since SparkHadoopUtil.hadoop is a static, immutable value, it is not created 
again on the next run, so UserGroupInformation.setConfiguration(conf) is not 
called again, and the subsequent Spark jobs work. 

The workaround: 

// First initialize SparkHadoopUtil, which creates a static instance
// and sets UserGroupInformation to an empty Hadoop Configuration.
// We need to reset the UserGroupInformation after that.
val util = SparkHadoopUtil.get
UserGroupInformation.setConfiguration(hadoopConf)

Then call 
  client.run() 

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12771) Improve code generation for CaseWhen

2016-01-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12771:


Assignee: (was: Apache Spark)

> Improve code generation for CaseWhen
> 
>
> Key: SPARK-12771
> URL: https://issues.apache.org/jira/browse/SPARK-12771
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The generated code for CaseWhen uses a control variable "got" to make sure we 
> do not evaluate more branches once a branch is true. Changing that to 
> generate just simple "if / else" would be slightly more efficient.
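
For illustration, the two shapes can be sketched in plain Scala (the real 
generated code is Java emitted by codegen); the first mirrors the current 
control-variable style, the second the proposed plain if / else:

{code}
object CaseWhenShapes {
  // Current style: a "got" flag guards every branch check.
  def withFlag(x: Int): String = {
    var got = false
    var result: String = null
    if (!got && x < 0) { result = "negative"; got = true }
    if (!got && x == 0) { result = "zero"; got = true }
    if (!got) { result = "positive" }
    result
  }

  // Proposed style: simple nested if / else, no flag to re-test.
  def withIfElse(x: Int): String =
    if (x < 0) "negative"
    else if (x == 0) "zero"
    else "positive"

  def main(args: Array[String]): Unit = {
    assert(Seq(-1, 0, 1).forall(i => withFlag(i) == withIfElse(i)))
    println("both shapes agree")
  }
}
{code}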



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12771) Improve code generation for CaseWhen

2016-01-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12771:


Assignee: Apache Spark

> Improve code generation for CaseWhen
> 
>
> Key: SPARK-12771
> URL: https://issues.apache.org/jira/browse/SPARK-12771
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> The generated code for CaseWhen uses a control variable "got" to make sure we 
> do not evaluate more branches once a branch is true. Changing that to 
> generate just simple "if / else" would be slightly more efficient.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12771) Improve code generation for CaseWhen

2016-01-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095790#comment-15095790
 ] 

Apache Spark commented on SPARK-12771:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/10737

> Improve code generation for CaseWhen
> 
>
> Key: SPARK-12771
> URL: https://issues.apache.org/jira/browse/SPARK-12771
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The generated code for CaseWhen uses a control variable "got" to make sure we 
> do not evaluate more branches once a branch is true. Changing that to 
> generate just simple "if / else" would be slightly more efficient.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12635) More efficient (column batch) serialization for Python/R

2016-01-12 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095494#comment-15095494
 ] 

Sun Rui commented on SPARK-12635:
-

[~dselivanov] PySpark uses pickle and CloudPickle on the Python side and 
net.razorvine.pickle on the JVM side for data serialization/deserialization 
between Python and the JVM. However, there is no library similar to 
net.razorvine.pickle that can deserialize from and serialize to the R 
serialization format. So currently, SparkR depends on readBin()/writeBin() on 
the R side and Java DataInputStream/DataOutputStream on the JVM side for 
serialization/deserialization between R and the JVM, relying on the fact that 
simple types like integer, double, and byte array share the same format.

For collect(), the serialization/deserialization happens along with the 
communication over the socket. I suspect there is a lot of communication 
overhead from the many socket reads/writes. Maybe we can change the behavior to 
work in a batched way, that is, serialize part of the collected result into an 
in-memory buffer and transfer it back in one go. Would you be interested in 
doing a prototype to see if there is any performance improvement?

Another idea would be to introduce something like net.razorvine.pickle, but 
that sounds like a lot of effort.
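
For illustration, a minimal Scala sketch of the batching idea (the names are 
made up, not SparkR internals): serialize a group of records into an in-memory 
buffer and push the whole buffer through the socket as one length-prefixed 
write, instead of many small writes:

{code}
import java.io.{ByteArrayOutputStream, DataOutputStream, OutputStream}

object BatchedTransferSketch {
  // Serialize `batchSize` records at a time into a byte buffer, then hand
  // each buffer to the socket stream as a single length-prefixed write.
  def writeBatched[T](records: Iterator[T],
                      batchSize: Int,
                      socketOut: OutputStream)(writeOne: (T, DataOutputStream) => Unit): Unit = {
    val out = new DataOutputStream(socketOut)
    records.grouped(batchSize).foreach { batch =>
      val buffer = new ByteArrayOutputStream()
      val batchOut = new DataOutputStream(buffer)
      batch.foreach(writeOne(_, batchOut))
      batchOut.flush()
      val bytes = buffer.toByteArray
      out.writeInt(bytes.length) // length prefix so the reader knows how much to consume
      out.write(bytes)
    }
    out.flush()
  }
}
{code}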

> More efficient (column batch) serialization for Python/R
> 
>
> Key: SPARK-12635
> URL: https://issues.apache.org/jira/browse/SPARK-12635
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SparkR, SQL
>Reporter: Reynold Xin
>
> Serialization between Scala / Python / R is pretty slow. Python and R both 
> work pretty well with column batch interface (e.g. numpy arrays). Technically 
> we should be able to just pass column batches around with minimal 
> serialization (maybe even zero copy memory).
> Note that this depends on some internal refactoring to use a column batch 
> interface in Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12635) More efficient (column batch) serialization for Python/R

2016-01-12 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095494#comment-15095494
 ] 

Sun Rui edited comment on SPARK-12635 at 1/13/16 2:35 AM:
--

[~dselivanov] PySpark uses pickle and CloudPickle on the Python side and 
net.razorvine.pickle on the JVM side for data serialization/deserialization 
between Python and the JVM. However, there is no library similar to 
net.razorvine.pickle that can deserialize from and serialize to the R 
serialization format. So currently, SparkR depends on readBin()/writeBin() on 
the R side and Java DataInputStream/DataOutputStream on the JVM side for 
serialization/deserialization between R and the JVM, relying on the fact that 
simple types like integer, double, and byte array share the same format.

For collect(), the serialization/deserialization happens along with the 
communication over the socket. I suspect there is a lot of communication 
overhead from the many socket reads/writes. Maybe we can change the behavior to 
work in a batched way, that is, serialize part of the collected result into an 
in-memory buffer and transfer it back in one go. Would you be interested in 
doing a prototype to see if there is any performance improvement?

Another idea would be to introduce something like net.razorvine.pickle, but 
that sounds like a lot of effort.


was (Author: sunrui):
[~dselivanov] PySpark uses pickle and CloudPickle on python side and 
net.razorvine.pickle on JVM side for data serialization/deserialization between 
Python and JVM. While there lacks a library similar to net.razorvine.pickle 
which can deserialize from and serialize to R serialization format. So 
currently, SparkR depends on ReadBin()/writeBin() on R side and 
DataInputStream/DataOutputStream for serialization/deserialization between R 
and JVM, based on the fact that for simple types like integer, double, array 
byte, they shares the same format.

For collect(), the serialization/deserialization happens along with the 
communication via socket. I suspect there are much communication overhead 
occurring during many socket reads/writes.  Maybe we can change the behavior in 
batch way, that is, serialize part of the collection result into a buffer in 
memory and transfer it back. Would you interested in doing a prototype and see 
if there is any performance improvement?

Another idea would be introduce something like net.razorvine.pickle, but that 
sounds a lot of effort.

> More efficient (column batch) serialization for Python/R
> 
>
> Key: SPARK-12635
> URL: https://issues.apache.org/jira/browse/SPARK-12635
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SparkR, SQL
>Reporter: Reynold Xin
>
> Serialization between Scala / Python / R is pretty slow. Python and R both 
> work pretty well with column batch interface (e.g. numpy arrays). Technically 
> we should be able to just pass column batches around with minimal 
> serialization (maybe even zero copy memory).
> Note that this depends on some internal refactoring to use a column batch 
> interface in Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12788) Simplify BooleanEquality by using casts

2016-01-12 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12788.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Simplify BooleanEquality by using casts
> ---
>
> Key: SPARK-12788
> URL: https://issues.apache.org/jira/browse/SPARK-12788
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12793) Support R UDF Evaluation

2016-01-12 Thread Sun Rui (JIRA)
Sun Rui created SPARK-12793:
---

 Summary: Support R UDF Evaluation
 Key: SPARK-12793
 URL: https://issues.apache.org/jira/browse/SPARK-12793
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Affects Versions: 1.6.0
Reporter: Sun Rui


Basically follow the same logic as Python UDF evaluation 
(org/apache/spark/sql/execution/python.scala), extracting and reusing the 
common logic between Python UDF and R UDF evaluation.
Serialization/deserialization is different from Python UDFs: the R UDF will use 
the R SerDe to directly serialize a batch of InternalRows into bytes (that is, 
in SerializationFormats.ROW format).
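
A very rough outline of that evaluation flow, with all names hypothetical 
(mirroring the Python UDF path rather than quoting actual SparkR code):

{code}
// Hypothetical outline of evaluating an R UDF over one partition.
object RUdfEvalSketch {
  type RowBytes = Array[Byte] // an InternalRow already serialized in ROW format

  def evalPartition(
      rows: Iterator[RowBytes],
      batchSize: Int,
      runRBatch: Seq[RowBytes] => Seq[RowBytes] // ship a batch to the R worker, get results back
  ): Iterator[RowBytes] = {
    // 1. Group the input rows into batches.
    // 2. Send each batch to the R worker (the R SerDe handles the bytes).
    // 3. Stream the returned rows back in order.
    rows.grouped(batchSize).flatMap(runRBatch)
  }
}
{code}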




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12794) Support Defining and Registration of R UDF

2016-01-12 Thread Sun Rui (JIRA)
Sun Rui created SPARK-12794:
---

 Summary: Support Defining and Registration of R UDF
 Key: SPARK-12794
 URL: https://issues.apache.org/jira/browse/SPARK-12794
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Affects Versions: 1.6.0
Reporter: Sun Rui


Create UserDefinedRFunction class in Scala similar to UserDefinedPythonFunction 
class.
Support registering R UDF in UDFRegistration class.
Implement udf() function in functions.R.
Implement registerFunction() in SQLContext.R.
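
For illustration, a hypothetical shape for the Scala side (modeled loosely on 
the Python counterpart; the fields and types below are assumptions, not the 
real API):

{code}
// Placeholder for Spark's DataType, to keep the sketch self-contained.
case class SimpleDataType(name: String)

// Hypothetical UserDefinedRFunction: carries the serialized R closure plus
// whatever the R worker needs to evaluate it, and a declared return type.
case class UserDefinedRFunction(
    name: String,
    func: Array[Byte],          // serialized R closure
    packageNames: Array[Byte],  // serialized list of R packages to load on the worker
    dataType: SimpleDataType) { // declared return type of the UDF

  // In Spark this would build an expression over the given columns; here we
  // only render the call for illustration.
  def apply(columnNames: String*): String =
    s"$name(${columnNames.mkString(", ")}): ${dataType.name}"
}

object UserDefinedRFunctionSketch {
  def main(args: Array[String]): Unit = {
    val toUpper = UserDefinedRFunction("r_to_upper", Array.empty[Byte],
      Array.empty[Byte], SimpleDataType("string"))
    println(toUpper("name")) // r_to_upper(name): string
  }
}
{code}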




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12692) Scala style: check no white space before comma and colon

2016-01-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095714#comment-15095714
 ] 

Apache Spark commented on SPARK-12692:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/10736

> Scala style: check no white space before comma and colon
> 
>
> Key: SPARK-12692
> URL: https://issues.apache.org/jira/browse/SPARK-12692
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>
> We should not put white space before `,` and `:`, so let's check for it.
> Because there are lots of style violations, I'd first like to add a checker, 
> enable it, and set its level to `warn`.
> Then I'd like to fix the style step by step.
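
For example (illustration only), the proposed check would flag the first 
definition below and accept the second:

{code}
object StyleExamples {
  // Would be flagged: white space before ':' and before ','.
  def bad(x : Int , y : Int): Int = x + y

  // Clean: no white space before ':' or ','.
  def good(x: Int, y: Int): Int = x + y
}
{code}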



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095747#comment-15095747
 ] 

Reynold Xin commented on SPARK-6817:


Please take a look at the original design doc for this: 
https://docs.google.com/document/d/1xa8gB705QFybQD7qEe-NcZZOtkfA1YY-eVhyaXtAtOM/edit



> DataFrame UDFs in R
> ---
>
> Key: SPARK-6817
> URL: https://issues.apache.org/jira/browse/SPARK-6817
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR, SQL
>Reporter: Shivaram Venkataraman
> Attachments: SparkR UDF Design Documentation v1.pdf
>
>
> This depends on some internal interface of Spark SQL, should be done after 
> merging into Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-12 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095745#comment-15095745
 ] 

Sun Rui commented on SPARK-6817:


[~rxin] The row-oriented R UDF is for SQL and is similar to the Python UDF. I am 
not making the R UDF depend on RRDD, but abstracting the reusable logic so that 
it can be shared by RRDD and the R UDF, which is also similar to the Python UDF. 
I don't know what "block" means in "block oriented API"; something like 
GroupedData? I think that depends on UDAF support, which will come after UDF 
support. Maybe I have misunderstood something?

> DataFrame UDFs in R
> ---
>
> Key: SPARK-6817
> URL: https://issues.apache.org/jira/browse/SPARK-6817
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR, SQL
>Reporter: Shivaram Venkataraman
> Attachments: SparkR UDF Design Documentation v1.pdf
>
>
> This depends on some internal interface of Spark SQL, should be done after 
> merging into Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12799) Simplify various string output for expressions

2016-01-12 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-12799:
---

 Summary: Simplify various string output for expressions
 Key: SPARK-12799
 URL: https://issues.apache.org/jira/browse/SPARK-12799
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin


We currently have "sql", "prettyString", "toString".

The default implementation of "prettyString" is simply "toString" but replaced 
the AttributeReferences with PrettyAttributes. I think we can just remove the 
existing "sql" one, and rename "prettyString" to "sql". We might need to do a 
little bit cleanup to make the prettyString work.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12728) Integrate SQL generation feature with native view

2016-01-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095478#comment-15095478
 ] 

Apache Spark commented on SPARK-12728:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/10733

> Integrate SQL generation feature with native view
> -
>
> Key: SPARK-12728
> URL: https://issues.apache.org/jira/browse/SPARK-12728
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12791) Simplify CaseWhen by breaking "branches" into "conditions" and "values"

2016-01-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12791:


Assignee: Reynold Xin  (was: Apache Spark)

> Simplify CaseWhen by breaking "branches" into "conditions" and "values"
> ---
>
> Key: SPARK-12791
> URL: https://issues.apache.org/jira/browse/SPARK-12791
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12791) Simplify CaseWhen by breaking "branches" into "conditions" and "values"

2016-01-12 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-12791:
---

 Summary: Simplify CaseWhen by breaking "branches" into 
"conditions" and "values"
 Key: SPARK-12791
 URL: https://issues.apache.org/jira/browse/SPARK-12791
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4226) SparkSQL - Add support for subqueries in predicates

2016-01-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095695#comment-15095695
 ] 

Apache Spark commented on SPARK-4226:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/10706

> SparkSQL - Add support for subqueries in predicates
> ---
>
> Key: SPARK-4226
> URL: https://issues.apache.org/jira/browse/SPARK-4226
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0
> Environment: Spark 1.2 snapshot
>Reporter: Terry Siu
>
> I have a test table defined in Hive as follows:
> {code:sql}
> CREATE TABLE sparkbug (
>   id INT,
>   event STRING
> ) STORED AS PARQUET;
> {code}
> and insert some sample data with ids 1, 2, 3.
> In a Spark shell, I then create a HiveContext and then execute the following 
> HQL to test out subquery predicates:
> {code}
> val hc = new HiveContext(sc)
> hc.hql("select customerid from sparkbug where customerid in (select 
> customerid from sparkbug where customerid in (2,3))")
> {code}
> I get the following error:
> {noformat}
> java.lang.RuntimeException: Unsupported language features in query: select 
> customerid from sparkbug where customerid in (select customerid from sparkbug 
> where customerid in (2,3))
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> sparkbug
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_DIR
> TOK_TMP_FILE
> TOK_SELECT
>   TOK_SELEXPR
> TOK_TABLE_OR_COL
>   customerid
> TOK_WHERE
>   TOK_SUBQUERY_EXPR
> TOK_SUBQUERY_OP
>   in
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> sparkbug
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_DIR
> TOK_TMP_FILE
> TOK_SELECT
>   TOK_SELEXPR
> TOK_TABLE_OR_COL
>   customerid
> TOK_WHERE
>   TOK_FUNCTION
> in
> TOK_TABLE_OR_COL
>   customerid
> 2
> 3
> TOK_TABLE_OR_COL
>   customerid
> scala.NotImplementedError: No parse rules for ASTNode type: 817, text: 
> TOK_SUBQUERY_EXPR :
> TOK_SUBQUERY_EXPR
>   TOK_SUBQUERY_OP
> in
>   TOK_QUERY
> TOK_FROM
>   TOK_TABREF
> TOK_TABNAME
>   sparkbug
> TOK_INSERT
>   TOK_DESTINATION
> TOK_DIR
>   TOK_TMP_FILE
>   TOK_SELECT
> TOK_SELEXPR
>   TOK_TABLE_OR_COL
> customerid
>   TOK_WHERE
> TOK_FUNCTION
>   in
>   TOK_TABLE_OR_COL
> customerid
>   2
>   3
>   TOK_TABLE_OR_COL
> customerid
> " +
>  
> org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098)
> 
> at scala.sys.package$.error(package.scala:27)
> at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)
> at 
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
> {noformat}
> [This 
> thread|http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-td17401.html]
>  also brings up the lack of subquery support in SparkSQL. It would be nice to 
> have subquery predicate support in a near-future release (1.3, maybe?).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12795) Whole stage codegen

2016-01-12 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-12795:
---
Description: 
Whole stage codegen is used by some modern MPP databases to achieve great 
performance. See http://www.vldb.org/pvldb/vol4/p539-neumann.pdf

For Spark SQL, we can compile multiple operators into a single Java function to 
avoid the overhead of materializing rows and chaining Scala iterators.

  was:Compile multiple operator into a single Java function to avoid the 
overhead from materialize rows and Scala iterator


> Whole stage codegen
> ---
>
> Key: SPARK-12795
> URL: https://issues.apache.org/jira/browse/SPARK-12795
> Project: Spark
>  Issue Type: Epic
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Whole stage codegen is used by some modern MPP databases to achieve great 
> performance. See http://www.vldb.org/pvldb/vol4/p539-neumann.pdf
> For Spark SQL, we can compile multiple operators into a single Java function 
> to avoid the overhead of materializing rows and chaining Scala iterators.
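
To illustrate the idea with a toy sketch (plain Scala, not Spark's actual 
generated code): rather than chaining Iterator operators that hand a row from 
one next() call to the next, the whole stage becomes a single loop with the 
filter and projection applied inline:

{code}
object WholeStageSketch {
  val data: Array[Long] = (1L to 1000000L).toArray

  // Operator-at-a-time style: each operator is its own Iterator, and every
  // row is passed between iterators one at a time.
  def iteratorChain(): Long =
    data.iterator.filter(_ % 2 == 0).map(_ * 3).sum

  // Whole-stage style: the same filter + project + aggregate fused into a
  // single tight loop with no intermediate iterators.
  def fusedLoop(): Long = {
    var sum = 0L
    var i = 0
    while (i < data.length) {
      val v = data(i)
      if (v % 2 == 0) sum += v * 3
      i += 1
    }
    sum
  }

  def main(args: Array[String]): Unit = {
    assert(iteratorChain() == fusedLoop())
    println(fusedLoop())
  }
}
{code}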



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-12 Thread Antonio Piccolboni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095776#comment-15095776
 ] 

Antonio Piccolboni commented on SPARK-6817:
---

My question made sense only with respect to the block or vectorized design. If 
you are implementing plain-vanilla UDFs in R, my question is meaningless. The 
performance implications of calling an R function for each row are ominous, so I 
am not sure why you are going down this path. Imagine you want to add a column 
with random numbers from a distribution. You can use a regular UDF on each row 
or a block UDF on a block of a million rows. That means a single R call vs. a 
million:

system.time(rnorm(10^6))
   user  system elapsed 
  0.089   0.002   0.092 
> z = rep_len(1, 10^6); system.time(sapply(z, rnorm))
   user  system elapsed 
  4.272   0.317   4.588 

That's about 45 times slower. Plus, R is chock-full of vectorized functions, and 
there are no built-in scalar types in R. So there are plenty of examples of 
block UDFs that one can write in R efficiently (no interpreter loops of any 
sort).

> DataFrame UDFs in R
> ---
>
> Key: SPARK-6817
> URL: https://issues.apache.org/jira/browse/SPARK-6817
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR, SQL
>Reporter: Shivaram Venkataraman
> Attachments: SparkR UDF Design Documentation v1.pdf
>
>
> This depends on some internal interface of Spark SQL, should be done after 
> merging into Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12795) Whole stage codegen

2016-01-12 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-12795:
---
Summary: Whole stage codegen  (was: Compile multiple operators into a single 
Java function to avoid the overhead of materializing rows and Scala iterators)

> Whole stage codegen
> ---
>
> Key: SPARK-12795
> URL: https://issues.apache.org/jira/browse/SPARK-12795
> Project: Spark
>  Issue Type: Epic
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Compile multiple operators into a single Java function to avoid the overhead 
> of materializing rows and Scala iterators



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12795) Compile multiple operators into a single Java function to avoid the overhead of materializing rows and Scala iterators

2016-01-12 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12795:
--

 Summary: Compile multiple operators into a single Java function to 
avoid the overhead of materializing rows and Scala iterators
 Key: SPARK-12795
 URL: https://issues.apache.org/jira/browse/SPARK-12795
 Project: Spark
  Issue Type: Epic
Reporter: Davies Liu
Assignee: Davies Liu


Compile multiple operators into a single Java function to avoid the overhead 
of materializing rows and Scala iterators



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6817) DataFrame UDFs in R

2016-01-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095721#comment-15095721
 ] 

Reynold Xin edited comment on SPARK-6817 at 1/13/16 6:57 AM:
-

[~sunrui]

Why are you focusing on a row-based API? I think a block oriented API in the 
original Google Docs makes more sense. I also don't want the UDF to depend on 
RRDD, because we are going to remove RRDD from Spark once the UDFs are 
implemented.




was (Author: rxin):
[~sunrui]

Why are you focusing on a row-based API? I think a block oriented API in the 
original Google Docs makes a lot more sense. I also don't want the UDF to 
depend on RRDD, because we are going to remove RRDD from Spark once the UDFs 
are implemented.



> DataFrame UDFs in R
> ---
>
> Key: SPARK-6817
> URL: https://issues.apache.org/jira/browse/SPARK-6817
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR, SQL
>Reporter: Shivaram Venkataraman
> Attachments: SparkR UDF Design Documentation v1.pdf
>
>
> This depends on some internal interface of Spark SQL, should be done after 
> merging into Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095721#comment-15095721
 ] 

Reynold Xin commented on SPARK-6817:


[~sunrui]

Why are you focusing on a row-based API? I think a block oriented API in the 
original Google Docs makes a lot more sense. I also don't want the UDF to 
depend on RRDD, because we are going to remove RRDD from Spark once the UDFs 
are implemented.



> DataFrame UDFs in R
> ---
>
> Key: SPARK-6817
> URL: https://issues.apache.org/jira/browse/SPARK-6817
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR, SQL
>Reporter: Shivaram Venkataraman
> Attachments: SparkR UDF Design Documentation v1.pdf
>
>
> This depends on some internal interface of Spark SQL, should be done after 
> merging into Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6817) DataFrame UDFs in R

2016-01-12 Thread Antonio Piccolboni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095776#comment-15095776
 ] 

Antonio Piccolboni edited comment on SPARK-6817 at 1/13/16 7:41 AM:


My question made sense only wrt the block or vectorized design. If you are 
implementing plain-vanilla UDFs in R, my question is meaningless. The 
performance implications of calling an R function for each row are ominous, so I 
am not sure why you are going down this path. Imagine you want to add a column 
with random numbers from a distribution. You can use a regular UDF on each row 
or a block UDF on a block of a million rows. That means a single R call vs a 
million.

system.time(rnorm(10^6))
   user  system elapsed 
  0.089   0.002   0.092 
> z = rep_len(1, 10^6); system.time(sapply(z, rnorm))
   user  system elapsed 
  4.272   0.317   4.588 

That's 45 times slower. Plus R is chock-full of vectorized functions. There are 
no built-in scalar types in R. So there are plenty of examples of block UDFs 
that one can write in R efficiently (no interpreter loops of any sort).


was (Author: piccolbo):
My question made sense only wrt the block or vectorized design. If you are 
implementing plain-vanilla UDFs in R, my question is meaningless. The 
performance implications of calling an R function for each row are ominous, so I 
am not sure why you are going down this path. Imagine you want to add a column 
with random numbers from a distribution. You can use a regular UDF on each row 
or a block UDF on a block of a million rows. That means a single R call vs a 
million.

system.time(rnorm(10^6))
   user  system elapsed 
  0.089   0.002   0.092 
> z = rep_len(1, 10^6); system.time(sapply(z, rnorm))
   user  system elapsed 
  4.272   0.317   4.588 

That's 45 times slower. Plus R is chock-full of vectorized functions. There are 
no built-in scalar types in R. So there are plenty of examples of block UDFs 
that one can write in R efficiently (no interpreter loops of any sort).

> DataFrame UDFs in R
> ---
>
> Key: SPARK-6817
> URL: https://issues.apache.org/jira/browse/SPARK-6817
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR, SQL
>Reporter: Shivaram Venkataraman
> Attachments: SparkR UDF Design Documentation v1.pdf
>
>
> This depends on some internal interface of Spark SQL, should be done after 
> merging into Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12770) Implement rules for branch elimination for CaseWhen in SimplifyConditionals

2016-01-12 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12770:

Description: 
There are a few things we can do:

1. If a branch's condition is a true literal, remove the CaseWhen and use the 
value from that branch.

2. If a branch's condition is a false or null literal, remove that branch.

3. If only the else branch is left, remove the CaseWhen and use the value from 
the else branch.



  was:
There are a few things we can do:

1. If a branch is a true literal, remove the CaseWhen and use the value from 
that branch.

2. If a branch is a literal that is false or null, remove that branch.

3. If only the else branch is left, remove the CaseWhen and use the value from 
the else branch.




> Implement rules for branch elimination for CaseWhen in SimplifyConditionals
> ---
>
> Key: SPARK-12770
> URL: https://issues.apache.org/jira/browse/SPARK-12770
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer, SQL
>Reporter: Reynold Xin
>
> There are a few things we can do:
> 1. If a branch's condition is a true literal, remove the CaseWhen and use the 
> value from that branch.
> 2. If a branch's condition is a false or null literal, remove that branch.
> 3. If only the else branch is left, remove the CaseWhen and use the value 
> from the else branch.
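
To make the three rules concrete, here is a minimal Scala sketch over simplified stand-in types (illustrative only, not the actual SimplifyConditionals code; rule 1 is shown only for a leading true branch):

{code}
// Simplified stand-ins for Catalyst expressions (illustration only).
sealed trait Expr
case class Literal(value: Any) extends Expr
case class CaseWhen(branches: Seq[(Expr, Expr)], elseValue: Option[Expr]) extends Expr

def simplifyCaseWhen(cw: CaseWhen): Expr = {
  // Rule 2: drop branches whose condition is a false or null literal.
  val kept = cw.branches.filterNot {
    case (cond, _) => cond == Literal(false) || cond == Literal(null)
  }
  kept.headOption match {
    // Rule 1 (leading-true case): a true-literal first condition means the
    // whole CaseWhen collapses to that branch's value.
    case Some((Literal(true), value)) => value
    // Rule 3: no branches left, only the else value remains.
    case None => cw.elseValue.getOrElse(Literal(null))
    // Otherwise keep the surviving branches.
    case _ => CaseWhen(kept, cw.elseValue)
  }
}
{code}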



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources

2016-01-12 Thread Santiago M. Mola (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093533#comment-15093533
 ] 

Santiago M. Mola commented on SPARK-12449:
--

Implementing this interface or an equivalent one would help standardize a lot of 
advanced features that data sources have been implementing for some time. And while 
doing so, it would prevent them from creating their own SQLContext variants or 
patching the running SQLContext at runtime (using extraStrategies).

Here's a list of data sources that are currently using this approach. It would also be 
good to take them into account for this JIRA. The proposed interface and 
strategy should probably support all of these use cases. Some of them also use 
their own catalog implementation, but that should be something for a separate 
JIRA.

*spark-sql-on-hbase*

Already mentioned by [~yzhou2001]. They are using HBaseContext with 
extraStrategies that inject HBaseStrategies doing aggregation push down:
https://github.com/Huawei-Spark/Spark-SQL-on-HBase/blob/master/src/main/scala/org/apache/spark/sql/hbase/execution/HBaseStrategies.scala

*memsql-spark-connector*

They either offer their own SQLContext or inject their MemSQL-specific push down 
strategy at runtime.
They do match Catalyst's LogicalPlan in the same way we're proposing to push 
down filters, projects, aggregates, limits, sorts and joins:
https://github.com/memsql/memsql-spark-connector/blob/master/connectorLib/src/main/scala/com/memsql/spark/pushdown/MemSQLPushdownStrategy.scala

*spark-iqmulus*

Strategy injected to push down counts and some aggregates:

https://github.com/IGNF/spark-iqmulus/blob/master/src/main/scala/fr/ign/spark/iqmulus/ExtraStrategies.scala

*druid-olap*

They use SparkPlanner, Strategy and LogicalPlan APIs to do extensive push down. 
Their API usage could be limited to LogicalPlan only if this JIRA is 
implemented:

https://github.com/SparklineData/spark-druid-olap/blob/master/src/main/scala/org/apache/spark/sql/sources/druid/

*magellan* _(probably out of scope)_

Does its own BroadcastJoin, although it seems to me that this usage would be 
out of scope for us.

https://github.com/harsha2010/magellan/blob/master/src/main/scala/magellan/execution/MagellanStrategies.scala
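
For reference, a minimal sketch of the extraStrategies injection pattern these connectors use today (the strategy name and the push-down logic are placeholders, and this is the existing 1.x API, not a proposal for the new interface):

{code}
import org.apache.spark.sql.{SQLContext, Strategy}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// Hypothetical strategy: inspect the logical plan and, when the data source can
// evaluate a subtree itself, return a custom physical plan; otherwise return Nil
// so Spark's built-in strategies plan it as usual.
object MyPushDownStrategy extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    // e.g. case Aggregate(...) over a supported relation => Seq(MyPushedDownExec(...))
    case _ => Nil
  }
}

// Runtime injection, which is what the connectors listed above currently do:
def install(sqlContext: SQLContext): Unit = {
  sqlContext.experimental.extraStrategies =
    MyPushDownStrategy +: sqlContext.experimental.extraStrategies
}
{code}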

> Pushing down arbitrary logical plans to data sources
> 
>
> Key: SPARK-12449
> URL: https://issues.apache.org/jira/browse/SPARK-12449
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Stephan Kessler
> Attachments: pushingDownLogicalPlans.pdf
>
>
> With the help of the DataSource API we can pull data from external sources 
> for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows 
> to push down filters and projects pruning unnecessary fields and rows 
> directly in the data source.
> However, data sources such as SQL Engines are capable of doing even more 
> preprocessing, e.g., evaluating aggregates. This is beneficial because it 
> would reduce the amount of data transferred from the source to Spark. The 
> existing interfaces do not allow such kind of processing in the source.
> We would propose to add a new interface {{CatalystSource}} that allows to 
> defer the processing of arbitrary logical plans to the data source. We have 
> already shown the details at the Spark Summit 2015 Europe 
> [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/]
> I will add a design document explaining details. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12771) Improve code generation for CaseWhen

2016-01-12 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-12771:
---

 Summary: Improve code generation for CaseWhen
 Key: SPARK-12771
 URL: https://issues.apache.org/jira/browse/SPARK-12771
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin


The generated code for CaseWhen uses a control variable "got" to make sure we 
do not evaluate more branches once a branch is true. Changing that to generate 
a simple "if / else" chain would be slightly more efficient.
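
Roughly, the difference between the two shapes, sketched in Scala since the real output is Janino-compiled Java:

{code}
// Current shape (sketch): a "got" control variable guards every branch.
def caseWhenWithFlag(cond1: Boolean, cond2: Boolean, v1: Int, v2: Int, elseV: Int): Int = {
  var got = false
  var result = 0
  if (!got && cond1) { result = v1; got = true }
  if (!got && cond2) { result = v2; got = true }
  if (!got) { result = elseV }
  result
}

// Proposed shape (sketch): the same semantics as a plain if / else chain,
// with no flag and no repeated "got" checks.
def caseWhenIfElse(cond1: Boolean, cond2: Boolean, v1: Int, v2: Int, elseV: Int): Int =
  if (cond1) v1 else if (cond2) v2 else elseV
{code}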



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory

2016-01-12 Thread Konstantin Shaposhnikov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093511#comment-15093511
 ] 

Konstantin Shaposhnikov commented on SPARK-2984:


I am seeing the same error message with Spark 1.6 and HDFS. This happens after 
an earlier job failure (ClassCastException).

> FileNotFoundException on _temporary directory
> -
>
> Key: SPARK-2984
> URL: https://issues.apache.org/jira/browse/SPARK-2984
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Ash
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.3.0
>
>
> We've seen several stacktraces and threads on the user mailing list where 
> people are having issues with a {{FileNotFoundException}} stemming from an 
> HDFS path containing {{_temporary}}.
> I ([~aash]) think this may be related to {{spark.speculation}}.  I think the 
> error condition might manifest in this circumstance:
> 1) task T starts on a executor E1
> 2) it takes a long time, so task T' is started on another executor E2
> 3) T finishes in E1 so moves its data from {{_temporary}} to the final 
> destination and deletes the {{_temporary}} directory during cleanup
> 4) T' finishes in E2 and attempts to move its data from {{_temporary}}, but 
> those files no longer exist!  exception
> Some samples:
> {noformat}
> 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job 
> 140774430 ms.0
> java.io.FileNotFoundException: File 
> hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07
>  does not exist.
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
> at 
> org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136)
> at 
> org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643)
> at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at scala.util.Try$.apply(Try.scala:161)
> at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
> at 
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> -- Chen Song at 
> http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFiles-file-not-found-exception-td10686.html
> {noformat}
> I am running a Spark Streaming job that uses saveAsTextFiles to save results 
> into hdfs files. However, it has an exception after 20 batches
> result-140631234/_temporary/0/task_201407251119__m_03 does not 
> exist.
> {noformat}
> and
> {noformat}
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>  No lease on /apps/data/vddil/real-time/checkpoint/temp: File does not exist. 
> Holder DFSClient_NONMAPREDUCE_327993456_13 does not have any open files.
>   at 
> 

[jira] [Created] (SPARK-12772) Better error message for parsing failure?

2016-01-12 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-12772:
---

 Summary: Better error message for parsing failure?
 Key: SPARK-12772
 URL: https://issues.apache.org/jira/browse/SPARK-12772
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin


{code}
scala> sql("select case if(true, 'one', 'two')").explain(true)
org.apache.spark.sql.AnalysisException: org.antlr.runtime.EarlyExitException
line 1:34 required (...)+ loop did not match anything at input '' in case 
expression
; line 1 pos 34

at 
org.apache.spark.sql.catalyst.parser.ParseErrorReporter.throwError(ParseDriver.scala:140)
at 
org.apache.spark.sql.catalyst.parser.ParseErrorReporter.throwError(ParseDriver.scala:129)
at 
org.apache.spark.sql.catalyst.parser.ParseDriver$.parse(ParseDriver.scala:77)
at 
org.apache.spark.sql.catalyst.CatalystQl.createPlan(CatalystQl.scala:53)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)

{code}

Is there a way to say something better other than "required (...)+ loop did not 
match anything at input"?




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12689) Migrate DDL parsing to the newly absorbed parser

2016-01-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12689:


Assignee: (was: Apache Spark)

> Migrate DDL parsing to the newly absorbed parser
> 
>
> Key: SPARK-12689
> URL: https://issues.apache.org/jira/browse/SPARK-12689
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12770) Implement rules for branch elimination for CaseWhen in SimplifyConditionals

2016-01-12 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12770:

Summary: Implement rules for branch elimination for CaseWhen in 
SimplifyConditionals  (was: Implement rules for removing unnecessary branches 
for CaseWhen in SimplifyConditionals)

> Implement rules for branch elimination for CaseWhen in SimplifyConditionals
> ---
>
> Key: SPARK-12770
> URL: https://issues.apache.org/jira/browse/SPARK-12770
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer, SQL
>Reporter: Reynold Xin
>
> There are a few things we can do:
> 1. If a branch is a true literal, remove the CaseWhen and use the value from 
> that branch.
> 2. If a branch is a literal that is false or null, remove that branch.
> 3. If only the else branch is left, remove the CaseWhen and use the value 
> from the else branch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12770) Implement rules for removing unnecessary branches for CaseWhen in SimplifyConditionals

2016-01-12 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12770:

Description: 
There are a few things we can do:

1. If a branch is a true literal, remove the CaseWhen and use the value from 
that branch.

2. If a branch is a literal that is false or null, remove that branch.

3. If only the else branch is left, remove the CaseWhen and use the value from 
the else branch.



> Implement rules for removing unnecessary branches for CaseWhen in 
> SimplifyConditionals
> --
>
> Key: SPARK-12770
> URL: https://issues.apache.org/jira/browse/SPARK-12770
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer, SQL
>Reporter: Reynold Xin
>
> There are a few things we can do:
> 1. If a branch is a true literal, remove the CaseWhen and use the value from 
> that branch.
> 2. If a branch is a literal that is false or null, remove that branch.
> 3. If only the else branch is left, remove the CaseWhen and use the value 
> from the else branch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12768) Remove CaseKeyWhen expression

2016-01-12 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12768:

Summary: Remove CaseKeyWhen expression  (was: Remove CaseKeyWhen)

> Remove CaseKeyWhen expression
> -
>
> Key: SPARK-12768
> URL: https://issues.apache.org/jira/browse/SPARK-12768
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> CaseKeyWhen was added to improve the performance of "case a when ..." when we 
> did not have common subexpression elimination. We now have that so we can 
> remove CaseKeyWhen.
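
For illustration, the rewrite that makes CaseKeyWhen redundant, sketched with simplified stand-in types rather than the real Catalyst classes: each WHEN key becomes an equality test against the key expression, and common subexpression elimination keeps the key from being re-evaluated per branch.

{code}
// Simplified stand-ins for Catalyst expressions (illustration only).
sealed trait Expr
case class Attr(name: String) extends Expr
case class Lit(value: Any) extends Expr
case class EqualTo(left: Expr, right: Expr) extends Expr
case class CaseWhen(branches: Seq[(Expr, Expr)], elseValue: Option[Expr]) extends Expr

// "CASE key WHEN k1 THEN v1 ... ELSE e END" expressed as an ordinary CaseWhen.
def caseKeyWhen(key: Expr, keyValuePairs: Seq[(Expr, Expr)], elseValue: Option[Expr]): CaseWhen =
  CaseWhen(keyValuePairs.map { case (k, v) => (EqualTo(key, k), v) }, elseValue)

// e.g. CASE a WHEN 1 THEN 'one' WHEN 2 THEN 'two' ELSE 'other' END
val rewritten = caseKeyWhen(
  Attr("a"),
  Seq(Lit(1) -> Lit("one"), Lit(2) -> Lit("two")),
  Some(Lit("other")))
{code}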



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12762) Add unit test for simplifying if expression

2016-01-12 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12762:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-12767

> Add unit test for simplifying if expression
> ---
>
> Key: SPARK-12762
> URL: https://issues.apache.org/jira/browse/SPARK-12762
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer, SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12773) Impurity and Sample details for each node of a decision tree

2016-01-12 Thread Rahul Tanwani (JIRA)
Rahul Tanwani created SPARK-12773:
-

 Summary: Impurity and Sample details for each node of a decision 
tree
 Key: SPARK-12773
 URL: https://issues.apache.org/jira/browse/SPARK-12773
 Project: Spark
  Issue Type: Question
  Components: ML, MLlib
Affects Versions: 1.5.2
Reporter: Rahul Tanwani


I just want to understand whether each node in the decision tree calculates / stores 
information about the number of samples that satisfy the split criteria. Looking at 
the code, I find some information about the impurity statistics but did not 
find anything on the samples. scikit-learn exposes both of these metrics. The 
information may help in cases where there are multiple decision rules 
(multiple leaf nodes) yielding the same prediction and we want to do some 
relative comparisons of decision paths.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12774) DataFrame.mapPartitions apply function operates on Pandas DataFrame instead of a generator of rows

2016-01-12 Thread Josh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh updated SPARK-12774:
-
Description: 
Currently DataFrame.mapPartitions is analogous to DataFrame.rdd.mapPartitions in 
both Spark and PySpark. The function _f_ that is applied to each partition must 
operate on a generator of rows. This is, however, very inefficient in Python. It 
would be more logical and efficient if the apply function _f_ operated on 
Pandas DataFrames instead and also returned a DataFrame. This avoids 
unnecessary iteration in Python, which is slow.

Currently:
{code}
def apply_function(rows):
    df = pd.DataFrame(list(rows))
    df = df % 100   # Do something on df
    return df.values.tolist()

table = sqlContext.read.parquet("")
table = table.mapPartitions(apply_function)
{code}

New apply function would accept a Pandas DataFrame and return a DataFrame:
{code}
def apply_function(df):
    df = df % 100   # Do something on df
    return df
{code}

  was:
Currently DataFrame.mapPartitions is analogous to DataFrame.rdd.mapPartitions in 
both Spark and PySpark. The function _f_ that is applied to each partition must 
operate on a generator of rows. This is, however, very inefficient in Python. It 
would be more logical and efficient if the apply function _f_ operated on 
Pandas DataFrames instead and also returned a DataFrame. This avoids 
unnecessary iteration in Python, which is slow.

Currently:
{code:python}
def apply_function(rows):
    df = pd.DataFrame(list(rows))
    df = df % 100   # Do something on df
    return df.values.tolist()

table = sqlContext.read.parquet("")
table = table.mapPartitions(apply_function)
{code}

New apply function would accept a Pandas DataFrame and return a DataFrame:
{code:python}
def apply_function(df):
    df = df % 100   # Do something on df
    return df
{code}


> DataFrame.mapPartitions apply function operates on Pandas DataFrame instead 
> of a generator of rows
> --
>
> Key: SPARK-12774
> URL: https://issues.apache.org/jira/browse/SPARK-12774
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Josh
>  Labels: dataframe, pandas
>
> Currently DataFrame.mapPartitions is analogous to DataFrame.rdd.mapPartitions 
> in both Spark and PySpark. The function _f_ that is applied to each partition 
> must operate on a generator of rows. This is, however, very inefficient in Python. 
> It would be more logical and efficient if the apply function _f_ operated on 
> Pandas DataFrames instead and also returned a DataFrame. This avoids 
> unnecessary iteration in Python, which is slow.
> Currently:
> {code}
> def apply_function(rows):
>     df = pd.DataFrame(list(rows))
>     df = df % 100   # Do something on df
>     return df.values.tolist()
> table = sqlContext.read.parquet("")
> table = table.mapPartitions(apply_function)
> {code}
> New apply function would accept a Pandas DataFrame and return a DataFrame:
> {code}
> def apply_function(df):
>     df = df % 100   # Do something on df
>     return df
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12774) DataFrame.mapPartitions apply function operates on Pandas DataFrame instead of a generator of rows

2016-01-12 Thread Josh (JIRA)
Josh created SPARK-12774:


 Summary: DataFrame.mapPartitions apply function operates on Pandas 
DataFrame instead of a generator of rows
 Key: SPARK-12774
 URL: https://issues.apache.org/jira/browse/SPARK-12774
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Josh


Currently DataFrame.mapPartitions is analogous to DataFrame.rdd.mapPartitions in 
both Spark and PySpark. The function _f_ that is applied to each partition must 
operate on a generator of rows. This is, however, very inefficient in Python. It 
would be more logical and efficient if the apply function _f_ operated on 
Pandas DataFrames instead and also returned a DataFrame. This avoids 
unnecessary iteration in Python, which is slow.

Currently:
{code:python}
def apply_function(rows):
    df = pd.DataFrame(list(rows))
    df = df % 100   # Do something on df
    return df.values.tolist()

table = sqlContext.read.parquet("")
table = table.mapPartitions(apply_function)
{code}

New apply function would accept a Pandas DataFrame and return a DataFrame:
{code:python}
def apply_function(df):
    df = df % 100   # Do something on df
    return df
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12689) Migrate DDL parsing to the newly absorbed parser

2016-01-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12689:


Assignee: Apache Spark

> Migrate DDL parsing to the newly absorbed parser
> 
>
> Key: SPARK-12689
> URL: https://issues.apache.org/jira/browse/SPARK-12689
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12689) Migrate DDL parsing to the newly absorbed parser

2016-01-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093672#comment-15093672
 ] 

Apache Spark commented on SPARK-12689:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/10723

> Migrate DDL parsing to the newly absorbed parser
> 
>
> Key: SPARK-12689
> URL: https://issues.apache.org/jira/browse/SPARK-12689
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12769) Remove If expression

2016-01-12 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12769:

Description: 
If can be a simple factory method for CaseWhen, similar to CaseKeyWhen.

We can then simplify the optimizer rules we implement for conditional 
expressions.


  was:
If can be a simple factory method for CaseWhen. We can then simplify the 
optimizer rules we implement for conditional expressions.



> Remove If expression
> 
>
> Key: SPARK-12769
> URL: https://issues.apache.org/jira/browse/SPARK-12769
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> If can be a simple factory method for CaseWhen, similar to CaseKeyWhen.
> We can then simplify the optimizer rules we implement for conditional 
> expressions.
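
A minimal sketch of the factory-method idea, using simplified stand-in types rather than the real Catalyst expressions:

{code}
// Simplified stand-ins for Catalyst expressions (illustration only).
sealed trait Expr
case class Literal(value: Any) extends Expr
case class CaseWhen(branches: Seq[(Expr, Expr)], elseValue: Option[Expr]) extends Expr

// `If(cond, t, f)` becomes a factory for a single-branch CaseWhen, so the
// optimizer only needs conditional rules for CaseWhen.
object If {
  def apply(predicate: Expr, trueValue: Expr, falseValue: Expr): Expr =
    CaseWhen(Seq(predicate -> trueValue), Some(falseValue))
}

// IF(cond, 'big', 'small') and CASE WHEN cond THEN 'big' ELSE 'small' END
// now share one representation:
val asCaseWhen = If(Literal(true), Literal("big"), Literal("small"))
{code}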



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12768) Remove CaseKeyWhen expression

2016-01-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12768:


Assignee: Apache Spark  (was: Reynold Xin)

> Remove CaseKeyWhen expression
> -
>
> Key: SPARK-12768
> URL: https://issues.apache.org/jira/browse/SPARK-12768
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> CaseKeyWhen was added to improve the performance of "case a when ..." when we 
> did not have common subexpression elimination. We now have that so we can 
> remove CaseKeyWhen.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12768) Remove CaseKeyWhen expression

2016-01-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12768:


Assignee: Reynold Xin  (was: Apache Spark)

> Remove CaseKeyWhen expression
> -
>
> Key: SPARK-12768
> URL: https://issues.apache.org/jira/browse/SPARK-12768
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> CaseKeyWhen was added to improve the performance of "case a when ..." when we 
> did not have common subexpression elimination. We now have that so we can 
> remove CaseKeyWhen.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12768) Remove CaseKeyWhen expression

2016-01-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093506#comment-15093506
 ] 

Apache Spark commented on SPARK-12768:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/10722

> Remove CaseKeyWhen expression
> -
>
> Key: SPARK-12768
> URL: https://issues.apache.org/jira/browse/SPARK-12768
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> CaseKeyWhen was added to improve the performance of "case a when ..." when we 
> did not have common subexpression elimination. We now have that so we can 
> remove CaseKeyWhen.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12760) inaccurate description for difference between local vs cluster mode in closure handling

2016-01-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12760:
--
  Priority: Minor  (was: Trivial)
Issue Type: Bug  (was: Question)
   Summary: inaccurate description for difference between local vs cluster 
mode in closure handling  (was: inaccurate description for difference between 
local vs cluster mode )

I think the example needs an update, but not for this reason. There's no 
separate "memory space" in local mode. It's one JVM. However, it's undefined 
whether the copy of {{counter}} is the same or different in this case. 
Actually, I find that a copy is serialized with the closure at this point, so the 
result is still 0.

I think the explanation should be changed to say the result is undefined here, 
and could be 0 or not, and explain why. Do you want to try a PR?
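
For context, a condensed, self-contained version of the programming-guide example under discussion (the app name and master setting are arbitrary):

{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("closure-example").setMaster("local[*]"))

var counter = 0
// The closure operates on a captured copy of `counter`; whether that copy is
// the driver's variable is undefined, even though local mode is a single JVM.
sc.parallelize(1 to 100).foreach(x => counter += x)

println(s"Counter value: $counter")  // may print 0 or 5050; do not rely on it
sc.stop()
{code}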

> inaccurate description for difference between local vs cluster mode in 
> closure handling
> ---
>
> Key: SPARK-12760
> URL: https://issues.apache.org/jira/browse/SPARK-12760
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Mortada Mehyar
>Priority: Minor
>
> In the spark documentation there's an example for illustrating how `local` 
> and `cluster` mode can differ 
> http://spark.apache.org/docs/latest/programming-guide.html#example
> " In local mode with a single JVM, the above code will sum the values within 
> the RDD and store it in counter. This is because both the RDD and the 
> variable counter are in the same memory space on the driver node." 
> However the above doesn't seem to be true. Even in `local` mode it seems like 
> the counter value should still be 0, because the variable will be summed up 
> in the executor memory space, but the final value in the driver memory space 
> is still 0. I tested this snippet and verified that in `local` mode the value 
> is indeed still 0. 
> Is the doc wrong or perhaps I'm missing something the doc is trying to say? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12766) Unshaded google guava classes in spark-network-common jar

2016-01-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12766.
---
Resolution: Not A Problem

This is on purpose. Some Guava classes are used in the public Java API 
(unfortunately). This was rectified for Spark 2.x but you will find 
{{Optional}} and some dependent classes unshaded in 1.x.

> Unshaded google guava classes in spark-network-common jar
> -
>
> Key: SPARK-12766
> URL: https://issues.apache.org/jira/browse/SPARK-12766
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.5.1
>Reporter: Jake Yoon
>Priority: Minor
>  Labels: build, sbt
>
> I found unshaded Google Guava classes used internally in 
> spark-network-common while working with Elasticsearch.
> The following link discusses the duplicate-dependency conflict caused by Guava 
> classes and how I solved the build conflict issue:
> https://discuss.elastic.co/t/exception-when-using-elasticsearch-spark-and-elasticsearch-core-together/38471/4



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12764) XML Column type is not supported

2016-01-12 Thread Ewan Leith (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093836#comment-15093836
 ] 

Ewan Leith edited comment on SPARK-12764 at 1/12/16 12:53 PM:
--

What are you expecting it to do, output the XML as a string, or something else?

I doubt this will work, but you might try adding this code before the initial 
references to JDBC:

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types.{DataType, MetadataBuilder, StringType}

// Custom dialect that maps PostgreSQL's xml column type to a Spark StringType
// instead of failing with "Unsupported type".
case object PostgresDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:postgresql")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
    if (typeName.contains("xml")) Some(StringType) else None
  }
}

JdbcDialects.registerDialect(PostgresDialect)



was (Author: ewanleith):
What are you expecting it to do, output the XML as a string, or something else?

> XML Column type is not supported
> 
>
> Key: SPARK-12764
> URL: https://issues.apache.org/jira/browse/SPARK-12764
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.6.0
> Environment: Mac Os X El Capitan
>Reporter: Rajeshwar Gaini
>
> Hi All,
> I am using a PostgreSQL database. I am using the following JDBC call to access 
> a customer table (customer_id int, event text, country text, content xml) in 
> my database.
> {code}
> val dataframe1 = sqlContext.load("jdbc", Map("url" -> 
> "jdbc:postgresql://localhost/customerlogs?user=postgres=postgres", 
> "dbtable" -> "customer"))
> {code}
> When I run the above command in spark-shell I receive the following error.
> {code}
> java.sql.SQLException: Unsupported type 
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.org$apache$spark$sql$execution$datasources$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:103)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:139)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.(JDBCRelation.scala:91)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:60)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>   at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1153)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:25)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:30)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:32)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.(:34)
>   at $iwC$$iwC$$iwC$$iwC.(:36)
>   at $iwC$$iwC$$iwC.(:38)
>   at $iwC$$iwC.(:40)
>   at $iwC.(:42)
>   at (:44)
>   at .(:48)
>   at .()
>   at .(:7)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
>   at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
>   at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
>   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
>   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
>   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
>   at 
> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>   at 
> 

[jira] [Commented] (SPARK-12775) Couldn't find leader offsets exception when hostname can't be resolved

2016-01-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093858#comment-15093858
 ] 

Sean Owen commented on SPARK-12775:
---

Hm, I don't think that's a spark problem though.

> Couldn't find leader offsets exception when hostname can't be resolved
> --
>
> Key: SPARK-12775
> URL: https://issues.apache.org/jira/browse/SPARK-12775
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Sebastian Piu
>Priority: Minor
>
> When hostname resolution fails for a broker an unclear/misleading error is 
> shown:
> org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
> org.apache.spark.SparkException: Couldn't find leader offsets for 
> Set([mytopic,0], [mytopic,18], [mytopic,12], [mytopic,6])
>   at 
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
>   at 
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
> The error above occurred when a broker was added to the cluster and my machine 
> could not resolve its hostname.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12582) IndexShuffleBlockResolverSuite fails in windows

2016-01-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12582.
---
   Resolution: Fixed
Fix Version/s: 1.6.1
   2.0.0

Issue resolved by pull request 10526
[https://github.com/apache/spark/pull/10526]

> IndexShuffleBlockResolverSuite fails in windows
> ---
>
> Key: SPARK-12582
> URL: https://issues.apache.org/jira/browse/SPARK-12582
> Project: Spark
>  Issue Type: Bug
>  Components: Tests, Windows
>Reporter: yucai
>Assignee: yucai
> Fix For: 2.0.0, 1.6.1
>
>
> IndexShuffleBlockResolverSuite fails in my windows develop machine.
> {code}
> [info] IndexShuffleBlockResolverSuite:
> [info] - commit shuffle files multiple times *** FAILED *** (388 milliseconds)
> [info]   Array(10, 0, 20) equaled Array(10, 0, 20) 
> (IndexShuffleBlockResolverSuite.scala:108)
> [info]   org.scalatest.exceptions.TestFailedException:
> .
> .
> [info] Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.shuffle.sort.IndexShuffleB
> lockResolverSuite *** ABORTED *** (2 seconds, 234 milliseconds)
> [info]   java.io.IOException: Failed to delete: 
> C:\Users\yyu29\Documents\codes.next\spark\target\tmp\spark-0e81a15a-e712
> -4b1c-a089-f421db149e65
> [info]   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:940)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 60)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:205)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 36)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:220)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 36)
> {code}
> The root cause is that when "afterEach" wants to clean up data, some files are still 
> open. For example:
> {code}
> // The dataFile should be the previous one
> val in = new FileInputStream(dataFile)
> val firstByte = new Array[Byte](1)
> in.read(firstByte)
> assert(firstByte(0) === 0)
> {code}
> The stream lacks an "in.close()". 
> On Linux this is not a problem, since you can still delete a file even if it is open, 
> but it does not work on Windows, which will report "resource is busy".
> Another issue is that IndexShuffleBlockResolverSuite.scala is a Scala file 
> but is placed in "test/java".
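
A sketch of the kind of fix the description implies (close the stream in a finally block so the temporary directory can also be deleted on Windows):

{code}
import java.io.{File, FileInputStream}

// Read the first byte of the previous data file, always closing the stream.
def readFirstByte(dataFile: File): Byte = {
  val in = new FileInputStream(dataFile)
  try {
    val firstByte = new Array[Byte](1)
    in.read(firstByte)
    firstByte(0)
  } finally {
    in.close()
  }
}
{code}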



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7615) MLLIB Word2Vec wordVectors divided by Euclidean Norm equals to zero

2016-01-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7615.
--
   Resolution: Fixed
 Assignee: Sean Owen
Fix Version/s: 2.0.0
   1.6.1

Resolved by https://github.com/apache/spark/pull/10696

> MLLIB Word2Vec wordVectors divided by Euclidean Norm equals to zero 
> 
>
> Key: SPARK-7615
> URL: https://issues.apache.org/jira/browse/SPARK-7615
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.1
>Reporter: Eric Li
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.6.1, 2.0.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> In Word2VecModel, wordVecNorms may contain Euclidean norms equal to zero. 
> This will cause an incorrect cosine distance calculation when you do 
> cosineVec(ind) / wordVecNorms(ind). The cosine distance should be equal to 0 for 
> norm = 0. 
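
A minimal sketch of the guard described above (illustrative only, not the actual MLlib patch):

{code}
// Treat a zero-norm word vector as having cosine score 0 instead of dividing
// by zero and producing NaN or Infinity.
def safeCosine(dot: Double, norm: Double): Double =
  if (norm == 0.0) 0.0 else dot / norm

// e.g. safeCosine(0.0, 0.0) == 0.0, safeCosine(2.0, 4.0) == 0.5
{code}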



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12773) Impurity and Sample details for each node of a decision tree

2016-01-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12773.
---
  Resolution: Invalid
Target Version/s:   (was: 1.5.2)

Please ask questions at u...@spark.apache.org

> Impurity and Sample details for each node of a decision tree
> 
>
> Key: SPARK-12773
> URL: https://issues.apache.org/jira/browse/SPARK-12773
> Project: Spark
>  Issue Type: Question
>  Components: ML, MLlib
>Affects Versions: 1.5.2
>Reporter: Rahul Tanwani
>
> I just want to understand whether each node in the decision tree calculates / 
> stores information about the number of samples that satisfy the split criteria. 
> Looking at the code, I find some information about the impurity statistics 
> but did not find anything on the samples. scikit-learn exposes both of these 
> metrics. The information may help in cases where there are multiple 
> decision rules (multiple leaf nodes) yielding the same prediction and we want 
> to do some relative comparisons of decision paths.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12759) Spark should fail fast if --executor-memory is too small for spark to start

2016-01-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12759:
--
Component/s: Spark Submit
 Spark Core

> Spark should fail fast if --executor-memory is too small for spark to start
> ---
>
> Key: SPARK-12759
> URL: https://issues.apache.org/jira/browse/SPARK-12759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Submit
>Affects Versions: 1.6.0
>Reporter: Imran Rashid
>
> With the UnifiedMemoryManager, the minimum memory for executor and driver 
> JVMs was increased to 450MB.  There is code in {{UnifiedMemoryManager}} to 
> provide a helpful warning if less than that much memory is provided.
> However, if you set {{--executor-memory}} to something less than that, from 
> the driver process you just see executor failures with no warning, since the 
> more meaningful errors are buried in the executor logs. E.g., on YARN, you see
> {noformat}
> 16/01/11 13:59:32 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: 
> Container marked as failed: container_1452548703600_0001_01_02 on host: 
> imran-adhoc-2.vpc.cloudera.com. Exit status: 1. Diagnostics: Exception from 
> container-launch.
> Container id: container_1452548703600_0001_01_02
> Exit code: 1
> Stack trace: ExitCodeException exitCode=1: 
>   at org.apache.hadoop.util.Shell.runCommand(Shell.java:561)
>   at org.apache.hadoop.util.Shell.run(Shell.java:478)
>   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:210)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Container exited with a non-zero exit code 1
> {noformat}
> Though there is already a message from {{UnifiedMemoryManager}} if there 
> isn't enough memory for the driver, as long as this is being changed it would 
> be nice if the message more clearly indicated the {{--driver-memory}} 
> configuration as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12763) Spark gets stuck executing SSB query

2016-01-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12763:
--
Component/s: SQL

> Spark gets stuck executing SSB query
> 
>
> Key: SPARK-12763
> URL: https://issues.apache.org/jira/browse/SPARK-12763
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Standalone cluster
>Reporter: Vadim Tkachenko
> Attachments: Spark shell - Details for Stage 5 (Attempt 0).pdf
>
>
> I am trying to emulate an SSB load. Data was generated with 
> https://github.com/Percona-Lab/ssb-dbgen
> at scale factor 1000 and converted to Parquet format.
> Now there is the following script:
> val pLineOrder = 
> sqlContext.read.parquet("/mnt/i3600/spark/ssb-1000/lineorder").cache()
> val pDate = sqlContext.read.parquet("/mnt/i3600/spark/ssb-1000/date").cache()
> val pPart = sqlContext.read.parquet("/mnt/i3600/spark/ssb-1000/part").cache()
> val pSupplier = 
> sqlContext.read.parquet("/mnt/i3600/spark/ssb-1000/supplier").cache()
> val pCustomer = 
> sqlContext.read.parquet("/mnt/i3600/spark/ssb-1000/customer").cache()
> pLineOrder.registerTempTable("lineorder")
> pDate.registerTempTable("date")
> pPart.registerTempTable("part")
> pSupplier.registerTempTable("supplier")
> pCustomer.registerTempTable("customer")
> The query
> val sql41 = sqlContext.sql("select D_YEAR, C_NATION, sum(LO_REVENUE - 
> LO_SUPPLYCOST) as profit from date, customer, supplier, part, lineorder 
> where LO_CUSTKEY = C_CUSTKEY and LO_SUPPKEY = S_SUPPKEY and 
> LO_PARTKEY = P_PARTKEY and LO_ORDERDATE = D_DATEKEY and C_REGION = 
> 'AMERICA' and S_REGION = 'AMERICA' and (P_MFGR = 'MFGR#1' or P_MFGR = 
> 'MFGR#2') group by D_YEAR, C_NATION order by D_YEAR, C_NATION")
> and
> sql41.show()
> get stuck: at some point there is no progress and the server is fully idle, but 
> the job stays at the same stage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2516) Bootstrapping

2016-01-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2516.
--
Resolution: Won't Fix

This is the only one left under this umbrella; I assume it's stale, or really 
better implemented in ML. Feel free to reopen as a stand-alone task but I 
didn't see any activity on this.

> Bootstrapping
> -
>
> Key: SPARK-2516
> URL: https://issues.apache.org/jira/browse/SPARK-2516
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Yu Ishikawa
>
> Support re-sampling and bootstrap estimators in MLlib.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3669) Extract IndexedRDD interface

2016-01-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3669.
--
Resolution: Won't Fix

Resolved for now per parent discussion

> Extract IndexedRDD interface
> 
>
> Key: SPARK-3669
> URL: https://issues.apache.org/jira/browse/SPARK-3669
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3668) Support for arbitrary key types in IndexedRDD

2016-01-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3668.
--
Resolution: Won't Fix

Resolved for now per parent discussion

> Support for arbitrary key types in IndexedRDD
> -
>
> Key: SPARK-3668
> URL: https://issues.apache.org/jira/browse/SPARK-3668
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4043) Add a flag for stopping threads of cancelled tasks if Thread.interrupt doesn't kill them

2016-01-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4043.
--
Resolution: Won't Fix

I think this never went anywhere specific, so closing it

> Add a flag for stopping threads of cancelled tasks if Thread.interrupt 
> doesn't kill them
> 
>
> Key: SPARK-4043
> URL: https://issues.apache.org/jira/browse/SPARK-4043
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>
> While killing user code with Thread.stop can be risky, we might want to do it 
> for things like long-running SQL servers, where users have to be able to 
> cancel a query even if it is spinning on the CPU and they know the code 
> involved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3818) Graph coarsening

2016-01-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3818.
--
Resolution: Won't Fix

> Graph coarsening
> 
>
> Key: SPARK-3818
> URL: https://issues.apache.org/jira/browse/SPARK-3818
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>
> Listing 7 in the [GraphX OSDI 
> paper|http://ankurdave.com/dl/graphx-osdi14.pdf] contains pseudocode for a 
> coarsening operator that allows merging edges that satisfy an edge predicate, 
> collapsing vertices connected by merged edges. GraphX should provide an 
> implementation of this operator.
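
The operator itself was never merged; the following is only a rough sketch of the idea expressed with existing GraphX primitives (the merge function and the handling of parallel edges are simplifications, not the paper's exact semantics):

{code}
import scala.reflect.ClassTag
import org.apache.spark.graphx._

// Sketch: coarsen a graph by collapsing vertices connected through edges that
// satisfy `pred`, merging their attributes with `mergeV`.
def coarsen[VD: ClassTag, ED: ClassTag](
    g: Graph[VD, ED],
    pred: EdgeTriplet[VD, ED] => Boolean,
    mergeV: (VD, VD) => VD): Graph[VD, ED] = {
  // 1. Connected components over only the predicate-satisfying edges.
  val cc: VertexRDD[VertexId] = g.subgraph(epred = pred).connectedComponents().vertices
  // 2. Merge each component's vertex attributes into one super-vertex.
  val superVertices = g.vertices.innerJoin(cc) { (_, attr, comp) => (comp, attr) }
    .map(_._2)
    .reduceByKey(mergeV)
  // 3. Re-point edges at component ids; edges collapsed inside a component
  //    become self-loops and are dropped (parallel edges are kept as-is here).
  val withComp = g.outerJoinVertices(cc) { (vid, _, comp) => comp.getOrElse(vid) }
  val superEdges = withComp.triplets
    .map(t => Edge(t.srcAttr, t.dstAttr, t.attr))
    .filter(e => e.srcId != e.dstId)
  Graph(superVertices, superEdges)
}
{code}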



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3360) Add RowMatrix.multiply(Vector)

2016-01-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3360.
--
Resolution: Won't Fix

> Add RowMatrix.multiply(Vector)
> --
>
> Key: SPARK-3360
> URL: https://issues.apache.org/jira/browse/SPARK-3360
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Sandy Ryza
>
> RowMatrix currently has multiply(Matrix), but multiply(Vector) would be 
> useful as well.
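
A minimal workaround sketch using only the existing public API, by packing the vector into an n x 1 local matrix (names here are illustrative):

{code}
import org.apache.spark.mllib.linalg.{Matrices, Vector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Sketch: emulate RowMatrix.multiply(Vector) via the existing multiply(Matrix).
def multiplyByVector(mat: RowMatrix, v: Vector): RowMatrix = {
  val asColumn = Matrices.dense(v.size, 1, v.toArray)  // vector as an n x 1 matrix
  mat.multiply(asColumn)  // result is a RowMatrix with a single column
}
{code}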



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12638) Parameter explaination not very accurate for rdd function "aggregate"

2016-01-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12638:
--
Assignee: Tommy Yu

> Parameter explaination not very accurate for rdd function "aggregate"
> -
>
> Key: SPARK-12638
> URL: https://issues.apache.org/jira/browse/SPARK-12638
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 1.5.2
>Reporter: Tommy Yu
>Assignee: Tommy Yu
>Priority: Trivial
> Fix For: 1.6.1, 2.0.0
>
>
> Currently, the documentation for the RDD function aggregate does not explain 
> its parameters well, especially the "zeroValue" parameter. 
> It should make clear to newer Scala users that "zeroValue" participates in 
> both the "seqOp" and the "combOp" phases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12638) Parameter explaination not very accurate for rdd function "aggregate"

2016-01-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12638.
---
   Resolution: Fixed
Fix Version/s: 1.6.1
   2.0.0

Issue resolved by pull request 10587
[https://github.com/apache/spark/pull/10587]

> Parameter explaination not very accurate for rdd function "aggregate"
> -
>
> Key: SPARK-12638
> URL: https://issues.apache.org/jira/browse/SPARK-12638
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 1.5.2
>Reporter: Tommy Yu
>Priority: Trivial
> Fix For: 2.0.0, 1.6.1
>
>
> Currently, the documentation for the RDD function aggregate does not explain 
> its parameters well, especially the "zeroValue" parameter. 
> It should make clear to newer Scala users that "zeroValue" participates in 
> both the "seqOp" and the "combOp" phases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1521) Take character set size into account when compressing in-memory string columns

2016-01-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1521.
--
Resolution: Won't Fix

I assume this is obsolete, or else already implemented in some sense by Tungsten.

> Take character set size into account when compressing in-memory string columns
> --
>
> Key: SPARK-1521
> URL: https://issues.apache.org/jira/browse/SPARK-1521
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Cheng Lian
>  Labels: compression
>
> Quoted from [a blog 
> post|https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/]
>  from Facebook:
> bq. Strings dominate the largest tables in our warehouse and make up about 
> 80% of the columns across the warehouse, so optimizing compression for string 
> columns was important. By using a threshold on observed number of distinct 
> column values per stripe, we modified the ORCFile writer to apply dictionary 
> encoding to a stripe only when beneficial. Additionally, we sample the column 
> values and take the character set of the column into account, since a small 
> character set can be leveraged by codecs like Zlib for good compression and 
> dictionary encoding then becomes unnecessary or sometimes even detrimental if 
> applied.
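
To make the heuristic above concrete, here is a toy sketch of the kind of per-stripe decision being described (thresholds and names are made up, not Facebook's or ORC's actual values):

{code}
// Toy sketch: dictionary-encode a string column chunk only when there are few
// distinct values and the character set is not already small enough for Zlib alone.
def useDictionaryEncoding(values: Seq[String],
                          maxDistinctRatio: Double = 0.1, // made-up threshold
                          smallCharsetSize: Int = 64): Boolean = {
  val distinctRatio = values.distinct.size.toDouble / math.max(values.size, 1)
  val charsetSize = values.flatMap(_.toSeq).distinct.size
  distinctRatio <= maxDistinctRatio && charsetSize > smallCharsetSize
}
{code}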



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-873) Add a way to specify rack topology in Mesos and standalone modes

2016-01-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-873.
-
Resolution: Won't Fix

> Add a way to specify rack topology in Mesos and standalone modes
> 
>
> Key: SPARK-873
> URL: https://issues.apache.org/jira/browse/SPARK-873
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 0.8.0
>Reporter: Matei Zaharia
>
> Right now the YARN mode can look up rack information from YARN, but the 
> standalone and Mesos modes don't have any way of specifying rack topology. We 
> should have a pluggable script or config file that allows this. For the 
> standalone mode, we'd probably want the rack info to be known by the Master 
> rather than driver apps, and maybe the apps can get a cluster map when they 
> register.
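
Nothing was implemented for this; purely as an illustration of the kind of pluggable mapping being asked for, a sketch that reads a hypothetical "host rack" file (the file name and format are invented):

{code}
import scala.io.Source

// Hypothetical topology file: one "host rack" pair per line, '#' for comments.
def loadRackTopology(path: String = "conf/rack-topology.txt"): Map[String, String] =
  Source.fromFile(path).getLines()
    .map(_.trim)
    .filter(line => line.nonEmpty && !line.startsWith("#"))
    .map { line => val Array(host, rack) = line.split("\\s+", 2); host -> rack }
    .toMap
{code}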



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1515) Specialized ColumnTypes for Array, Map and Struct

2016-01-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1515.
--
Resolution: Won't Fix

Assuming this is obsolete

> Specialized ColumnTypes for Array, Map and Struct
> -
>
> Key: SPARK-1515
> URL: https://issues.apache.org/jira/browse/SPARK-1515
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Cheng Lian
>  Labels: compression
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1614) Move Mesos protobufs out of TaskState

2016-01-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1614.
--
Resolution: Won't Fix

> Move Mesos protobufs out of TaskState
> -
>
> Key: SPARK-1614
> URL: https://issues.apache.org/jira/browse/SPARK-1614
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 0.9.1
>Reporter: Shivaram Venkataraman
>Priority: Minor
>  Labels: Starter
>
> To isolate usage of Mesos protobufs it would be good to move them out of 
> TaskState into either a new class (MesosUtils ?) or 
> CoarseGrainedMesos{Executor, Backend}.
> This would allow applications to build Spark to run without including 
> protobuf from Mesos in their shaded jars.  This is one way to avoid protobuf 
> conflicts between Mesos and Hadoop 
> (https://issues.apache.org/jira/browse/MESOS-1203)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3055) Stack trace logged in driver on job failure is usually uninformative

2016-01-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3055.
--
Resolution: Won't Fix

> Stack trace logged in driver on job failure is usually uninformative
> 
>
> Key: SPARK-3055
> URL: https://issues.apache.org/jira/browse/SPARK-3055
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.2
>Reporter: Sandy Ryza
>Priority: Minor
>
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1.0:5 
> failed 4 times, most recent failure: TID 24 on host hddn04.lsrc.duke.edu 
> failed for unknown reason
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
> at scala.Option.foreach(Option.scala:236)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
> at 
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}
> At a cursory glance, I would expect the stack trace to point to where the task 
> error occurred.  In fact it is where the driver became aware of the error and 
> decided to fail the job.  This has been a common point of confusion among our 
> customers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2359) Supporting common statistical functions in MLlib

2016-01-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2359.
--
Resolution: Done

> Supporting common statistical functions in MLlib
> 
>
> Key: SPARK-2359
> URL: https://issues.apache.org/jira/browse/SPARK-2359
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Reynold Xin
>Assignee: Doris Xin
>
> This was originally proposed by [~falaki].
> It is a proposal for a new package within the Spark distribution to support 
> common statistical estimators. We think consolidating statistics-related 
> functions in a separate package will help the readability of the core source 
> code and encourage Spark users to contribute their functions back.
> Please see the initial design document here: 
> https://docs.google.com/document/d/1Kju9kWSYMXMjEO6ggC9bF9eNbaM4MxcFs_KDqgAcH9c/pub
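
For reference, the basic column statistics that later landed in {{org.apache.spark.mllib.stat}} can be used as below (a minimal example, not the API from the design document):

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val observations = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0),
  Vectors.dense(2.0, 20.0),
  Vectors.dense(3.0, 30.0)))

val summary = Statistics.colStats(observations)  // per-column summary statistics
println(summary.mean)      // column means
println(summary.variance)  // column variances
{code}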



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3172) Distinguish between shuffle spill on the map and reduce side

2016-01-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3172.
--
Resolution: Won't Fix

> Distinguish between shuffle spill on the map and reduce side
> 
>
> Key: SPARK-3172
> URL: https://issues.apache.org/jira/browse/SPARK-3172
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.2
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-809) Give newly registered apps a set of executors right away

2016-01-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-809.
-
Resolution: Won't Fix

I'm assuming this is WontFix at this point.

> Give newly registered apps a set of executors right away
> 
>
> Key: SPARK-809
> URL: https://issues.apache.org/jira/browse/SPARK-809
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy
>Reporter: Matei Zaharia
>Priority: Minor
>
> Right now, newly connected apps in the standalone cluster will not set a good 
> defaultParallelism value if they create RDDs right after creating a 
> SparkContext, because the executorAdded calls are asynchronous and happen 
> afterwards. It would be nice to wait for a few such calls before returning from 
> the scheduler initializer.
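
As a workaround sketch only (not what the ticket proposes), an application can poll until some executors have registered before building RDDs:

{code}
// Sketch: wait up to `maxWaitMs` for at least `minExecutors` executors to register.
// getExecutorMemoryStatus includes the driver's own block manager, hence the -1.
def waitForExecutors(sc: org.apache.spark.SparkContext,
                     minExecutors: Int,
                     maxWaitMs: Long = 30000L): Unit = {
  val deadline = System.currentTimeMillis() + maxWaitMs
  while (sc.getExecutorMemoryStatus.size - 1 < minExecutors &&
         System.currentTimeMillis() < deadline) {
    Thread.sleep(200)
  }
}
{code}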



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5273) Improve documentation examples for LinearRegression

2016-01-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5273.
--
   Resolution: Fixed
 Assignee: Sean Owen
Fix Version/s: 2.0.0
   1.6.1

Resolved by https://github.com/apache/spark/pull/10675

> Improve documentation examples for LinearRegression 
> 
>
> Key: SPARK-5273
> URL: https://issues.apache.org/jira/browse/SPARK-5273
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Dev Lakhani
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.6.1, 2.0.0
>
>
> In the document:
> https://spark.apache.org/docs/1.1.1/mllib-linear-methods.html
> Under
> Linear least squares, Lasso, and ridge regression
> The suggested way of using LinearRegressionWithSGD.train(),
> // Building the model
> val numIterations = 100
> val model = LinearRegressionWithSGD.train(parsedData, numIterations)
> is not ideal even for simple examples such as y = x. It should be replaced 
> with more realistic parameters that include a step size:
> val lr = new LinearRegressionWithSGD()
> lr.optimizer.setStepSize(0.0001)
> lr.optimizer.setNumIterations(100)
> or
> LinearRegressionWithSGD.train(input, 100, 0.0001)
> so that the example produces a reasonable MSE. It took me a while on the dev 
> list to learn that the step size should be really small. This might save 
> someone the same effort when learning MLlib.
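
Putting the suggestion together, an improved example could look roughly like the following (the data path is the one shipped with the Spark source; treat the whole snippet as illustrative):

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Load and parse the data: "label,feature1 feature2 ...".
val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

// Train with an explicit, small step size rather than the defaults.
val numIterations = 100
val stepSize = 0.0001
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)

// Evaluate with mean squared error on the training set.
val valuesAndPreds = parsedData.map { p => (p.label, model.predict(p.features)) }
val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
println(s"training Mean Squared Error = $MSE")
{code}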



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12759) Spark should fail fast if --executor-memory is too small for spark to start

2016-01-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12759:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

> Spark should fail fast if --executor-memory is too small for spark to start
> ---
>
> Key: SPARK-12759
> URL: https://issues.apache.org/jira/browse/SPARK-12759
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit
>Affects Versions: 1.6.0
>Reporter: Imran Rashid
>Priority: Minor
>
> With the UnifiedMemoryManager, the minimum memory for executor and driver 
> JVMs was increased to 450MB.  There is code in {{UnifiedMemoryManager}} to 
> provide a helpful warning if less than that much memory is provided.
> However, if you set {{--executor-memory}} to something less than that, from 
> the driver process you just see executor failures with no warning, since the 
> more meaningful errors are buried in the executor logs. E.g., on YARN, you see
> {noformat}
> 16/01/11 13:59:32 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: 
> Container marked as failed: container_1452548703600_0001_01_02 on host: 
> imran-adhoc-2.vpc.cloudera.com. Exit status: 1. Diagnostics: Exception from 
> container-launch.
> Container id: container_1452548703600_0001_01_02
> Exit code: 1
> Stack trace: ExitCodeException exitCode=1: 
>   at org.apache.hadoop.util.Shell.runCommand(Shell.java:561)
>   at org.apache.hadoop.util.Shell.run(Shell.java:478)
>   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:210)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Container exited with a non-zero exit code 1
> {noformat}
> Though there is already a message from {{UnifiedMemoryManager}} if there 
> isn't enough memory for the driver, while this is being changed it would 
> be nice if the message more clearly indicated the {{--driver-memory}} 
> configuration as well.
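
For illustration only, the kind of fail-fast check being asked for might look like this (the constant and helper are hypothetical, not Spark code):

{code}
// Hypothetical fail-fast check against the documented ~450 MB minimum.
val MinExecutorMemoryBytes: Long = 450L * 1024 * 1024
def checkExecutorMemory(executorMemoryBytes: Long): Unit = {
  require(executorMemoryBytes >= MinExecutorMemoryBytes,
    s"Executor memory ($executorMemoryBytes bytes) is below the minimum of " +
    s"$MinExecutorMemoryBytes bytes; increase --executor-memory.")
}
{code}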



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12765) CountVectorizerModel.transform lost the transformSchema

2016-01-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12765:
--
Fix Version/s: (was: 1.6.1)
   (was: 1.6.0)

[~sloth2012] please don't set the Fix Version field; it isn't applicable at this point.
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

> CountVectorizerModel.transform lost the transformSchema
> ---
>
> Key: SPARK-12765
> URL: https://issues.apache.org/jira/browse/SPARK-12765
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0, 1.6.1
>Reporter: sloth
>  Labels: patch
>
> In the ml package, CountVectorizerModel does not call transformSchema in its 
> transform function.
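
For illustration, the usual pattern in other transformers is to validate the schema at the top of {{transform}}. Below is a minimal, self-contained sketch of that pattern (this is not the CountVectorizerModel code itself):

{code}
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Minimal illustration of the pattern the bug report says is missing:
// derive/validate the output schema before doing the actual work in transform().
class NoopTransformer(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("noop"))

  override def transformSchema(schema: StructType): StructType =
    StructType(schema.fields :+ StructField("out", StringType, nullable = true))

  override def transform(dataset: DataFrame): DataFrame = {
    transformSchema(dataset.schema, logging = true)  // the call this ticket is about
    dataset.withColumn("out", lit("x"))
  }

  override def copy(extra: ParamMap): NoopTransformer = defaultCopy(extra)
}
{code}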



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2011) Eliminate duplicate join in Pregel

2016-01-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2011.
--
Resolution: Won't Fix

> Eliminate duplicate join in Pregel
> --
>
> Key: SPARK-2011
> URL: https://issues.apache.org/jira/browse/SPARK-2011
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>Priority: Minor
>
> In the iteration loop, Pregel currently performs an innerJoin to apply 
> messages to vertices followed by an outerJoinVertices to join the resulting 
> subset of vertices back to the graph. These two operations could be merged 
> into a single call to joinVertices, which should be reimplemented in a more 
> efficient manner. This would allow us to examine only the vertices that 
> received messages.
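
In GraphX terms, the two shapes being contrasted look roughly like this (a sketch with a generic vertex program; this is not the actual Pregel implementation):

{code}
import scala.reflect.ClassTag
import org.apache.spark.graphx._

// Current Pregel shape: innerJoin on the message set, then join the updated
// subset of vertices back into the full graph.
def twoStep[VD: ClassTag, ED: ClassTag, A: ClassTag](
    g: Graph[VD, ED], messages: VertexRDD[A],
    vprog: (VertexId, VD, A) => VD): Graph[VD, ED] = {
  val newVerts = g.vertices.innerJoin(messages)(vprog)
  g.outerJoinVertices(newVerts) { (_, old, updated) => updated.getOrElse(old) }
}

// Proposed shape: a single joinVertices call, which only touches the vertices
// that actually received a message.
def oneStep[VD: ClassTag, ED: ClassTag, A: ClassTag](
    g: Graph[VD, ED], messages: VertexRDD[A],
    vprog: (VertexId, VD, A) => VD): Graph[VD, ED] =
  g.joinVertices(messages)(vprog)
{code}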



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


