[jira] [Updated] (SPARK-17458) Alias specified for aggregates in a pivot are not honored

2017-01-14 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-17458:

Fix Version/s: 2.0.3

> Alias specified for aggregates in a pivot are not honored
> -
>
> Key: SPARK-17458
> URL: https://issues.apache.org/jira/browse/SPARK-17458
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ravi Somepalli
>Assignee: Andrew Ray
> Fix For: 2.0.3, 2.1.0
>
>
> When using pivot with multiple aggregations, we need aliases to avoid special
> characters in the output column names, but the aliases are not honored:
> df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show
> ||C||bar_avg(`D`) AS `COLD`||bar_max(`B`) AS `COLB`||foo_avg(`D`) AS `COLD`||foo_max(`B`) AS `COLB`||
> |small|5.5|two|2.3335|two|
> |large|5.5|two|2.0|one|
> Expected Output
> ||C||bar_COLD||bar_COLB||foo_COLD||foo_COLB||
> |small|5.5|two|2.3335|two|
> |large|5.5|two|2.0|one|
> One approach to fixing this issue is to edit
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
> and change the outputName method defined inside
> {code}
> object ResolvePivot extends Rule[LogicalPlan] {
> def apply(plan: LogicalPlan): LogicalPlan = plan transform {
> {code}
> {code}
> def outputName(value: Literal, aggregate: Expression): String = {
>   val suffix = aggregate match {
>     case n: NamedExpression => n.name
>     case _ => aggregate.sql
>   }
>   if (singleAgg) value.toString else value + "_" + suffix
> }
> {code}
> For reference, the current implementation in version 2.0.0:
> {code}
> def outputName(value: Literal, aggregate: Expression): String = {
>   if (singleAgg) value.toString else value + "_" + aggregate.sql
> }
> {code}
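To make the reported behaviour easy to reproduce, here is a minimal self-contained sketch for the spark-shell. The data and values below are invented; only the schema matters. After the proposed change, the pivoted columns should read bar_COLD, bar_COLB, foo_COLD, foo_COLB instead of embedding the aggregate SQL text.

{code}
// Minimal reproduction sketch (invented data); assumes a spark-shell session.
import org.apache.spark.sql.functions.{avg, max}
import spark.implicits._

val df = Seq(
  ("foo", "one", "small", 1.0),
  ("foo", "two", "small", 3.0),
  ("bar", "two", "small", 5.5),
  ("bar", "one", "large", 2.0)
).toDF("A", "B", "C", "D")

val pivoted = df.groupBy("C").pivot("A")
  .agg(avg("D").as("COLD"), max("B").as("COLB"))

// Before the fix these names embed the aggregate SQL, e.g. "bar_avg(`D`) AS `COLD`";
// once the alias is honored they should be "bar_COLD", "bar_COLB", etc.
pivoted.columns.foreach(println)
{code}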






[jira] [Updated] (SPARK-17237) DataFrame fill after pivot causing org.apache.spark.sql.AnalysisException

2017-01-14 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-17237:

Fix Version/s: 2.0.3

> DataFrame fill after pivot causing org.apache.spark.sql.AnalysisException
> -
>
> Key: SPARK-17237
> URL: https://issues.apache.org/jira/browse/SPARK-17237
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jiang Qiqi
>Assignee: Takeshi Yamamuro
>  Labels: newbie
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> I am trying to run a pivot transformation that worked on a Spark 1.6 cluster,
> namely:
> sc.parallelize(Seq((2,3,4), (3,4,5))).toDF("a", "b", "c")
> res1: org.apache.spark.sql.DataFrame = [a: int, b: int, c: int]
> scala> res1.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0)
> res2: org.apache.spark.sql.DataFrame = [a: int, 3_count(c): bigint, 3_avg(c): 
> double, 4_count(c): bigint, 4_avg(c): double]
> scala> res1.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0).show
> +---+----------+--------+----------+--------+
> |  a|3_count(c)|3_avg(c)|4_count(c)|4_avg(c)|
> +---+----------+--------+----------+--------+
> |  2|         1|     4.0|         0|     0.0|
> |  3|         0|     0.0|         1|     5.0|
> +---+----------+--------+----------+--------+
> After upgrading the environment to Spark 2.0, I got an error while executing
> the .na.fill method:
> scala> sc.parallelize(Seq((2,3,4), (3,4,5))).toDF("a", "b", "c")
> res3: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
> scala> res3.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0)
> org.apache.spark.sql.AnalysisException: syntax error in attribute name: 
> `3_count(`c`)`;
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:103)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:113)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:168)
>   at org.apache.spark.sql.Dataset.resolve(Dataset.scala:218)
>   at org.apache.spark.sql.Dataset.col(Dataset.scala:921)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:411)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:162)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:159)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:159)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:149)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134)
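A possible user-side workaround, sketched below under the assumption that only the backticks and parentheses in the pivot-generated column names trip parseAttributeName: rename the columns to plain identifiers before calling na.fill. This continues the spark-shell session above and is only a workaround sketch, not the fix that was merged.

{code}
// Workaround sketch: sanitize pivot-generated column names before na.fill.
val pivoted = res3.groupBy("a").pivot("b").agg(count("c"), avg("c"))

// Turn e.g. "3_count(c)" into "3_count_c" so the attribute names parse cleanly.
val sanitized = pivoted.columns.foldLeft(pivoted) { (df, name) =>
  df.withColumnRenamed(name, name.replaceAll("[`()]+", "_").stripSuffix("_"))
}

sanitized.na.fill(0).show()
{code}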






[jira] [Updated] (SPARK-10890) "Column count does not match; SQL statement:" error in JDBCWriteSuite

2017-01-14 Thread Christian Kadner (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Kadner updated SPARK-10890:
-
Description: 
I get the following error when I run the following test...

mvn -Dhadoop.version=2.4.0 
-DwildcardSuites=org.apache.spark.sql.jdbc.JDBCWriteSuite test

{noformat}
JDBCWriteSuite:
13:22:15.603 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable
13:22:16.506 WARN org.apache.spark.metrics.MetricsSystem: Using default name 
DAGScheduler for source because spark.app.id is not set.
- Basic CREATE
- CREATE with overwrite
- CREATE then INSERT to append
- CREATE then INSERT to truncate
13:22:19.312 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in 
stage 23.0 (TID 31)
org.h2.jdbc.JdbcSQLException: Column count does not match; SQL statement:
INSERT INTO TEST.INCOMPATIBLETEST VALUES (?, ?, ?) [21002-183]
at org.h2.message.DbException.getJdbcSQLException(DbException.java:345)
at org.h2.message.DbException.get(DbException.java:179)
at org.h2.message.DbException.get(DbException.java:155)
at org.h2.message.DbException.get(DbException.java:144)
at org.h2.command.dml.Insert.prepare(Insert.java:265)
at org.h2.command.Parser.prepareCommand(Parser.java:247)
at org.h2.engine.Session.prepareLocal(Session.java:446)
at org.h2.engine.Session.prepareCommand(Session.java:388)
at org.h2.jdbc.JdbcConnection.prepareCommand(JdbcConnection.java:1189)
at 
org.h2.jdbc.JdbcPreparedStatement.<init>(JdbcPreparedStatement.java:72)
at org.h2.jdbc.JdbcConnection.prepareStatement(JdbcConnection.java:277)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.insertStatement(JdbcUtils.scala:72)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:100)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:229)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:228)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$32.apply(RDD.scala:892)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$32.apply(RDD.scala:892)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1856)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1856)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
13:22:19.312 ERROR org.apache.spark.executor.Executor: Exception in task 1.0 in 
stage 23.0 (TID 32)
org.h2.jdbc.JdbcSQLException: Column count does not match; SQL statement:
INSERT INTO TEST.INCOMPATIBLETEST VALUES (?, ?, ?) [21002-183]
at org.h2.message.DbException.getJdbcSQLException(DbException.java:345)
at org.h2.message.DbException.get(DbException.java:179)
at org.h2.message.DbException.get(DbException.java:155)
at org.h2.message.DbException.get(DbException.java:144)
at org.h2.command.dml.Insert.prepare(Insert.java:265)
at org.h2.command.Parser.prepareCommand(Parser.java:247)
at org.h2.engine.Session.prepareLocal(Session.java:446)
at org.h2.engine.Session.prepareCommand(Session.java:388)
at org.h2.jdbc.JdbcConnection.prepareCommand(JdbcConnection.java:1189)
at 
org.h2.jdbc.JdbcPreparedStatement.<init>(JdbcPreparedStatement.java:72)
at org.h2.jdbc.JdbcConnection.prepareStatement(JdbcConnection.java:277)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.insertStatement(JdbcUtils.scala:72)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:100)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:229)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:228)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$32.apply(RDD.scala:892)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$32.apply(RDD.scala:892)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1856)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1856)
at 

[jira] [Created] (SPARK-19225) Spark SQL round constant double return null

2017-01-14 Thread discipleforteen (JIRA)
discipleforteen created SPARK-19225:
---

 Summary: Spark SQL round constant double return null 
 Key: SPARK-19225
 URL: https://issues.apache.org/jira/browse/SPARK-19225
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.1.0, 2.0.2, 2.0.1, 2.0.0
Reporter: discipleforteen
 Fix For: 2.1.1


Spark SQL's round of a constant double value may return null. For example, 'select round(4.4,
2)' returns null in Spark 2.x, which is not compatible with Spark 1.x. It seems that 4.4 is
now cast to a decimal by the new SqlBase.g4 grammar, whereas it was cast to a double in
Spark 1.x, and rounding the decimal 4.4 to scale 2 yields null in changePrecision...
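For reference, a hedged spark-shell sketch of the behaviour described above; the second statement is only a user-side workaround that forces the 1.x-style double semantics, not a fix for the underlying decimal precision handling.

{code}
// Sketch of the reported 2.x behaviour and a workaround via an explicit cast.
spark.sql("SELECT round(4.4, 2)").show()                  // reported to return null in 2.x
spark.sql("SELECT round(CAST(4.4 AS DOUBLE), 2)").show()  // casts the literal first; returns 4.4
{code}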






[jira] [Commented] (SPARK-17458) Alias specified for aggregates in a pivot are not honored

2017-01-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823010#comment-15823010
 ] 

Apache Spark commented on SPARK-17458:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/16565

> Alias specified for aggregates in a pivot are not honored
> -
>
> Key: SPARK-17458
> URL: https://issues.apache.org/jira/browse/SPARK-17458
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ravi Somepalli
>Assignee: Andrew Ray
> Fix For: 2.1.0
>
>
> When using pivot with multiple aggregations, we need aliases to avoid special
> characters in the output column names, but the aliases are not honored:
> df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show
> ||C||bar_avg(`D`) AS `COLD`||bar_max(`B`) AS `COLB`||foo_avg(`D`) AS `COLD`||foo_max(`B`) AS `COLB`||
> |small|5.5|two|2.3335|two|
> |large|5.5|two|2.0|one|
> Expected Output
> ||C||bar_COLD||bar_COLB||foo_COLD||foo_COLB||
> |small|5.5|two|2.3335|two|
> |large|5.5|two|2.0|one|
> One approach to fixing this issue is to edit
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
> and change the outputName method defined inside
> {code}
> object ResolvePivot extends Rule[LogicalPlan] {
> def apply(plan: LogicalPlan): LogicalPlan = plan transform {
> {code}
> {code}
> def outputName(value: Literal, aggregate: Expression): String = {
>   val suffix = aggregate match {
>     case n: NamedExpression => n.name
>     case _ => aggregate.sql
>   }
>   if (singleAgg) value.toString else value + "_" + suffix
> }
> {code}
> For reference, the current implementation in version 2.0.0:
> {code}
> def outputName(value: Literal, aggregate: Expression): String = {
>   if (singleAgg) value.toString else value + "_" + aggregate.sql
> }
> {code}






[jira] [Commented] (SPARK-19217) Offer easy cast from vector to array

2017-01-14 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823007#comment-15823007
 ] 

Takeshi Yamamuro commented on SPARK-19217:
--

Would it be better for `cast` to support this conversion? IMO, implementing a UDF for
this conversion is the better option, because the implementation is easy and the design
stays simple.
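A sketch of the UDF approach mentioned above, in Scala (the same idea works from PySpark). It assumes an ml.linalg vector column named "features" on a hypothetical DataFrame df.

{code}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// UDF that unpacks an ML vector into a plain array-of-double column.
val vecToArray = udf((v: Vector) => v.toArray)
val withArrays = df.withColumn("features_arr", vecToArray(col("features")))
withArrays.printSchema()  // features_arr: array of double
{code}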

> Offer easy cast from vector to array
> 
>
> Key: SPARK-19217
> URL: https://issues.apache.org/jira/browse/SPARK-19217
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Working with ML often means working with DataFrames with vector columns. You 
> can't save these DataFrames to storage without converting the vector columns 
> to array columns, and there doesn't appear to an easy way to make that 
> conversion.
> This is a common enough problem that it is [documented on Stack 
> Overflow|http://stackoverflow.com/q/35855382/877069]. The current solutions 
> to making the conversion from a vector column to an array column are:
> # Convert the DataFrame to an RDD and back
> # Use a UDF
> Both approaches work fine, but it really seems like you should be able to do 
> something like this instead:
> {code}
> (le_data
>     .select(
>         col('features').cast('array').alias('features')
>     ))
> {code}
> We already have an {{ArrayType}} in {{pyspark.sql.types}}, but it appears 
> that {{cast()}} doesn't support this conversion.
> Would this be an appropriate thing to add?






[jira] [Commented] (SPARK-19222) Limit Query Performance issue

2017-01-14 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823005#comment-15823005
 ] 

Takeshi Yamamuro commented on SPARK-19222:
--

In those cases, isn't "sample + select" enough?
IMO, in most cases "limit" is used together with "sort" for operations like top-k
search, and in that "sort + limit" case I think the approach you described does not
work.
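To illustrate the top-k point above, a small hedged sketch (the table and column names are hypothetical, borrowed from the plan in the description): when the limit follows a sort, Spark plans it as TakeOrderedAndProject, which keeps only the top rows per partition and merges them, instead of shuffling every partition's limit rows into a single partition.

{code}
import org.apache.spark.sql.functions.col

// Hypothetical table/column; only the plan shape matters here.
val topK = spark.table("dest1").sort(col("imei")).limit(1000)
topK.explain()
// Expected plan shape (abridged):
// TakeOrderedAndProject(limit=1000, orderBy=[imei ASC], output=[...])
{code}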

> Limit Query Performance issue
> -
>
> Key: SPARK-19222
> URL: https://issues.apache.org/jira/browse/SPARK-19222
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Linux/Windows
>Reporter: Sujith
>Priority: Minor
>
> A performance/memory bottleneck occurs in the below-mentioned queries:
> case 1:
> create table t1 as select * from dest1 limit 1000;
> case 2:
> create table t1 as select * from dest1 limit 1000;
> pre-condition : partition count >=1
> In the above cases, the limit is added at the end of the physical plan:
> == Physical Plan  ==
> ExecutedCommand
>+- CreateHiveTableAsSelectCommand [Database:spark, TableName: t2, 
> InsertIntoHiveTable]
>  +- GlobalLimit 1000
> +- LocalLimit 1000
>+- Project [imei#101, age#102, task#103L, num#104, level#105, 
> productdate#106, name#107, point#108]
>   +- SubqueryAlias hive
>  +- 
> Relation[imei#101,age#102,task#103L,num#104,level#105,productdate#106,name#107,point#108]
>  csv  |
> Issue hints:
> A possible bottleneck is the following snippet in limit.scala, under the spark-sql package:
>   protected override def doExecute(): RDD[InternalRow] = {
> val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
> val shuffled = new ShuffledRowRDD(
>   ShuffleExchange.prepareShuffleDependency(
> locallyLimited, child.output, SinglePartition, serializer))
> shuffled.mapPartitionsInternal(_.take(limit))
>   }
> As mentioned above, in case 1 (where the limit value is 1000 or the partition
> count is > 1) and case 2 (where the limit value is small, around 1000), the
> snippet above builds the ShuffledRowRDD by grouping the limited data from all
> partitions into a single partition on one executor. A memory issue occurs
> because every partition's limited data is collected and grouped into that
> single partition for processing; in both cases the row count can grow very
> high, which creates the memory bottleneck.
> Proposed solution for case 2:
> An accumulator value can be sent to all partitions, and every executor updates
> the accumulator based on the data it fetches.
> E.g., number of partitions = 100, number of cores = 10:
> Ideally tasks are launched in waves of 10 tasks (one per core). Once the first
> wave finishes, the driver checks whether the accumulator value has reached the
> limit; if it has, no further tasks are launched to the executors and the
> result after applying the limit is returned.
> Please let me know of any suggestions or solutions for the above-mentioned
> problems.
> Thanks,
> Sujith






[jira] [Closed] (SPARK-19005) Keep column ordering when a schema is explicitly specified

2017-01-14 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro closed SPARK-19005.

Resolution: Won't Fix

>  Keep column ordering when a schema is explicitly specified
> ---
>
> Key: SPARK-19005
> URL: https://issues.apache.org/jira/browse/SPARK-19005
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This ticket is to keep column ordering when a schema is explicitly specified.
> A concrete example is as follows;
> {code}
> scala> import org.apache.spark.sql.types._
> scala> case class A(a: Long, b: Int)
> scala> val as = Seq(A(1, 2))
> scala> 
> spark.createDataFrame(as).write.parquet("/Users/maropu/Desktop/data/a=1/")
> scala> val df = spark.read.parquet("/Users/maropu/Desktop/data/")
> scala> df.printSchema
> root
>  |-- a: integer (nullable = true)
>  |-- b: integer (nullable = true)
> scala> val schema = new StructType().add("a", LongType).add("b", IntegerType)
> scala> val df = 
> spark.read.schema(schema).parquet("/Users/maropu/Desktop/data/")
> scala> df.printSchema
> root
>  |-- b: integer (nullable = true)
>  |-- a: long (nullable = true)
> {code}
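Since this was closed as Won't Fix, a small workaround sketch for anyone hitting the same thing, continuing the spark-shell session above: reorder the columns explicitly to match the user-specified schema after reading.

{code}
import org.apache.spark.sql.functions.col

// Select columns in the order of the supplied schema, since the reader keeps
// the on-file ordering (see the second printSchema output above).
val reordered = df.select(schema.fieldNames.map(col): _*)
reordered.printSchema()
// root
//  |-- a: long (nullable = true)
//  |-- b: integer (nullable = true)
{code}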






[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB

2017-01-14 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822940#comment-15822940
 ] 

Nicholas Chammas commented on SPARK-18492:
--

Actually, on second look, I'm not entirely sure about that. It seems there are 
potentially multiple issues related to a 64KB code gen limit.

[~srowen] / [~cloud_fan]: I noticed you commented on the linked issues. Would 
you happen to know if this issue is covered by one of them?

> GeneratedIterator grows beyond 64 KB
> 
>
> Key: SPARK-18492
> URL: https://issues.apache.org/jira/browse/SPARK-18492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: CentOS release 6.7 (Final)
>Reporter: Norris Merritt
>
> spark-submit fails with ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(I[Lscala/collection/Iterator;)V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
> grows beyond 64 KB
> Error message is followed by a huge dump of generated source code.
> The generated code declares 1,454 field sequences like the following:
> /* 036 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1;
> /* 037 */   private scala.Function1 project_catalystConverter1;
> /* 038 */   private scala.Function1 project_converter1;
> /* 039 */   private scala.Function1 project_converter2;
> /* 040 */   private scala.Function2 project_udf1;
>   (many omitted lines) ...
> /* 6089 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1454;
> /* 6090 */   private scala.Function1 project_catalystConverter1454;
> /* 6091 */   private scala.Function1 project_converter1695;
> /* 6092 */   private scala.Function1 project_udf1454;
> It then proceeds to emit code for several methods (init, processNext) each of 
> which has totally repetitive sequences of statements pertaining to each of 
> the sequences of variables declared in the class.  For example:
> /* 6101 */   public void init(int index, scala.collection.Iterator inputs[]) {
> The 64 KB JVM limit on a method's code is exceeded because the code generator
> is using an incredibly naive strategy. It emits a sequence like the one shown
> below for each of the 1,454 groups of variables shown above, in 
> /* 6132 */ this.project_udf = 
> (scala.Function1)project_scalaUDF.userDefinedFunc();
> /* 6133 */ this.project_scalaUDF1 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[10];
> /* 6134 */ this.project_catalystConverter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF1.dataType());
> /* 6135 */ this.project_converter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(0))).dataType());
> /* 6136 */ this.project_converter2 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(1))).dataType());
> It blows up after emitting 230 such sequences, while trying to emit the 231st:
> /* 7282 */ this.project_udf230 = 
> (scala.Function2)project_scalaUDF230.userDefinedFunc();
> /* 7283 */ this.project_scalaUDF231 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[240];
> /* 7284 */ this.project_catalystConverter231 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF231.dataType());
>   many omitted lines ...
>  Example of repetitive code sequences emitted for processNext method:
> /* 12253 */   boolean project_isNull247 = project_result244 == null;
> /* 12254 */   MapData project_value247 = null;
> /* 12255 */   if (!project_isNull247) {
> /* 12256 */ project_value247 = project_result244;
> /* 12257 */   }
> /* 12258 */   Object project_arg = sort_isNull5 ? null : 
> project_converter489.apply(sort_value5);
> /* 12259 */
> /* 12260 */   ArrayData project_result249 = null;
> /* 12261 */   try {
> /* 12262 */ project_result249 = 
> (ArrayData)project_catalystConverter248.apply(project_udf248.apply(project_arg));
> /* 12263 */   } catch (Exception e) {
> /* 12264 */ throw new 
> org.apache.spark.SparkException(project_scalaUDF248.udfErrorMessage(), e);
> /* 12265 */   }
> /* 12266 */
> /* 12267 */   boolean project_isNull252 = project_result249 

[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB

2017-01-14 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822939#comment-15822939
 ] 

Nicholas Chammas commented on SPARK-18492:
--

Oh, it looks like this issue is duplicated by SPARK-15285 and SPARK-16845, and 
a fix for this will be included in Spark 2.1.1 and 2.2.0.

> GeneratedIterator grows beyond 64 KB
> 
>
> Key: SPARK-18492
> URL: https://issues.apache.org/jira/browse/SPARK-18492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: CentOS release 6.7 (Final)
>Reporter: Norris Merritt
>
> spark-submit fails with ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(I[Lscala/collection/Iterator;)V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
> grows beyond 64 KB
> Error message is followed by a huge dump of generated source code.
> The generated code declares 1,454 field sequences like the following:
> /* 036 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1;
> /* 037 */   private scala.Function1 project_catalystConverter1;
> /* 038 */   private scala.Function1 project_converter1;
> /* 039 */   private scala.Function1 project_converter2;
> /* 040 */   private scala.Function2 project_udf1;
>   (many omitted lines) ...
> /* 6089 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1454;
> /* 6090 */   private scala.Function1 project_catalystConverter1454;
> /* 6091 */   private scala.Function1 project_converter1695;
> /* 6092 */   private scala.Function1 project_udf1454;
> It then proceeds to emit code for several methods (init, processNext) each of 
> which has totally repetitive sequences of statements pertaining to each of 
> the sequences of variables declared in the class.  For example:
> /* 6101 */   public void init(int index, scala.collection.Iterator inputs[]) {
> The 64 KB JVM limit on a method's code is exceeded because the code generator
> is using an incredibly naive strategy. It emits a sequence like the one shown
> below for each of the 1,454 groups of variables shown above, in 
> /* 6132 */ this.project_udf = 
> (scala.Function1)project_scalaUDF.userDefinedFunc();
> /* 6133 */ this.project_scalaUDF1 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[10];
> /* 6134 */ this.project_catalystConverter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF1.dataType());
> /* 6135 */ this.project_converter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(0))).dataType());
> /* 6136 */ this.project_converter2 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(1))).dataType());
> It blows up after emitting 230 such sequences, while trying to emit the 231st:
> /* 7282 */ this.project_udf230 = 
> (scala.Function2)project_scalaUDF230.userDefinedFunc();
> /* 7283 */ this.project_scalaUDF231 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[240];
> /* 7284 */ this.project_catalystConverter231 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF231.dataType());
>   many omitted lines ...
>  Example of repetitive code sequences emitted for processNext method:
> /* 12253 */   boolean project_isNull247 = project_result244 == null;
> /* 12254 */   MapData project_value247 = null;
> /* 12255 */   if (!project_isNull247) {
> /* 12256 */ project_value247 = project_result244;
> /* 12257 */   }
> /* 12258 */   Object project_arg = sort_isNull5 ? null : 
> project_converter489.apply(sort_value5);
> /* 12259 */
> /* 12260 */   ArrayData project_result249 = null;
> /* 12261 */   try {
> /* 12262 */ project_result249 = 
> (ArrayData)project_catalystConverter248.apply(project_udf248.apply(project_arg));
> /* 12263 */   } catch (Exception e) {
> /* 12264 */ throw new 
> org.apache.spark.SparkException(project_scalaUDF248.udfErrorMessage(), e);
> /* 12265 */   }
> /* 12266 */
> /* 12267 */   boolean project_isNull252 = project_result249 == null;
> /* 12268 */   ArrayData project_value252 = null;
> /* 12269 */   if (!project_isNull252) {
> /* 12270 */ 

[jira] [Created] (SPARK-19224) [PYSPARK] Python tests organization

2017-01-14 Thread Saikat Kanjilal (JIRA)
Saikat Kanjilal created SPARK-19224:
---

 Summary: [PYSPARK] Python tests organization
 Key: SPARK-19224
 URL: https://issues.apache.org/jira/browse/SPARK-19224
 Project: Spark
  Issue Type: Test
  Components: PySpark
Affects Versions: 2.1.0
 Environment: Specific to pyspark
Reporter: Saikat Kanjilal
 Fix For: 2.2.0


Move all PySpark tests into packages, separating them into modules that reflect the
project structure.






[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB

2017-01-14 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822934#comment-15822934
 ] 

Nicholas Chammas commented on SPARK-18492:
--

I suppose the "correct" solution is to make code generation a bit smarter. But 
as a band-aid solution we can use for now, would it make sense to just allow 
this 64KB limit to be raised?

What would the potential issues be with allowing this limit to be something 
much larger, like 512KB or even 1MB?
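For context on the question above: the 64 KB cap is the JVM's hard limit on the bytecode of a single method, so it cannot simply be raised by configuration; a durable fix has to come from smarter code generation. A hedged stop-gap sketch, assuming the failing plan comes from whole-stage codegen (which is what produces the GeneratedIterator class):

{code}
// Stop-gap sketch: fall back to the interpreted (non-whole-stage) execution
// path for the session, trading speed for avoiding the giant generated method.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
{code}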

> GeneratedIterator grows beyond 64 KB
> 
>
> Key: SPARK-18492
> URL: https://issues.apache.org/jira/browse/SPARK-18492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: CentOS release 6.7 (Final)
>Reporter: Norris Merritt
>
> spark-submit fails with ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(I[Lscala/collection/Iterator;)V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
> grows beyond 64 KB
> Error message is followed by a huge dump of generated source code.
> The generated code declares 1,454 field sequences like the following:
> /* 036 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1;
> /* 037 */   private scala.Function1 project_catalystConverter1;
> /* 038 */   private scala.Function1 project_converter1;
> /* 039 */   private scala.Function1 project_converter2;
> /* 040 */   private scala.Function2 project_udf1;
>   (many omitted lines) ...
> /* 6089 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1454;
> /* 6090 */   private scala.Function1 project_catalystConverter1454;
> /* 6091 */   private scala.Function1 project_converter1695;
> /* 6092 */   private scala.Function1 project_udf1454;
> It then proceeds to emit code for several methods (init, processNext) each of 
> which has totally repetitive sequences of statements pertaining to each of 
> the sequences of variables declared in the class.  For example:
> /* 6101 */   public void init(int index, scala.collection.Iterator inputs[]) {
> The 64 KB JVM limit on a method's code is exceeded because the code generator
> is using an incredibly naive strategy. It emits a sequence like the one shown
> below for each of the 1,454 groups of variables shown above, in 
> /* 6132 */ this.project_udf = 
> (scala.Function1)project_scalaUDF.userDefinedFunc();
> /* 6133 */ this.project_scalaUDF1 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[10];
> /* 6134 */ this.project_catalystConverter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF1.dataType());
> /* 6135 */ this.project_converter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(0))).dataType());
> /* 6136 */ this.project_converter2 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(1))).dataType());
> It blows up after emitting 230 such sequences, while trying to emit the 231st:
> /* 7282 */ this.project_udf230 = 
> (scala.Function2)project_scalaUDF230.userDefinedFunc();
> /* 7283 */ this.project_scalaUDF231 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[240];
> /* 7284 */ this.project_catalystConverter231 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF231.dataType());
>   many omitted lines ...
>  Example of repetitive code sequences emitted for processNext method:
> /* 12253 */   boolean project_isNull247 = project_result244 == null;
> /* 12254 */   MapData project_value247 = null;
> /* 12255 */   if (!project_isNull247) {
> /* 12256 */ project_value247 = project_result244;
> /* 12257 */   }
> /* 12258 */   Object project_arg = sort_isNull5 ? null : 
> project_converter489.apply(sort_value5);
> /* 12259 */
> /* 12260 */   ArrayData project_result249 = null;
> /* 12261 */   try {
> /* 12262 */ project_result249 = 
> (ArrayData)project_catalystConverter248.apply(project_udf248.apply(project_arg));
> /* 12263 */   } catch (Exception e) {
> /* 12264 */ throw new 
> org.apache.spark.SparkException(project_scalaUDF248.udfErrorMessage(), e);
> /* 12265 */   }
> /* 12266 */
> /* 12267 */   boolean 

[jira] [Commented] (SPARK-18570) Consider supporting other R formula operators

2017-01-14 Thread Krishna Kalyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822922#comment-15822922
 ] 

Krishna Kalyan commented on SPARK-18570:


Thanks [~yanboliang] [~felixcheung],
We could start by supporting the interaction term x*y (x + y + x:y). Could you
please advise on what other operators we could support?

Thanks,
Krishna
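For concreteness, a small sketch of what supporting x*y would mean in practice: today the expansion x + y + x:y has to be written out by hand, since ":" is already handled by RFormula while "*" is not. Column names below are hypothetical.

{code}
import org.apache.spark.ml.feature.RFormula

// Spell out x*y manually as x + y + x:y, which the current parser accepts.
val formula = new RFormula()
  .setFormula("label ~ x + y + x:y")
  .setFeaturesCol("features")
  .setLabelCol("label")

val prepared = formula.fit(dataset).transform(dataset)  // `dataset` is a hypothetical DataFrame
{code}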


> Consider supporting other R formula operators
> -
>
> Key: SPARK-18570
> URL: https://issues.apache.org/jira/browse/SPARK-18570
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Felix Cheung
>Priority: Minor
>
> Such as
> {code}
> ∗ 
>  X∗Y include these variables and the interactions between them
> ^
>  (X + Z + W)^3 include these variables and all interactions up to three way
> |
>  X | Z conditioning: include x given z
> {code}
> Others include %in% and ` (backtick):
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html






[jira] [Updated] (SPARK-19151) DataFrameWriter.saveAsTable should work with hive format with overwrite mode

2017-01-14 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-19151:

Assignee: Song Jun

> DataFrameWriter.saveAsTable should work with hive format with overwrite mode
> 
>
> Key: SPARK-19151
> URL: https://issues.apache.org/jira/browse/SPARK-19151
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Song Jun
> Fix For: 2.2.0
>
>







[jira] [Resolved] (SPARK-19151) DataFrameWriter.saveAsTable should work with hive format with overwrite mode

2017-01-14 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-19151.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> DataFrameWriter.saveAsTable should work with hive format with overwrite mode
> 
>
> Key: SPARK-19151
> URL: https://issues.apache.org/jira/browse/SPARK-19151
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 2.2.0
>
>







[jira] [Resolved] (SPARK-19221) Add winutils binaries to Path in AppVeyor for Hadoop libraries to call native libraries properly

2017-01-14 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-19221.
---
   Resolution: Fixed
 Assignee: Hyukjin Kwon
Fix Version/s: 2.2.0

Resolved by https://github.com/apache/spark/pull/16584

> Add winutils binaries to Path in AppVeyor for Hadoop libraries to call native 
> libraries properly
> 
>
> Key: SPARK-19221
> URL: https://issues.apache.org/jira/browse/SPARK-19221
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra, SparkR
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.2.0
>
>
> It seems the Hadoop libraries need {{hadoop.dll}} on the path for their native
> calls. This is not a problem in tests for now because we only test SparkR on
> Windows via AppVeyor, but it can become a problem if we run Scala tests via
> AppVeyor, as below:
> {code}
>  - SPARK-18220: read Hive orc table with varchar column *** FAILED *** (3 
> seconds, 937 milliseconds)
>org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution 
> Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. 
> org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:625)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:609)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:230)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:229)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:272)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:609)
>at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:599)
>at 
> org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply$mcV$sp(OrcSourceSuite.scala:159)
>at 
> org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
>at 
> org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
>at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>at org.scalatest.Transformer.apply(Transformer.scala:22)
>at org.scalatest.Transformer.apply(Transformer.scala:20)
>at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
>at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>at scala.collection.immutable.List.foreach(List.scala:381)
>at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>at org.scalatest.Suite$class.run(Suite.scala:1424)
>at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31)
>at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
>at 

[jira] [Commented] (SPARK-19208) MaxAbsScaler and MinMaxScaler are very inefficient

2017-01-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822762#comment-15822762
 ] 

Sean Owen commented on SPARK-19208:
---

The super sparse high-dimensional cases don't seem like cases where you'd scale 
features -- exactly because the values are sparse, and shifting the mean makes 
them non-sparse. You might just want scaling, but then I'm not sure this would 
be applicable to cases where the output of something like HashingTF is used.

I don't think computing a few extra sufficient statistics makes much observable 
difference relative to all the other overhead. It's just a few more bits of 
math. Maybe if there were a cleaner way to do it without copying code.

> MaxAbsScaler and MinMaxScaler are very inefficient
> --
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Assignee: Apache Spark
> Attachments: WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} use
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However, {{MultivariateOnlineSummarizer}} also computes extra, unused
> statistics; this slows down the task and, moreover, makes it more prone to OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748,401 instances and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After the modification in the PR, the above example runs successfully.
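A sketch of the leaner computation argued for above (this is not the code in the PR): only a single max-of-absolute-values array is maintained, instead of the summarizer's full set of statistics. It assumes an RDD[Vector] named vectors with a fixed dimension n.

{code}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD

// Single pass computing max(|x|) per feature, the only statistic MaxAbsScaler needs.
def maxAbs(vectors: RDD[Vector], n: Int): Array[Double] =
  vectors.treeAggregate(Array.fill(n)(0.0))(
    seqOp = (acc, v) => {
      v.foreachActive((i, x) => acc(i) = math.max(acc(i), math.abs(x)))
      acc
    },
    combOp = (a, b) => {
      var i = 0
      while (i < n) { a(i) = math.max(a(i), b(i)); i += 1 }
      a
    }
  )
{code}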






[jira] [Commented] (SPARK-3249) Fix links in ScalaDoc that cause warning messages in `sbt/sbt unidoc`

2017-01-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822750#comment-15822750
 ] 

Sean Owen commented on SPARK-3249:
--

Yes, I remember someone else reported there may be new unidoc failures. If 
you're willing to fix them, go ahead.

> Fix links in ScalaDoc that cause warning messages in `sbt/sbt unidoc`
> -
>
> Key: SPARK-3249
> URL: https://issues.apache.org/jira/browse/SPARK-3249
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Affects Versions: 1.1.0
>Reporter: Xiangrui Meng
>Priority: Minor
>
> If there are multiple overloaded versions of a method, we should make the 
> links more specific. Otherwise, `sbt/sbt unidoc` generates warning messages 
> like the following:
> {code}
> [warn] 
> mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala:305: The 
> link target "org.apache.spark.mllib.tree.DecisionTree$#trainClassifier" is 
> ambiguous. Several members fit the target:
> [warn] (input: 
> org.apache.spark.api.java.JavaRDD[org.apache.spark.mllib.regression.LabeledPoint],numClassesForClassification:
>  Int,categoricalFeaturesInfo: java.util.Map[Integer,Integer],impurity: 
> String,maxDepth: Int,maxBins: Int): 
> org.apache.spark.mllib.tree.model.DecisionTreeModel in object DecisionTree 
> [chosen]
> [warn] (input: 
> org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint],numClassesForClassification:
>  Int,categoricalFeaturesInfo: Map[Int,Int],impurity: String,maxDepth: 
> Int,maxBins: Int): org.apache.spark.mllib.tree.model.DecisionTreeModel in 
> object DecisionTree
> {code}


