[jira] [Commented] (SPARK-13932) CUBE Query with filter (HAVING) and condition (IF) raises an AnalysisException

2016-03-19 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15198502#comment-15198502
 ] 

Xiao Li commented on SPARK-13932:
-

Have you tried the latest 2.0 version? 

> CUBE Query with filter (HAVING) and condition (IF) raises an AnalysisException
> --
>
> Key: SPARK-13932
> URL: https://issues.apache.org/jira/browse/SPARK-13932
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Tien-Dung LE
>
> A complex aggregate query that uses a condition inside an aggregate function 
> together with a GROUP BY ... HAVING clause raises an exception. This issue 
> only happens in Spark 1.6.x, not in Spark 1.5.x.
> Here is a typical error message {code}
> org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: 
> b#55, b#124.; line 1 pos 178
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
> {code}
> Here is a code snippet to reproduce the error in a spark-shell session:
> {code}
> import sqlContext.implicits._
> case class Toto(  a: String = f"${(math.random*1e6).toLong}%06.0f",
>   b: Int = (math.random*1e3).toInt,
>   n: Int = (math.random*1e3).toInt,
>   m: Double = (math.random*1e3))
> val data = sc.parallelize(1 to 1e6.toInt).map(i => Toto())
> val df: org.apache.spark.sql.DataFrame = sqlContext.createDataFrame( data )
> df.registerTempTable( "toto" )
> val sqlSelect1   = "SELECT a, b, COUNT(1) AS k1, COUNT(1) AS k2, SUM(m) AS 
> k3, GROUPING__ID"
> val sqlSelect2   = "SELECT a, b, COUNT(1) AS k1, COUNT(IF(n > 500,1,0)) AS 
> k2, SUM(m) AS k3, GROUPING__ID"
> val sqlGroupBy  = "FROM toto GROUP BY a, b GROUPING SETS ((a,b),(a),(b))"
> val sqlHaving   = "HAVING ((GROUPING__ID & 1) == 1) AND (b > 500)"
> sqlContext.sql( s"$sqlSelect1 $sqlGroupBy $sqlHaving" ) // OK
> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy" ) // OK
> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy $sqlHaving" ) // ERROR
> {code}
> And here is the full log
> {code}
> scala> sqlContext.sql( s"$sqlSelect1 $sqlGroupBy $sqlHaving" )
> res12: org.apache.spark.sql.DataFrame = [a: string, b: int, k1: bigint, k2: 
> bigint, k3: double, GROUPING__ID: int]
> scala> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy" )
> res13: org.apache.spark.sql.DataFrame = [a: string, b: int, k1: bigint, k2: 
> bigint, k3: double, GROUPING__ID: int]
> scala> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy $sqlHaving" ) // ERROR
> org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: 
> b#55, b#124.; line 1 pos 178
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:171)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471)
>   at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:471)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:467)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:316)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:316)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:265)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at 

[jira] [Created] (SPARK-13968) Use MurmurHash for feature hashing

2016-03-19 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-13968:
--

 Summary: Use MurmurHash for feature hashing
 Key: SPARK-13968
 URL: https://issues.apache.org/jira/browse/SPARK-13968
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib
Reporter: Nick Pentreath
Priority: Minor


Typically feature hashing is done on strings, i.e. feature names (for raw 
feature indexes, either the string representation of the numerical index can be 
hashed, or the index can be used as-is and not hashed).

It is common to use a well-distributed hash function such as MurmurHash3. This 
is the case in e.g. 
[Scikit-learn|http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html#sklearn.feature_extraction.FeatureHasher].

Currently Spark's {{HashingTF}} uses the object's hash code. Look at using 
MurmurHash3 (at least for {{String}} which is the common case).
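
A minimal sketch (not Spark's implementation) of the difference, using a 
hypothetical feature name and vector size; the non-negative modulo helper only 
keeps the bucket index in range:
{code}
import scala.util.hashing.MurmurHash3

// keep the bucket index non-negative even for Int.MinValue-style hash values
def nonNegativeMod(x: Int, mod: Int): Int = {
  val raw = x % mod
  if (raw < 0) raw + mod else raw
}

val numFeatures = 1 << 18       // hypothetical feature-vector size
val term = "some_feature_name"  // hypothetical feature name

val javaIndex   = nonNegativeMod(term.hashCode, numFeatures)                // current: JVM hashCode
val murmurIndex = nonNegativeMod(MurmurHash3.stringHash(term), numFeatures) // proposed: MurmurHash3
{code}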






[jira] [Resolved] (SPARK-14001) support multi-children Union in SQLBuilder

2016-03-19 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-14001.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11818
[https://github.com/apache/spark/pull/11818]

> support multi-children Union in SQLBuilder
> --
>
> Key: SPARK-14001
> URL: https://issues.apache.org/jira/browse/SPARK-14001
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 2.0.0
>
>







[jira] [Commented] (SPARK-13935) Other clients' connections hang when someone does a huge load

2016-03-19 Thread Tao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197491#comment-15197491
 ] 

Tao Wang commented on SPARK-13935:
--

After checking the latest code on GitHub, I believe the same problem exists 
there too.

Below is the relevant segment of the jstack output.

{quote}
"HiveServer2-Handler-Pool: Thread-220" #220 daemon prio=5 os_prio=0 
tid=0x7fc390c4c800 nid=0x5fcb waiting for monitor entry [0x7fc367ac3000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at 
org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:262)
- waiting to lock <0x000702871ce8> (a 
org.apache.spark.sql.hive.client.IsolatedClientLoader)
at 
org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:305)
at 
org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:885)
at 
org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:822)
at 
org.apache.spark.sql.hive.InnerHiveContext.setConf(InnerHiveContext.scala:618)
at 
org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:541)
at 
org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:540)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:540)
at 
org.apache.spark.sql.hive.InnerHiveContext.<init>(InnerHiveContext.scala:102)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:55)
at 
org.apache.spark.sql.hive.HiveContext.newSession(HiveContext.scala:80)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLSessionManager.openSession(SparkSQLSessionManager.scala:78)
at 
org.apache.hive.service.cli.CLIService.openSessionWithImpersonation(CLIService.java:189)
at 
org.apache.hive.service.cli.thrift.ThriftCLIService.getSessionHandle(ThriftCLIService.java:654)
at 
org.apache.hive.service.cli.thrift.ThriftCLIService.OpenSession(ThriftCLIService.java:522)
at 
org.apache.hive.service.cli.thrift.TCLIService$Processor$OpenSession.getResult(TCLIService.java:1257)
at 
org.apache.hive.service.cli.thrift.TCLIService$Processor$OpenSession.getResult(TCLIService.java:1242)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at 
org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor.process(HadoopThriftAuthBridge.java:690)
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
...
"pool-27-thread-18" #299 prio=5 os_prio=0 tid=0x7fc3918ad800 nid=0x954a 
runnable [0x7fc3722da000]
   java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
- locked <0x0007141c4f38> (a java.io.BufferedInputStream)
at 
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
at 
org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:376)
at 
org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:453)
at 
org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:435)
at 
org.apache.thrift.transport.TSaslClientTransport.read(TSaslClientTransport.java:37)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
at 
org.apache.hadoop.hive.thrift.TFilterTransport.readAll(TFilterTransport.java:62)
at 
org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
at 
org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
at 
org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)

[jira] [Updated] (SPARK-13871) Add support for inferring filters from data constraints

2016-03-19 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-13871:
-
Assignee: Sameer Agarwal

> Add support for inferring filters from data constraints
> ---
>
> Key: SPARK-13871
> URL: https://issues.apache.org/jira/browse/SPARK-13871
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
>Priority: Minor
> Fix For: 2.0.0
>
>
> Now that we have an infrastructure for propagating data constraints, we 
> should generalize the NullFiltering optimizer rule in catalyst to 
> InferFiltersFromConstraints that can automatically infer all relevant filters 
> based on an operator's constraints while making sure of 2 things:
> (a) no redundant filters are generated, and 
> (b) filters that do not contribute to any further optimizations are not 
> generated.
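>
> A sketch of the kind of inference meant here (illustrative only, not part of 
> the original description; {{dfA}} and {{dfB}} are made-up DataFrames with an 
> {{id}} column):
> {code}
> // An equi-join condition such as dfA("id") === dfB("id") constrains both
> // sides to be non-null, so an IsNotNull filter could be inferred and pushed
> // below the join without the user writing it:
> val joined = dfA.join(dfB, dfA("id") === dfB("id"))
>
> // conceptually equivalent to pre-filtering each side first
> val prefiltered = dfA.filter(dfA("id").isNotNull)
>   .join(dfB.filter(dfB("id").isNotNull), dfA("id") === dfB("id"))
> {code}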






[jira] [Commented] (SPARK-3249) Fix links in ScalaDoc that cause warning messages in `sbt/sbt unidoc`

2016-03-19 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200975#comment-15200975
 ] 

Dongjoon Hyun commented on SPARK-3249:
--

Hi, [~mengxr].

If this issue is still valid, may I work on it?

> Fix links in ScalaDoc that cause warning messages in `sbt/sbt unidoc`
> -
>
> Key: SPARK-3249
> URL: https://issues.apache.org/jira/browse/SPARK-3249
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Affects Versions: 1.1.0
>Reporter: Xiangrui Meng
>Priority: Minor
>
> If there are multiple overloaded versions of a method, we should make the 
> links more specific. Otherwise, `sbt/sbt unidoc` generates warning messages 
> like the following:
> {code}
> [warn] 
> mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala:305: The 
> link target "org.apache.spark.mllib.tree.DecisionTree$#trainClassifier" is 
> ambiguous. Several members fit the target:
> [warn] (input: 
> org.apache.spark.api.java.JavaRDD[org.apache.spark.mllib.regression.LabeledPoint],numClassesForClassification:
>  Int,categoricalFeaturesInfo: java.util.Map[Integer,Integer],impurity: 
> String,maxDepth: Int,maxBins: Int): 
> org.apache.spark.mllib.tree.model.DecisionTreeModel in object DecisionTree 
> [chosen]
> [warn] (input: 
> org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint],numClassesForClassification:
>  Int,categoricalFeaturesInfo: Map[Int,Int],impurity: String,maxDepth: 
> Int,maxBins: Int): org.apache.spark.mllib.tree.model.DecisionTreeModel in 
> object DecisionTree
> {code}






[jira] [Created] (SPARK-13972) hive tests should fail if SQL generation failed

2016-03-19 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-13972:
---

 Summary: hive tests should fail if SQL generation failed
 Key: SPARK-13972
 URL: https://issues.apache.org/jira/browse/SPARK-13972
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan









[jira] [Commented] (SPARK-13900) Spark SQL queries with OR condition is not optimized properly

2016-03-19 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15198099#comment-15198099
 ] 

Xiao Li commented on SPARK-13900:
-

BroadcastNestedLoopJoin is much slower than BroadcastHashJoin. This is the 
major reason why your first plan is faster than the second one.

> Spark SQL queries with OR condition is not optimized properly
> -
>
> Key: SPARK-13900
> URL: https://issues.apache.org/jira/browse/SPARK-13900
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Ashok kumar Rajendran
>
> I have a large table with a few billion rows and a very small table with 4 
> dimension values. All the data is stored in Parquet format. I would like to 
> get rows that match any of these dimensions. For example,
> Select field1, field2 from A, B where A.dimension1 = B.dimension1 OR 
> A.dimension2 = B.dimension2 OR A.dimension3 = B.dimension3 OR A.dimension4 = 
> B.dimension4.
> The query plan uses BroadcastNestedLoopJoin for this and executes for a very 
> long time.
> If I execute this as UNION queries, it takes around 1.5 minutes for each 
> dimension. Each query internally does a BroadcastHashJoin.
> Select field1, field2 from A, B where A.dimension1 = B.dimension1
> UNION ALL
> Select field1, field2 from A, B where A.dimension2 = B.dimension2
> UNION ALL
> Select field1, field2 from A, B where  A.dimension3 = B.dimension3
> UNION ALL
> Select field1, field2 from A, B where  A.dimension4 = B.dimension4.
> This is obviously not an optimal solution as it scans the same table multiple 
> times, but it performs much better than the OR condition.
> It seems the SQL optimizer is not working properly, which causes a huge 
> performance impact on this type of OR query.
> Please correct me if I missed anything here.






[jira] [Commented] (SPARK-13820) TPC-DS Query 10 fails to compile

2016-03-19 Thread Roy Cecil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201550#comment-15201550
 ] 

Roy Cecil commented on SPARK-13820:
---

Davies, what is the roadmap for supporting correlated subqueries?

> TPC-DS Query 10 fails to compile
> 
>
> Key: SPARK-13820
> URL: https://issues.apache.org/jira/browse/SPARK-13820
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS Query 10 fails to compile with the following error.
> Parsing error: KW_SELECT )=> ( KW_EXISTS subQueryExpression ) -> ^( 
> TOK_SUBQUERY_EXPR ^( TOK_SUBQUERY_OP KW_EXISTS ) subQueryExpression ) );])
> at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
> at org.antlr.runtime.DFA.predict(DFA.java:144)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceEqualExpression(HiveParser_IdentifiersParser.java:8155)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceNotExpression(HiveParser_IdentifiersParser.java:9177)
> Parsing error: KW_SELECT )=> ( KW_EXISTS subQueryExpression ) -> ^( 
> TOK_SUBQUERY_EXPR ^( TOK_SUBQUERY_OP KW_EXISTS ) subQueryExpression ) );])
> at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
> at org.antlr.runtime.DFA.predict(DFA.java:144)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceEqualExpression(HiveParser_IdentifiersParser.java:8155)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceNotExpression(HiveParser_IdentifiersParser.java:9177)
> Query is pasted here for easy reproduction
>  select
>   cd_gender,
>   cd_marital_status,
>   cd_education_status,
>   count(*) cnt1,
>   cd_purchase_estimate,
>   count(*) cnt2,
>   cd_credit_rating,
>   count(*) cnt3,
>   cd_dep_count,
>   count(*) cnt4,
>   cd_dep_employed_count,
>   count(*) cnt5,
>   cd_dep_college_count,
>   count(*) cnt6
>  from
>   customer c
>   JOIN customer_address ca ON c.c_current_addr_sk = ca.ca_address_sk
>   JOIN customer_demographics ON cd_demo_sk = c.c_current_cdemo_sk
>   LEFT SEMI JOIN (select ss_customer_sk
>   from store_sales
>JOIN date_dim ON ss_sold_date_sk = d_date_sk
>   where
> d_year = 2002 and
> d_moy between 1 and 1+3) ss_wh1 ON c.c_customer_sk = 
> ss_wh1.ss_customer_sk
>  where
>   ca_county in ('Rush County','Toole County','Jefferson County','Dona Ana 
> County','La Porte County') and
>exists (
> select tmp.customer_sk from (
> select ws_bill_customer_sk as customer_sk
> from web_sales,date_dim
> where
>   web_sales.ws_sold_date_sk = date_dim.d_date_sk and
>   d_year = 2002 and
>   d_moy between 1 and 1+3
> UNION ALL
> select cs_ship_customer_sk as customer_sk
> from catalog_sales,date_dim
> where
>   catalog_sales.cs_sold_date_sk = date_dim.d_date_sk and
>   d_year = 2002 and
>   d_moy between 1 and 1+3
>   ) tmp where c.c_customer_sk = tmp.customer_sk
> )
>  group by cd_gender,
>   cd_marital_status,
>   cd_education_status,
>   cd_purchase_estimate,
>   cd_credit_rating,
>   cd_dep_count,
>   cd_dep_employed_count,
>   cd_dep_college_count
>  order by cd_gender,
>   cd_marital_status,
>   cd_education_status,
>   cd_purchase_estimate,
>   cd_credit_rating,
>   cd_dep_count,
>   cd_dep_employed_count,
>   cd_dep_college_count
>   limit 100;






[jira] [Assigned] (SPARK-14004) AttributeReference and Alias should only use their first qualifier to build SQL representations

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14004:


Assignee: Cheng Lian  (was: Apache Spark)

> AttributeReference and Alias should only use their first qualifier to build 
> SQL representations
> ---
>
> Key: SPARK-14004
> URL: https://issues.apache.org/jira/browse/SPARK-14004
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
>
> The current implementation joins all qualifiers, which is wrong.
> However, this doesn't cause any real SQL generation bugs as there is always 
> at most one qualifier for any given {{AttributeReference}} or {{Alias}}.
> We can probably use {{Option\[String\]}} instead of {{Seq\[String\]}} to 
> represent qualifiers.
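>
> A tiny illustration (hypothetical values, not the actual Catalyst code) of 
> the two representations when building the SQL name for an attribute {{id}} 
> qualified by a table alias {{t}}:
> {code}
> val name = "id"
>
> // today: Seq[String]; joining every qualifier can yield a wrong name
> val qualifiers = Seq("t")
> val sqlNameSeq = (qualifiers :+ name).mkString(".")               // "t.id"
>
> // proposed: Option[String]; at most one qualifier by construction
> val qualifier: Option[String] = Some("t")
> val sqlNameOpt = qualifier.map(q => s"$q.$name").getOrElse(name)  // "t.id"
> {code}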






[jira] [Resolved] (SPARK-13924) officially support multi-insert

2016-03-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13924.
-
   Resolution: Fixed
 Assignee: Wenchen Fan
Fix Version/s: 2.0.0

> officially support multi-insert
> ---
>
> Key: SPARK-13924
> URL: https://issues.apache.org/jira/browse/SPARK-13924
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>







[jira] [Commented] (SPARK-13691) Scala and Python generate inconsistent results

2016-03-19 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197775#comment-15197775
 ] 

Bryan Cutler commented on SPARK-13691:
--

Since the problem comes from the structure of the code in the driver, I believe 
it's not specific to local mode. For instance, with streaming k-means it can 
lead to an inconsistent model that is not updated as quickly as the Scala 
version would be, which is what led to the flaky StreamingKMeans failures in 
SPARK-10086. Whether or not it really leads to a problem in practice, I'm not 
too sure.

> Scala and Python generate inconsistent results
> --
>
> Key: SPARK-13691
> URL: https://issues.apache.org/jira/browse/SPARK-13691
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.1, 1.5.2, 1.6.0
>Reporter: Shixiong Zhu
>
> Here is an example that Scala and Python generate different results
> {code}
> Scala:
> scala> var i = 0
> i: Int = 0
> scala> val rdd = sc.parallelize(1 to 10).map(_ + i)
> scala> rdd.collect()
> res0: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
> scala> i += 1
> scala> rdd.collect()
> res2: Array[Int] = Array(2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
> Python:
> >>> i = 0
> >>> rdd = sc.parallelize(range(1, 10)).map(lambda x: x + i)
> >>> rdd.collect()
> [1, 2, 3, 4, 5, 6, 7, 8, 9]
> >>> i += 1
> >>> rdd.collect()
> [1, 2, 3, 4, 5, 6, 7, 8, 9]
> {code}
> The difference is that Scala captures the variables' values each time a job 
> runs, while Python captures them once and reuses those values for all jobs.






[jira] [Resolved] (SPARK-14020) VerifyError occurs after commit-6c2d894 in a Standalone cluster

2016-03-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14020.
---
Resolution: Not A Problem

> VerifyError occurs after commit-6c2d894 in a Standalone cluster
> ---
>
> Key: SPARK-14020
> URL: https://issues.apache.org/jira/browse/SPARK-14020
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: Spark 2.0-SNAPSHOT(6c2d894)
> Single Rack
> Standalone mode scheduling
> 8 node cluster
> 16 cores & 64G RAM / node
> Data Replication factor of 3
> Each Node has 1 Spark executors configured with 16 cores each and 40GB of RAM.
>Reporter: Ernest
>Priority: Minor
>
> In a standalone cluster with several nodes, "java.lang.VerifyError: Cannot 
> inherit from final class" occurs when submitting an application.
> Below is the stack trace:
> 16/03/19 15:17:49 INFO SparkEnv: Registering MapOutputTracker
> 16/03/19 15:17:49 INFO SparkEnv: Registering BlockManagerMaster
> Exception in thread "main" java.lang.VerifyError: Cannot inherit from final 
> class
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
> at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:333)
> at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:180)
> at 
> org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:262)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:424)
> at 
> org.apache.spark.examples.SparkPageRank$.main(SparkPageRank.scala:55)
> at org.apache.spark.examples.SparkPageRank.main(SparkPageRank.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:183)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:208)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:122)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)






[jira] [Resolved] (SPARK-13972) hive tests should fail if SQL generation failed

2016-03-19 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-13972.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11782
[https://github.com/apache/spark/pull/11782]

> hive tests should fail if SQL generation failed
> ---
>
> Key: SPARK-13972
> URL: https://issues.apache.org/jira/browse/SPARK-13972
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 2.0.0
>
>







[jira] [Updated] (SPARK-14020) VerifyError occurs after commit-6c2d894 in a Standalone cluster

2016-03-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14020:
--
Target Version/s:   (was: 2.0.0)
Priority: Minor  (was: Blocker)
   Fix Version/s: (was: 2.0.0)

This must be a problem with your deployment. You're mixing old and new snapshot 
code. This shouldn't be possible if you're using code that was compiled in a 
consistent state. The only change to a final class was making one un-final, so 
you've got old code somewhere that thinks the class is final.

Do not set Blocker, Target Version, or Fix Version, BTW. Read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first 
before filing a JIRA.

> VerifyError occurs after commit-6c2d894 in a Standalone cluster
> ---
>
> Key: SPARK-14020
> URL: https://issues.apache.org/jira/browse/SPARK-14020
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: Spark 2.0-SNAPSHOT(6c2d894)
> Single Rack
> Standalone mode scheduling
> 8 node cluster
> 16 cores & 64G RAM / node
> Data Replication factor of 3
> Each Node has 1 Spark executors configured with 16 cores each and 40GB of RAM.
>Reporter: Ernest
>Priority: Minor
>
> In a standalone cluster with several nodes, "java.lang.VerifyError: Cannot 
> inherit from final class" occurs when submitting an application.
> Below is the stack trace:
> 16/03/19 15:17:49 INFO SparkEnv: Registering MapOutputTracker
> 16/03/19 15:17:49 INFO SparkEnv: Registering BlockManagerMaster
> Exception in thread "main" java.lang.VerifyError: Cannot inherit from final 
> class
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
> at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:333)
> at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:180)
> at 
> org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:262)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:424)
> at 
> org.apache.spark.examples.SparkPageRank$.main(SparkPageRank.scala:55)
> at org.apache.spark.examples.SparkPageRank.main(SparkPageRank.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:183)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:208)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:122)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)






[jira] [Assigned] (SPARK-11319) PySpark silently accepts null values in non-nullable DataFrame fields.

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11319:


Assignee: Apache Spark

> PySpark silently accepts null values in non-nullable DataFrame fields.
> --
>
> Key: SPARK-11319
> URL: https://issues.apache.org/jira/browse/SPARK-11319
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Kevin Cox
>Assignee: Apache Spark
>
> Running the following code with a null value in a non-nullable column 
> silently works. This makes the code incredibly hard to trust.
> {code}
> In [2]: from pyspark.sql.types import *
> In [3]: sqlContext.createDataFrame([(None,)], StructType([StructField("a", 
> TimestampType(), False)])).collect()
> Out[3]: [Row(a=None)]
> {code}






[jira] [Commented] (SPARK-13863) TPCDS query 66 returns wrong results compared to TPC official result set

2016-03-19 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15198605#comment-15198605
 ] 

Xiao Li commented on SPARK-13863:
-

Please check if the table definition is different. 

> TPCDS query 66 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13863
> URL: https://issues.apache.org/jira/browse/SPARK-13863
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 66 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> Aggregations are slightly off -- e.g., in the JAN_SALES column of the "Doors 
> canno" row, Spark SQL returns 6355232.185385704; expected 6355232.31.
> Actual results:
> {noformat}
> [null,null,Fairview,Williamson County,TN,United 
> States,DHL,BARIAN,2001,9597806.850651741,1.1121820530080795E7,8670867.81564045,8994785.945689201,1.088724806326294E7,1.4187671518377304E7,9732598.460139751,1.9798897020946026E7,2.1007842467959404E7,2.149551364927292E7,3.479566905774999E7,3.3122997954660416E7,null,null,null,null,null,null,null,null,null,null,null,null,2.191359469742E7,3.2518476414670944E7,2.48856624883976E7,2.5698343830046654E7,3.373591080598068E7,3.552703167087555E7,2.5465193481492043E7,5.362323870799959E7,5.1409986978201866E7,5.415917383586836E7,9.222704311805725E7,8.343539111531019E7]
> [Bad cards must make.,621234,Fairview,Williamson County,TN,United 
> States,DHL,BARIAN,2001,9506753.593884468,8008140.429557085,6116769.711647987,1.1973045160133362E7,7756254.925520897,5352978.574095726,1.373399613500309E7,1.6418794411203384E7,1.7212743279764652E7,1.704270732417488E7,3.43049358570323E7,3.532416421229005E7,15.30301560102066,12.890698882477594,9.846160563729589,19.273003667109915,12.485238936569628,8.61668642427125,22.107605403121994,26.429323590150222,27.707342611261865,27.433635834765774,55.22063482847413,56.86128610521969,3.0534943928382874E7,2.4481686250203133E7,2.217871080008793E7,2.569579825610423E7,2.995490355044937E7,1.8084140250833035E7,3.0805576178061485E7,4.7156887432252884E7,5.115858869637826E7,5.5759943171424866E7,8.625354428184557E7,8.345155532035494E7]
> [Conventional childr,977787,Fairview,Williamson County,TN,United 
> States,DHL,BARIAN,2001,8860645.460736752,1.441581376543355E7,6761497.232810497,1.1820654735879421E7,8246260.600341797,6636877.482845306,1.1434492123092413E7,2.5673812070380323E7,2.307420611785E7,2.1834582007320404E7,2.6894900596512794E7,3.357509177109933E7,9.061938296108202,14.743306840276613,6.9151024024767125,12.08919195681618,8.43359606984118,6.787651587559771,11.694256645969329,26.257060147435304,23.598398219562938,22.330611889215547,27.505888906799534,34.337838170377935,2.3836085704864502E7,3.20733132298584E7,2.503790437837982E7,2.2659895963564873E7,2.175740087420273E7,2.4451608012176514E7,2.1933001734852314E7,5.59967034604629E7,5.737188052299309E7,6.208721474336243E7,8.284991027382469E7,8.897031933202875E7]
> [Doors canno,294242,Fairview,Williamson County,TN,United 
> States,DHL,BARIAN,2001,6355232.185385704,1.0198920296742141E7,1.0246200903741479E7,1.2209716492156029E7,8566998.262890816,8806316.75278151,9789405.6993227,1.646658496404171E7,2.6443785668474197E7,2.701604788320923E7,3.366058958298761E7,2.7462468750599384E7,21.59865751791282,34.66167405313361,34.822360178837414,41.495491779406166,29.115484067165177,29.928823053070296,33.26991285854059,55.96272783641258,89.87087386734116,91.81574310672585,114.39763726112386,93.33293258813964,2.2645142994330406E7,2.448725452685547E7,2.4925759290207863E7,3.0503655031727314E7,2.6558160276379585E7,2.0976233452690125E7,2.9895796101181984E7,5.600219855566597E7,5.348815865275085E7,7.628723580410767E7,8.248374754962921E7,8.808826726185608E7]
> [Important issues liv,138504,Fairview,Williamson County,TN,United 
> States,DHL,BARIAN,2001,1.1748784594717264E7,1.435130566355586E7,9896470.867572784,7990874.805492401,8879247.840401173,7362383.04259038,1.0011144724414349E7,1.7741201390372872E7,2.1346976135887742E7,1.8074978020030975E7,2.967512567988676E7,3.2545325348875403E7,84.8263197793368,103.6165429414014,71.45259969078715,57.694180713137534,64.10824120892663,53.156465102743454,72.28054586448297,128.09161750110374,154.12534032149065,130.5014874662896,214.25464737398747,234.97751219369408,2.7204167203903973E7,2.598037822457385E7,1.9943398915802002E7,2.5710421112384796E7,1.948448105346489E7,2.6346611484448195E7,2.5075158296625137E7,5.409477817043829E7,4.106673223178029E7,5.454705814340496E7,7.246596285337901E7,9.277032812079096E7]
> {noformat}
> Expected results:
> {noformat}
> 

[jira] [Commented] (SPARK-13978) [GSoC 2016] Build monitoring UI and infrastructure for Spark SQL and structured streaming

2016-03-19 Thread Ray Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200908#comment-15200908
 ] 

Ray Zhang commented on SPARK-13978:
---

Hi,

I'm a first-year grad student in computer engineering. I've been using Spark 
recently, and I'm interested in this project. I'm very familiar with web 
development, as I once developed a similar monitoring UI during my internship 
at a startup.

What are the guidelines for this project? Are there any prerequisites?

Thanks,
Vero

> [GSoC 2016] Build monitoring UI and infrastructure for Spark SQL and 
> structured streaming
> -
>
> Key: SPARK-13978
> URL: https://issues.apache.org/jira/browse/SPARK-13978
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Reporter: Yin Huai
>  Labels: GSOC2016
>
> Will provide more details later.






[jira] [Resolved] (SPARK-13928) Move org.apache.spark.Logging into org.apache.spark.internal.Logging

2016-03-19 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-13928.
-
Resolution: Fixed

> Move org.apache.spark.Logging into org.apache.spark.internal.Logging
> 
>
> Key: SPARK-13928
> URL: https://issues.apache.org/jira/browse/SPARK-13928
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Reynold Xin
>
> Logging was made private in Spark 2.0. If we move it, then users would be 
> able to create a Logging trait themselves to avoid changing their own code. 
> Alternatively, we could provide a compatibility package that adds a Logging 
> trait.
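>
> A rough sketch of the user-side shim the description alludes to (an 
> assumption, not an agreed design), relying on the relocated trait staying 
> reachable from the {{org.apache.spark}} package:
> {code}
> // User code could keep compiling against the old name by defining a
> // forwarding trait with the original fully-qualified name.
> package org.apache.spark
>
> trait Logging extends org.apache.spark.internal.Logging
> {code}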






[jira] [Updated] (SPARK-13998) HashingTF should extend UnaryTransformer

2016-03-19 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-13998:
---
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-13964

> HashingTF should extend UnaryTransformer
> 
>
> Key: SPARK-13998
> URL: https://issues.apache.org/jira/browse/SPARK-13998
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> Currently 
> [HashingTF|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala#L37]
>  extends {{Transformer with HasInputCol with HasOutputCol}}, but there is a 
> helper 
> [UnaryTransformer|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Transformer.scala#L79-L80]
>  abstract class for exactly this reason.






[jira] [Updated] (SPARK-13983) HiveThriftServer2 can not get "--hiveconf" or "--hivevar" variables since 1.6 version (both multi-session and single session)

2016-03-19 Thread Teng Qiu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teng Qiu updated SPARK-13983:
-
Environment: 
ubuntu, spark 1.6.0 standalone, spark 1.6.1 standalone
(tried spark branch-1.6 snapshot as well)
compiled with scala 2.10.5 and hadoop 2.6
(-Phadoop-2.6 -Psparkr -Phive -Phive-thriftserver)

  was:
ubuntu, scala 2.10.5, hadoop 2.6
spark 1.6.0 standalone, spark 1.6.1 standalone
(tried spark branch-1.6 snapshot as well)


> HiveThriftServer2 can not get "--hiveconf" or "--hivevar" variables since 
> 1.6 version (both multi-session and single session)
> --
>
> Key: SPARK-13983
> URL: https://issues.apache.org/jira/browse/SPARK-13983
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1
> Environment: ubuntu, spark 1.6.0 standalone, spark 1.6.1 standalone
> (tried spark branch-1.6 snapshot as well)
> compiled with scala 2.10.5 and hadoop 2.6
> (-Phadoop-2.6 -Psparkr -Phive -Phive-thriftserver)
>Reporter: Teng Qiu
>
> HiveThriftServer2 should be able to get "\-\-hiveconf" or "\-\-hivevar" 
> variables from the JDBC client, either from a command-line parameter of 
> beeline, such as
> {{beeline --hiveconf spark.sql.shuffle.partitions=3 --hivevar 
> db_name=default}}
> or from JDBC connection string, like
> {{jdbc:hive2://localhost:1?spark.sql.shuffle.partitions=3#db_name=default}}
> This worked in Spark 1.5.x, but after upgrading to 1.6 it no longer works.
> to reproduce this issue, try to connect to HiveThriftServer2 with beeline:
> {code}
> bin/beeline -u jdbc:hive2://localhost:1 \
> --hiveconf spark.sql.shuffle.partitions=3 \
> --hivevar db_name=default
> {code}
> or
> {code}
> bin/beeline -u 
> jdbc:hive2://localhost:1?spark.sql.shuffle.partitions=3#db_name=default
> {code}
> will get following results:
> {code}
> 0: jdbc:hive2://localhost:1> set spark.sql.shuffle.partitions;
> +---++--+
> |  key  | value  |
> +---++--+
> | spark.sql.shuffle.partitions  | 200|
> +---++--+
> 1 row selected (0.192 seconds)
> 0: jdbc:hive2://localhost:1> use ${db_name};
> Error: org.apache.spark.sql.AnalysisException: cannot recognize input near 
> '$' '{' 'db_name' in switch database statement; line 1 pos 4 (state=,code=0)
> {code}
> -
> but this bug does not affect the current spark-sql CLI; the following 
> commands work:
> {code}
> bin/spark-sql --master local[2] \
>   --hiveconf spark.sql.shuffle.partitions=3 \
>   --hivevar db_name=default
> spark-sql> set spark.sql.shuffle.partitions
> spark.sql.shuffle.partitions   3
> Time taken: 1.037 seconds, Fetched 1 row(s)
> spark-sql> use ${db_name};
> OK
> Time taken: 1.697 seconds
> {code}
> So I think it may be caused by this change: 
> https://github.com/apache/spark/pull/8909 ([SPARK-10810] [SPARK-10902] [SQL] 
> Improve session management in SQL).
> perhaps by calling {{hiveContext.newSession}}, the variables from 
> {{sessionConf}} were not loaded into the new session? 
> (https://github.com/apache/spark/pull/8909/files#diff-8f8b7f4172e8a07ff20a4dbbbcc57b1dR69)






[jira] [Updated] (SPARK-13281) Switch broadcast of RDD to exception from warning

2016-03-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13281:
--
Assignee: Wesley Tang

> Switch broadcast of RDD to exception from warning
> -
>
> Key: SPARK-13281
> URL: https://issues.apache.org/jira/browse/SPARK-13281
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: holdenk
>Assignee: Wesley Tang
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Currently we log a warning when a user tries to broadcast an RDD, for 
> compatibility with old programs which may have broadcast RDDs without using 
> the resulting broadcast variable. Since we're moving to 2.0, it seems like now 
> would be a good opportunity to replace that warning with an exception rather 
> than depend on the developer noticing the warning message.
> Related to https://issues.apache.org/jira/browse/SPARK-5063 
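>
> A minimal sketch of the proposed behaviour (illustrative only; the real check 
> would live inside {{SparkContext.broadcast}}):
> {code}
> import org.apache.spark.rdd.RDD
>
> // fail fast instead of logging a warning when the value is an RDD
> def assertBroadcastable(value: Any): Unit = {
>   if (value.isInstanceOf[RDD[_]]) {
>     throw new IllegalArgumentException(
>       "Can not directly broadcast RDDs; call collect() and broadcast the result instead.")
>   }
> }
> {code}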






[jira] [Commented] (SPARK-7992) Hide private classes/objects in in generated Java API doc

2016-03-19 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197877#comment-15197877
 ] 

Jakob Odersky commented on SPARK-7992:
--

I'll check it out

> Hide private classes/objects in in generated Java API doc
> -
>
> Key: SPARK-7992
> URL: https://issues.apache.org/jira/browse/SPARK-7992
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Documentation
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>
> After SPARK-5610, we found that private classes/objects still show up in the 
> generated Java API doc, e.g., under `org.apache.spark.api.r` we can see
> {code}
> BaseRRDD
> PairwiseRRDD
> RRDD
> SpecialLengths
> StringRRDD
> {code}
> We should update genjavadoc to hide those private classes/methods. The best 
> approach is to find a good mapping from Scala private to Java, and merge it 
> into the main genjavadoc repo. A WIP PR is at 
> https://github.com/typesafehub/genjavadoc/pull/47.






[jira] [Updated] (SPARK-13827) Can't add subquery to an operator with same-name outputs while generate SQL string

2016-03-19 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-13827:
-
Assignee: Wenchen Fan  (was: Apache Spark)

> Can't add subquery to an operator with same-name outputs while generate SQL 
> string
> --
>
> Key: SPARK-13827
> URL: https://issues.apache.org/jira/browse/SPARK-13827
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>







[jira] [Resolved] (SPARK-13977) Bring back ShuffledHashJoin

2016-03-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13977.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11788
[https://github.com/apache/spark/pull/11788]

> Bring back ShuffledHashJoin
> ---
>
> Key: SPARK-13977
> URL: https://issues.apache.org/jira/browse/SPARK-13977
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> ShuffledHashJoin is still useful when:
> 1) any partition of the build side could fit in memory
> 2) the build side is much smaller than the stream side, so building a hash 
> table on the smaller side should be faster than sorting the bigger side.
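>
> Purely illustrative pseudo-logic for those two conditions (made-up names and 
> thresholds, not Spark's planner code):
> {code}
> def preferShuffledHashJoin(buildSideBytes: Long,
>                            streamSideBytes: Long,
>                            numShufflePartitions: Int,
>                            maxPartitionBytes: Long): Boolean = {
>   // 1) any single partition of the build side should fit in memory
>   val partitionFits = buildSideBytes / numShufflePartitions <= maxPartitionBytes
>   // 2) the build side should be much smaller than the stream side
>   val muchSmaller = buildSideBytes * 3 <= streamSideBytes
>   partitionFits && muchSmaller
> }
> {code}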






[jira] [Commented] (SPARK-13859) TPCDS query 38 returns wrong results compared to TPC official result set

2016-03-19 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200465#comment-15200465
 ] 

Dilip Biswal commented on SPARK-13859:
--

Hello,

Just checked the original spec for this query from the TPC-DS website. Here is 
the template for Q38.

{code}
[_LIMITA] select [_LIMITB] count(*) from (
select distinct c_last_name, c_first_name, d_date
from store_sales, date_dim, customer
  where store_sales.ss_sold_date_sk = date_dim.d_date_sk
  and store_sales.ss_customer_sk = customer.c_customer_sk
  and d_month_seq between [DMS] and [DMS] + 11
  intersect
select distinct c_last_name, c_first_name, d_date
from catalog_sales, date_dim, customer
  where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
  and catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
  and d_month_seq between [DMS] and [DMS] + 11
  intersect
select distinct c_last_name, c_first_name, d_date
from web_sales, date_dim, customer
  where web_sales.ws_sold_date_sk = date_dim.d_date_sk
  and web_sales.ws_bill_customer_sk = customer.c_customer_sk
  and d_month_seq between [DMS] and [DMS] + 11
) hot_cust
[_LIMITC];
{code}

In this case the query in the spec uses the INTERSECT operator, where the 
implicitly generated join conditions use null-safe comparison. In other words, 
if we ran the query as-is from the spec, it would have worked.

However, the query in this JIRA has user-supplied join conditions and uses "=". 
To my knowledge, the semantics of the equality operator in SQL are well 
defined, so I don't think this is a Spark SQL issue.
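
A quick illustration of that difference in Spark SQL (assuming a {{sqlContext}} 
in a spark-shell session): '=' yields NULL when either side is NULL, while the 
null-safe operator '<=>' treats two NULLs as equal.
{code}
sqlContext.sql("SELECT NULL = NULL").collect()    // [null] -> such rows drop out of a join on '='
sqlContext.sql("SELECT NULL <=> NULL").collect()  // [true] -> kept by a null-safe comparison
{code}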

[~rxin] [~marmbrus] Please let us know your thoughts..



> TPCDS query 38 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13859
> URL: https://issues.apache.org/jira/browse/SPARK-13859
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 38 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL returns count of 0, answer set reports 107.
> Actual results:
> {noformat}
> [0]
> {noformat}
> Expected:
> {noformat}
> +-+
> |   1 |
> +-+
> | 107 |
> +-+
> {noformat}
> query used:
> {noformat}
> -- start query 38 in stream 0 using template query38.tpl and seed 
> QUALIFICATION
>  select  count(*) from (
> select distinct c_last_name, c_first_name, d_date
> from store_sales
>  JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
>  JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk
> where d_month_seq between 1200 and 1200 + 11) tmp1
>   JOIN
> (select distinct c_last_name, c_first_name, d_date
> from catalog_sales
>  JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
>  JOIN customer ON catalog_sales.cs_bill_customer_sk = 
> customer.c_customer_sk
> where d_month_seq between 1200 and 1200 + 11) tmp2 ON (tmp1.c_last_name = 
> tmp2.c_last_name) and (tmp1.c_first_name = tmp2.c_first_name) and 
> (tmp1.d_date = tmp2.d_date) 
>   JOIN
> (
> select distinct c_last_name, c_first_name, d_date
> from web_sales
>  JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
>  JOIN customer ON web_sales.ws_bill_customer_sk = 
> customer.c_customer_sk
> where d_month_seq between 1200 and 1200 + 11) tmp3 ON (tmp1.c_last_name = 
> tmp3.c_last_name) and (tmp1.c_first_name = tmp3.c_first_name) and 
> (tmp1.d_date = tmp3.d_date) 
>   limit 100
>  ;
> -- end query 38 in stream 0 using template query38.tpl
> {noformat}






[jira] [Commented] (SPARK-13820) TPC-DS Query 10 fails to compile

2016-03-19 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201896#comment-15201896
 ] 

Davies Liu commented on SPARK-13820:


Maybe 2.1?

I had a prototype for that: https://github.com/apache/spark/pull/10706

But we did not have enough resources to finish it in 2.0.

> TPC-DS Query 10 fails to compile
> 
>
> Key: SPARK-13820
> URL: https://issues.apache.org/jira/browse/SPARK-13820
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS Query 10 fails to compile with the following error.
> Parsing error: KW_SELECT )=> ( KW_EXISTS subQueryExpression ) -> ^( 
> TOK_SUBQUERY_EXPR ^( TOK_SUBQUERY_OP KW_EXISTS ) subQueryExpression ) );])
> at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
> at org.antlr.runtime.DFA.predict(DFA.java:144)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceEqualExpression(HiveParser_IdentifiersParser.java:8155)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceNotExpression(HiveParser_IdentifiersParser.java:9177)
> Parsing error: KW_SELECT )=> ( KW_EXISTS subQueryExpression ) -> ^( 
> TOK_SUBQUERY_EXPR ^( TOK_SUBQUERY_OP KW_EXISTS ) subQueryExpression ) );])
> at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
> at org.antlr.runtime.DFA.predict(DFA.java:144)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceEqualExpression(HiveParser_IdentifiersParser.java:8155)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceNotExpression(HiveParser_IdentifiersParser.java:9177)
> Query is pasted here for easy reproduction
>  select
>   cd_gender,
>   cd_marital_status,
>   cd_education_status,
>   count(*) cnt1,
>   cd_purchase_estimate,
>   count(*) cnt2,
>   cd_credit_rating,
>   count(*) cnt3,
>   cd_dep_count,
>   count(*) cnt4,
>   cd_dep_employed_count,
>   count(*) cnt5,
>   cd_dep_college_count,
>   count(*) cnt6
>  from
>   customer c
>   JOIN customer_address ca ON c.c_current_addr_sk = ca.ca_address_sk
>   JOIN customer_demographics ON cd_demo_sk = c.c_current_cdemo_sk
>   LEFT SEMI JOIN (select ss_customer_sk
>   from store_sales
>JOIN date_dim ON ss_sold_date_sk = d_date_sk
>   where
> d_year = 2002 and
> d_moy between 1 and 1+3) ss_wh1 ON c.c_customer_sk = 
> ss_wh1.ss_customer_sk
>  where
>   ca_county in ('Rush County','Toole County','Jefferson County','Dona Ana 
> County','La Porte County') and
>exists (
> select tmp.customer_sk from (
> select ws_bill_customer_sk as customer_sk
> from web_sales,date_dim
> where
>   web_sales.ws_sold_date_sk = date_dim.d_date_sk and
>   d_year = 2002 and
>   d_moy between 1 and 1+3
> UNION ALL
> select cs_ship_customer_sk as customer_sk
> from catalog_sales,date_dim
> where
>   catalog_sales.cs_sold_date_sk = date_dim.d_date_sk and
>   d_year = 2002 and
>   d_moy between 1 and 1+3
>   ) tmp where c.c_customer_sk = tmp.customer_sk
> )
>  group by cd_gender,
>   cd_marital_status,
>   cd_education_status,
>   cd_purchase_estimate,
>   cd_credit_rating,
>   cd_dep_count,
>   cd_dep_employed_count,
>   cd_dep_college_count
>  order by cd_gender,
>   cd_marital_status,
>   cd_education_status,
>   cd_purchase_estimate,
>   cd_credit_rating,
>   cd_dep_count,
>   cd_dep_employed_count,
>   cd_dep_college_count
>   limit 100;
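
As a side note: until EXISTS subqueries are supported by the parser, the correlated EXISTS in the query above can usually be expressed as a LEFT SEMI JOIN, the same form the query already uses for the store_sales condition. A simplified, hedged sketch of that rewrite, assuming a spark-shell session with {{sqlContext}} in scope (column list trimmed for brevity):

{code}
// Hedged workaround sketch, not an official fix: rewrite the correlated EXISTS
// over the UNION ALL subquery as a LEFT SEMI JOIN that the 1.6 parser accepts.
val rewritten = sqlContext.sql("""
  SELECT cd_gender, cd_marital_status, cd_education_status, count(*) AS cnt1
  FROM customer c
  JOIN customer_demographics ON cd_demo_sk = c.c_current_cdemo_sk
  LEFT SEMI JOIN (
    SELECT ws_bill_customer_sk AS customer_sk
    FROM web_sales JOIN date_dim ON ws_sold_date_sk = d_date_sk
    WHERE d_year = 2002 AND d_moy BETWEEN 1 AND 1+3
    UNION ALL
    SELECT cs_ship_customer_sk AS customer_sk
    FROM catalog_sales JOIN date_dim ON cs_sold_date_sk = d_date_sk
    WHERE d_year = 2002 AND d_moy BETWEEN 1 AND 1+3
  ) active ON c.c_customer_sk = active.customer_sk
  GROUP BY cd_gender, cd_marital_status, cd_education_status
""")
{code}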






[jira] [Updated] (SPARK-13955) Spark in yarn mode fails

2016-03-19 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-13955:
---
Description: 
I ran spark-shell in yarn-client mode, but from the logs it seems the Spark assembly 
jar is not uploaded to HDFS. This may be a known issue related to the work in 
SPARK-11157; creating this ticket to track it. [~vanzin]
{noformat}
16/03/17 17:57:48 INFO Client: Will allocate AM container, with 896 MB memory 
including 384 MB overhead
16/03/17 17:57:48 INFO Client: Setting up container launch context for our AM
16/03/17 17:57:48 INFO Client: Setting up the launch environment for our AM 
container
16/03/17 17:57:48 INFO Client: Preparing resources for our AM container
16/03/17 17:57:48 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
is set, falling back to uploading libraries under SPARK_HOME.
16/03/17 17:57:48 INFO Client: Uploading resource 
file:/Users/jzhang/github/spark/lib/apache-rat-0.10.jar -> 
hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.10.jar
16/03/17 17:57:49 INFO Client: Uploading resource 
file:/Users/jzhang/github/spark/lib/apache-rat-0.11.jar -> 
hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.11.jar
16/03/17 17:57:49 INFO Client: Uploading resource 
file:/private/var/folders/dp/hmchg5dd3vbcvds26q91spdwgp/T/spark-abed04bf-6ac2-448b-91a9-dcc1c401a18f/__spark_conf__4163776487351314654.zip
 -> 
hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/__spark_conf__4163776487351314654.zip
16/03/17 17:57:49 INFO SecurityManager: Changing view acls to: jzhang
16/03/17 17:57:49 INFO SecurityManager: Changing modify acls to: jzhang
16/03/17 17:57:49 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(jzhang); users 
with modify permissions: Set(jzhang)
16/03/17 17:57:49 INFO Client: Submitting application 6 to ResourceManager
{noformat}

message in AM container
{noformat}
Error: Could not find or load main class 
org.apache.spark.deploy.yarn.ExecutorLauncher
{noformat}

  was:
I ran spark-shell in yarn-client mode, but from the logs it seems the Spark assembly 
jar is not uploaded to HDFS. This may be a known issue related to the work in 
SPARK-11157; creating this ticket to track it. [~vanzin]
{noformat}
16/03/17 11:58:59 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
is set, falling back to uploading libraries under SPARK_HOME.
16/03/17 11:58:59 INFO Client: Uploading resource 
file:/Users/jzhang/github/spark/lib/apache-rat-0.10.jar -> 
hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0004/apache-rat-0.10.jar
16/03/17 11:58:59 INFO Client: Uploading resource 
file:/Users/jzhang/github/spark/lib/apache-rat-0.11.jar -> 
hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0004/apache-rat-0.11.jar
16/03/17 11:59:00 INFO Client: Uploading resource 
file:/private/var/folders/dp/hmchg5dd3vbcvds26q91spdwgp/T/spark-36cacbad-ca5b-482b-8ca8-607499acaaba/__spark_conf__4427292248554277597.zip
 -> 
hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0004/__spark_conf__4427292248554277597.zip
{noformat}

message in AM container
{noformat}
Error: Could not find or load main class 
org.apache.spark.deploy.yarn.ExecutorLauncher
{noformat}
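
For anyone hitting the fallback warning above: one way to avoid uploading everything under SPARK_HOME is to point {{spark.yarn.jars}} at a pre-staged location, as the warning itself suggests. A minimal sketch, assuming the Spark jars have already been copied to a (hypothetical) HDFS path:

{code}
// Hedged sketch: pre-stage the Spark jars once and reference them, so the client
// does not fall back to uploading libraries under SPARK_HOME on every submit.
// The HDFS path below is an assumption, not taken from the report.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("yarn-client-example")
  .set("spark.yarn.jars", "hdfs://localhost:9000/user/jzhang/spark-jars/*.jar")
{code}

The same property can be set in {{spark-defaults.conf}} instead of in code.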


> Spark in yarn mode fails
> 
>
> Key: SPARK-13955
> URL: https://issues.apache.org/jira/browse/SPARK-13955
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>
> I ran spark-shell in yarn-client mode, but from the logs it seems the Spark assembly 
> jar is not uploaded to HDFS. This may be a known issue related to the work in 
> SPARK-11157; creating this ticket to track it. [~vanzin]
> {noformat}
> 16/03/17 17:57:48 INFO Client: Will allocate AM container, with 896 MB memory 
> including 384 MB overhead
> 16/03/17 17:57:48 INFO Client: Setting up container launch context for our AM
> 16/03/17 17:57:48 INFO Client: Setting up the launch environment for our AM 
> container
> 16/03/17 17:57:48 INFO Client: Preparing resources for our AM container
> 16/03/17 17:57:48 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
> is set, falling back to uploading libraries under SPARK_HOME.
> 16/03/17 17:57:48 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.10.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.10.jar
> 16/03/17 17:57:49 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.11.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.11.jar
> 16/03/17 17:57:49 INFO Client: Uploading resource 
> 

[jira] [Commented] (SPARK-13313) Strongly connected components doesn't find all strongly connected components

2016-03-19 Thread Ohad Raviv (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15199811#comment-15199811
 ] 

Ohad Raviv commented on SPARK-13313:


Hi,
I am trying to use GraphX's SCC and was very concerned about this issue, so I 
took this dataset and ran it with Python's networkx strongly_connected_components 
function, and got exactly the same result: 519 SCCs with a maximal size of 4051.
So although I don't know what the real result is, the fact that both algorithms 
agree makes me believe that they are correct.
I have also looked at the code and it looks fine to me; I don't agree that you 
should change the edge direction on line 89.

> Strongly connected components doesn't find all strongly connected components
> 
>
> Key: SPARK-13313
> URL: https://issues.apache.org/jira/browse/SPARK-13313
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.6.0
>Reporter: Petar Zecevic
>
> Strongly connected components algorithm doesn't find all strongly connected 
> components. I was using Wikispeedia dataset 
> (http://snap.stanford.edu/data/wikispeedia.html) and the algorithm found 519 
> SCCs and one of them had 4051 vertices, which in reality don't have any edges 
> between them. 
> I think the problem could be on line 89 of StronglyConnectedComponents.scala 
> file where EdgeDirection.In should be changed to EdgeDirection.Out. I believe 
> the second Pregel call should use Out edge direction, the same as the first 
> call because the direction is reversed in the provided sendMsg function 
> (message is sent to source vertex and not destination vertex).
> If that is changed (line 89), the algorithm starts finding much more SCCs, 
> but eventually stack overflow exception occurs. I believe graph objects that 
> are changed through iterations should not be cached, but checkpointed.
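
For reference, a minimal sketch of running the algorithm in question on an edge-list file (the path and iteration count are assumptions), which is how results such as the 519 SCCs above can be reproduced and compared against other implementations:

{code}
// Hedged reproduction sketch: load an edge list, run GraphX SCC, and report the
// number of components and the size of the largest one.
import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "hdfs:///path/to/wikispeedia_edges.txt")
val scc = graph.stronglyConnectedComponents(numIter = 20).vertices  // (vertexId, componentId)
val sizes = scc.map { case (_, comp) => (comp, 1L) }.reduceByKey(_ + _).values
println(s"components: ${sizes.count()}, largest: ${sizes.max()}")
{code}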






[jira] [Created] (SPARK-13981) Improve Filter generated code to defer variable evaluation within operator

2016-03-19 Thread Nong Li (JIRA)
Nong Li created SPARK-13981:
---

 Summary: Improve Filter generated code to defer variable 
evaluation within operator
 Key: SPARK-13981
 URL: https://issues.apache.org/jira/browse/SPARK-13981
 Project: Spark
  Issue Type: Improvement
Reporter: Nong Li
Priority: Minor


We can improve the generated filter code by deferring variable evaluation until 
just before the variables are needed.

For example, for the predicate x > 1 AND y > b, we can generate
{code}
x = ...
if (x <= 1) continue
y = ...
{code}

instead of what we currently generate:
{code}
x = ...
y = ...
if (x <= 1) continue
...
{code}

This is helpful if evaluating y has any cost.
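
A minimal hand-written sketch of the same idea (this is illustrative Scala, not the actual generated Java; expensiveEval is a hypothetical stand-in for the cost of computing y):

{code}
object DeferredFilterSketch {
  // Hypothetical expensive derivation of y from a raw column value.
  def expensiveEval(raw: Int): Int = raw * 2

  // Evaluate the cheap predicate on x first; y is only computed for rows that
  // survive it, because && short-circuits.
  def filterDeferred(rows: Iterator[(Int, Int)], b: Int): Iterator[(Int, Int)] =
    rows.filter { case (x, yRaw) => x > 1 && expensiveEval(yRaw) > b }
}
{code}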






[jira] [Assigned] (SPARK-13970) Add Non-Negative Matrix Factorization to MLlib

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13970:


Assignee: Apache Spark

> Add Non-Negative Matrix Factorization to MLlib
> --
>
> Key: SPARK-13970
> URL: https://issues.apache.org/jira/browse/SPARK-13970
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Minor
>
> NMF finds two non-negative matrices (W, H) whose product W * H.T 
> approximates the non-negative matrix X. This factorization can be used, for 
> example, for dimensionality reduction, source separation or topic extraction.
> NMF is implemented in several packages:
> Scikit-Learn 
> (http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html#sklearn.decomposition.NMF)
> R-NMF (https://cran.r-project.org/web/packages/NMF/index.html)
> LibNMF (http://www.univie.ac.at/rlcta/software/)
> I have implemented it in MLlib according to the following papers:
> Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data 
> Analysis on MapReduce (http://research.microsoft.com/pubs/119077/DNMF.pdf)
> Algorithms for Non-negative Matrix Factorization 
> (http://papers.nips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization.pdf)
> It can be used like this:
> val m = 4
> val n = 3
> val data = Seq(
> (0L, Vectors.dense(0.0, 1.0, 2.0)),
> (1L, Vectors.dense(3.0, 4.0, 5.0)),
> (3L, Vectors.dense(9.0, 0.0, 1.0))
>   ).map(x => IndexedRow(x._1, x._2))
> val A = new IndexedRowMatrix(sc.parallelize(data)).toCoordinateMatrix()
> val k = 2
> // run the NMF algorithm
> val r = NMF.solve(A, k, 10)
> val rW = r.W.toBlockMatrix().toLocalMatrix().asInstanceOf[DenseMatrix]
> >>> org.apache.spark.mllib.linalg.DenseMatrix =
> 1.1349295096806706  1.4423101890626953E-5
> 3.453054133110303   0.46312492493865615
> 0.0                 0.0
> 0.3133764134585149  2.70684017255672
> val rH = r.H.toBlockMatrix().toLocalMatrix().asInstanceOf[DenseMatrix]
> >>> org.apache.spark.mllib.linalg.DenseMatrix =
> 0.4184163313845057  3.2719352525149286
> 1.12188012613645    0.002939823716977737
> 1.456499371939653   0.18992996116069297
> val R = rW.multiply(rH.transpose)
> >>> org.apache.spark.mllib.linalg.DenseMatrix =
> 0.4749202332761286  1.273254903877907    1.6530268574248572
> 2.9601290106732367  3.8752743120480346   5.117332475154927
> 0.0                 0.0                  0.0
> 8.987727592773672   0.35952840319637736  0.9705425982249293
> val AD = A.toBlockMatrix().toLocalMatrix()
> >>> org.apache.spark.mllib.linalg.Matrix =
> 0.0  1.0  2.0
> 3.0  4.0  5.0
> 0.0  0.0  0.0
> 9.0  0.0  1.0
> var loss = 0.0
> for (i <- 0 until AD.numRows; j <- 0 until AD.numCols) {
>   val diff = AD(i, j) - R(i, j)
>   loss += diff * diff
> }
> loss
> >>> Double = 0.5817999580912183






[jira] [Assigned] (SPARK-13942) Remove Shark-related docs and visibility for 2.x

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13942:


Assignee: Apache Spark

> Remove Shark-related docs and visibility for 2.x
> 
>
> Key: SPARK-13942
> URL: https://issues.apache.org/jira/browse/SPARK-13942
> Project: Spark
>  Issue Type: Task
>  Components: Documentation, Spark Core
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>
> `Shark` has been merged into `Spark SQL` since [July 
> 2014|https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html].
>  
> The following seem to be the only remaining legacy pieces.
> *Migration Guide*
> {code:title=sql-programming-guide.md|borderStyle=solid}
> - ## Migration Guide for Shark Users
> - ...
> - ### Scheduling
> - ...
> - ### Reducer number
> - ...
> - ### Caching
> {code}
> *SparkEnv visibility and comments*
> {code:title=SparkEnv.scala|borderStyle=solid}
> - *
> - * NOTE: This is not intended for external use. This is exposed for Shark 
> and may be made private
> - *   in a future release.
>   */
>  @DeveloperApi
> -class SparkEnv (
> +private[spark] class SparkEnv (
> {code}
> For Spark 2.x, we should clean up those docs and comments in any case. 
> However, the visibility change of the `SparkEnv` class might be controversial. 
> As a first attempt, this issue proposes to change both according to the 
> note (`This is exposed for Shark`). During the review process, the change in 
> visibility might be dropped.






[jira] [Commented] (SPARK-13877) Consider removing Kafka modules from Spark / Spark Streaming

2016-03-19 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200152#comment-15200152
 ] 

Cody Koeninger commented on SPARK-13877:


Thumbs down on renaming the package as well... from a practical point of 
view, we may need things to be in the same package hierarchy because of access 
modifiers.

> Consider removing Kafka modules from Spark / Spark Streaming
> 
>
> Key: SPARK-13877
> URL: https://issues.apache.org/jira/browse/SPARK-13877
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1
>Reporter: Hari Shreedharan
>
> Based on the discussion the PR for SPARK-13843 
> ([here|https://github.com/apache/spark/pull/11672#issuecomment-196553283]), 
> we should consider moving the Kafka modules out of Spark as well. 
> Providing newer functionality (like security) has become painful while 
> maintaining compatibility with older versions of Kafka. Moving this out 
> allows more flexibility, allowing users to mix and match Kafka and Spark 
> versions.






[jira] [Created] (SPARK-14020) VerifyError occurs after commit-6c2d894 in a Standalone cluster

2016-03-19 Thread Ernest (JIRA)
Ernest created SPARK-14020:
--

 Summary: VerifyError occurs after commit-6c2d894 in a Standalone 
cluster
 Key: SPARK-14020
 URL: https://issues.apache.org/jira/browse/SPARK-14020
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.0.0
 Environment: Spark 2.0-SNAPSHOT(6c2d894)
Single Rack
Standalone mode scheduling
8 node cluster
16 cores & 64G RAM / node
Data Replication factor of 3

Each Node has 1 Spark executors configured with 16 cores each and 40GB of RAM.

Reporter: Ernest
Priority: Blocker
 Fix For: 2.0.0


In a standalone cluster with several nodes, "java.lang.VerifyError: Cannot 
inherit from final class" occurs when submitting an application.






[jira] [Commented] (SPARK-13975) Cannot specify extra libs for executor from /extra-lib

2016-03-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15199662#comment-15199662
 ] 

Sean Owen commented on SPARK-13975:
---

You would generally control the environment Spark runs in, and would use this 
to put files in a known uniform place on the classpath. Dependent libraries 
should be bundled with your app though if you need to control their 
distribution.

> Cannot specify extra libs for executor from /extra-lib
> --
>
> Key: SPARK-13975
> URL: https://issues.apache.org/jira/browse/SPARK-13975
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.1
>Reporter: Leonid Poliakov
>
> If you build a framework on top of Spark and want to bundle it with Spark, 
> there is no easy way to add your framework libs to the executor classpath.
> Let's say I want to add my custom libs to an {{/extra-lib}} folder, ship the new 
> bundle (with my libs in it) to the nodes, and run the bundle. I want executors on 
> a node to always automatically load my libs from {{/extra-lib}}, because that's 
> how future developers would use the framework out of the box.
> The config doc says you can specify an extra classpath for the executor in 
> {{spark-defaults.conf}}, which is good because a custom config may be put in 
> the bundle for the framework, but the syntax of the property is unclear.
> You can basically specify the value that will be appended to {{-cp}} for the 
> executor Java process, so it follows Java's classpath rules, which leaves two 
> options:
> 1. specify an absolute path
> bq. spark.executor.extraClassPath /home/user/Apps/spark-bundled/extra-lib/*
> 2. specify a relative path
> bq. spark.executor.extraClassPath ../../../extra-lib/*
> But neither of these looks good: an absolute path won't work at all since you 
> cannot know where users will put the bundle, and a relative path is fragile 
> because the executor has its working directory set to something like 
> {{/work/app-20160316070310-0002/0}} and can also break if a custom worker 
> folder is configured.
> So a proper way is needed to bundle custom libs and set the executor classpath 
> to load them.
> *Expected*: you can specify {{spark.executor.extraClassPath}} relative to 
> {{$SPARK_HOME}} using placeholders, e.g. with the following syntax:
> bq. spark.executor.extraClassPath ${home}/extra-lib/*
> The code would resolve placeholders in properties to a proper absolute path, and 
> the executor would then get that absolute path in {{-cp}}.
> *Actual*: you cannot specify extra libs for the executor relative to 
> {{$SPARK_HOME}}.
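
A minimal sketch of the proposed behaviour (this is not existing Spark code; the ${home} placeholder is the ticket's suggested syntax):

{code}
// Hedged sketch of the requested feature: resolve a ${home} placeholder in
// spark.executor.extraClassPath against SPARK_HOME before building the -cp value.
object ExtraClassPathSketch {
  def resolve(extraClassPath: String, sparkHome: String): String =
    extraClassPath.replace("${home}", sparkHome)
}

// ExtraClassPathSketch.resolve("${home}/extra-lib/*", "/opt/spark")
//   == "/opt/spark/extra-lib/*"
{code}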






[jira] [Assigned] (SPARK-13972) hive tests should fail if SQL generation failed

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13972:


Assignee: (was: Apache Spark)

> hive tests should fail if SQL generation failed
> ---
>
> Key: SPARK-13972
> URL: https://issues.apache.org/jira/browse/SPARK-13972
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>







[jira] [Updated] (SPARK-14020) VerifyError occurs after commit-6c2d894 in a Standalone cluster

2016-03-19 Thread Ernest (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ernest updated SPARK-14020:
---
Description: 
In a standalone cluster with several nodes, "java.lang.VerifyError: Cannot 
inherit from final class" occurs when submitting an application.

below is the stacktrace:
16/03/19 15:17:49 INFO SparkEnv: Registering MapOutputTracker
16/03/19 15:17:49 INFO SparkEnv: Registering BlockManagerMaster
Exception in thread "main" java.lang.VerifyError: Cannot inherit from final 
class
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:333)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:180)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:262)
at org.apache.spark.SparkContext.(SparkContext.scala:424)
at org.apache.spark.examples.SparkPageRank$.main(SparkPageRank.scala:55)
at org.apache.spark.examples.SparkPageRank.main(SparkPageRank.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:183)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:208)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:122)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

  was:In a standalone cluster with several nodes, "java.lang.VerifyError: 
Cannot inherit from final class" occurs when submitting an application.


> VerifyError occurs after commit-6c2d894 in a Standalone cluster
> ---
>
> Key: SPARK-14020
> URL: https://issues.apache.org/jira/browse/SPARK-14020
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: Spark 2.0-SNAPSHOT(6c2d894)
> Single Rack
> Standalone mode scheduling
> 8 node cluster
> 16 cores & 64G RAM / node
> Data Replication factor of 3
> Each Node has 1 Spark executors configured with 16 cores each and 40GB of RAM.
>Reporter: Ernest
>Priority: Blocker
> Fix For: 2.0.0
>
>
> In a standalone cluster with several nodes, "java.lang.VerifyError: Cannot 
> inherit from final class" occurs when submitting an application.
> below is the stacktrace:
> 16/03/19 15:17:49 INFO SparkEnv: Registering MapOutputTracker
> 16/03/19 15:17:49 INFO SparkEnv: Registering BlockManagerMaster
> Exception in thread "main" java.lang.VerifyError: Cannot inherit from final 
> class
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
> at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:333)
> at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:180)
> at 
> org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:262)
> at 

[jira] [Assigned] (SPARK-13989) Remove non-vectorized/unsafe-row parquet record reader

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13989:


Assignee: Apache Spark

> Remove non-vectorized/unsafe-row parquet record reader
> --
>
> Key: SPARK-13989
> URL: https://issues.apache.org/jira/browse/SPARK-13989
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Sameer Agarwal
>Assignee: Apache Spark
>Priority: Minor
>
> Clean up the new parquet record reader by removing the non-vectorized parquet 
> reader code from `UnsafeRowParquetRecordReader`.






[jira] [Comment Edited] (SPARK-13908) Limit not pushed down

2016-03-19 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202604#comment-15202604
 ] 

Liang-Chi Hsieh edited comment on SPARK-13908 at 3/19/16 7:32 AM:
--

Rethinking this issue, I don't think it is related to pushdown of limit. 
Because the latest CollectLimit only takes a few rows (here only 1 row) from 
the iterator of data, it should not scan all the data.


was (Author: viirya):
Rethink this issue, I think it should not related to pushdown of limit. Because 
the latest CollectLimit only takes few rows (here is only 1 row) from the 
iterator of data, it should not scan all the data.

> Limit not pushed down
> -
>
> Key: SPARK-13908
> URL: https://issues.apache.org/jira/browse/SPARK-13908
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Spark compiled from git with commit 53ba6d6
>Reporter: Luca Bruno
>  Labels: performance
>
> Hello,
> I'm doing a simple query like this on a single parquet file:
> {noformat}
> SELECT *
> FROM someparquet
> LIMIT 1
> {noformat}
> The someparquet table is just a parquet file read and registered as a temporary 
> table.
> The query takes as much time (minutes) as it would to scan all the 
> records, instead of just taking the first record.
> Using parquet-tools head is instead very fast (seconds), so I guess this is a 
> missed optimization opportunity in Spark.
> The physical plan is the following:
> {noformat}
> == Physical Plan ==   
>   
> CollectLimit 1
> +- WholeStageCodegen
>:  +- Scan ParquetFormat part: struct<>, data: struct<>[...] 
> InputPaths: hdfs://...
> {noformat}
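
To illustrate the comment above about CollectLimit: taking LIMIT 1 from a lazy iterator should only materialize a single row, so if the query still scans everything, the cost is being paid before rows ever reach CollectLimit. A minimal sketch:

{code}
// Minimal sketch: Iterator#take is lazy, so pulling 1 row produces exactly 1 element.
var produced = 0
val rows: Iterator[Int] = Iterator.tabulate(1000000) { i => produced += 1; i }
val first = rows.take(1).toList
println(s"rows produced: $produced")  // prints 1, not 1000000
{code}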






[jira] [Commented] (SPARK-13993) PySpark ml.feature.RFormula/RFormulaModel support export/import

2016-03-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200703#comment-15200703
 ] 

Apache Spark commented on SPARK-13993:
--

User 'yinxusen' has created a pull request for this issue:
https://github.com/apache/spark/pull/11807

> PySpark ml.feature.RFormula/RFormulaModel support export/import
> ---
>
> Key: SPARK-13993
> URL: https://issues.apache.org/jira/browse/SPARK-13993
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Xusen Yin
>Priority: Minor
>
> Add save/load for RFormula and its model.






[jira] [Commented] (SPARK-13991) Extend mvn enforcer rule

2016-03-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200624#comment-15200624
 ] 

Apache Spark commented on SPARK-13991:
--

User 'jbonofre' has created a pull request for this issue:
https://github.com/apache/spark/pull/11803

> Extend mvn enforcer rule
> 
>
> Key: SPARK-13991
> URL: https://issues.apache.org/jira/browse/SPARK-13991
> Project: Spark
>  Issue Type: Wish
>  Components: Build
>Reporter: Jean-Baptiste Onofré
>
> Right now, the enforcer plugin forces the usage of one specific Maven version 
> (3.3.9).
> As the build works fine with other Maven 3.3.x version (tested with Maven 
> 3.3.3), it would be more flexible to extend the Maven enforcer rule.






[jira] [Resolved] (SPARK-14018) BenchmarkWholeStageCodegen should accept 64-bit num records

2016-03-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14018.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> BenchmarkWholeStageCodegen should accept 64-bit num records
> ---
>
> Key: SPARK-14018
> URL: https://issues.apache.org/jira/browse/SPARK-14018
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> 500L << 20 is actually pretty close to 32-bit int limit. I was trying to 
> increase this to 500L << 23 and got negative numbers instead.






[jira] [Assigned] (SPARK-13995) Constraints should take care of Cast

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13995:


Assignee: (was: Apache Spark)

> Constraints should take care of Cast
> 
>
> Key: SPARK-13995
> URL: https://issues.apache.org/jira/browse/SPARK-13995
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> We infer relative constraints from logical plan's expressions. However, we 
> don't consider Cast expression now.






[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-03-19 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197536#comment-15197536
 ] 

Cody Koeninger commented on SPARK-12177:


My fork is working at a very basic level for caching consumers, preferred 
locations, dynamic topic-partitions, and being able to commit offsets into 
Kafka.  Taking a configured consumer should also allow some degree of control 
over offset generation policy, just by wrapping a consumer to return different 
values for assignment() or position().  Unit tests are passing but it would need 
a lot more manual testing; I'm sure there are lots of rough edges.

Given the discussion in SPARK-13877 about moving Kafka integration to a 
separate repo, I'm going to hold off on any more work until that's decided.

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that is 
> not compatible with the old one. So I added the new consumer API as separate 
> classes in the package org.apache.spark.streaming.kafka.v09 with the changed API. 
> I didn't remove the old classes, for backward compatibility: users will not need 
> to change their old Spark applications when they upgrade to the new Spark version.
> Please review my changes.
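
For context, a minimal sketch of the Kafka 0.9 "new consumer" API that the proposed v09 classes would wrap (plain Kafka client usage with assumed broker and topic names, not Spark's streaming API):

{code}
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")  // assumed broker address
props.put("group.id", "example-group")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Seq("my-topic").asJava)
val records = consumer.poll(1000L)                // 0.9 API: poll(timeoutMs)
records.asScala.foreach(r => println(s"${r.key} -> ${r.value}"))
consumer.close()
{code}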






[jira] [Commented] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set

2016-03-19 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15198559#comment-15198559
 ] 

JESSE CHEN commented on SPARK-13865:


yes sir.

> TPCDS query 87 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13865
> URL: https://issues.apache.org/jira/browse/SPARK-13865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 87 returns wrong results compared 
> to the official result set. This is at the 1GB scale factor (validation run).
> Spark SQL returns a count of 47555; the answer set expects 47298.
> Actual results:
> {noformat}
> [47555]
> {noformat}
> {noformat}
> Expected:
> +-------+
> |   1   |
> +-------+
> | 47298 |
> +-------+
> {noformat}
> Query used:
> {noformat}
> -- start query 87 in stream 0 using template query87.tpl and seed 
> QUALIFICATION
> select count(*) 
> from 
>  (select distinct c_last_name as cln1, c_first_name as cfn1, d_date as 
> ddate1, 1 as notnull1
>from store_sales
> JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk
>where
>  d_month_seq between 1200 and 1200+11
>) tmp1
>left outer join
>   (select distinct c_last_name as cln2, c_first_name as cfn2, d_date as 
> ddate2, 1 as notnull2
>from catalog_sales
> JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON catalog_sales.cs_bill_customer_sk = 
> customer.c_customer_sk
>where 
>  d_month_seq between 1200 and 1200+11
>) tmp2 
>   on (tmp1.cln1 = tmp2.cln2)
>   and (tmp1.cfn1 = tmp2.cfn2)
>   and (tmp1.ddate1= tmp2.ddate2)
>left outer join
>   (select distinct c_last_name as cln3, c_first_name as cfn3 , d_date as 
> ddate3, 1 as notnull3
>from web_sales
> JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON web_sales.ws_bill_customer_sk = 
> customer.c_customer_sk
>where 
>  d_month_seq between 1200 and 1200+11
>) tmp3 
>   on (tmp1.cln1 = tmp3.cln3)
>   and (tmp1.cfn1 = tmp3.cfn3)
>   and (tmp1.ddate1= tmp3.ddate3)
> where  
> notnull2 is null and notnull3 is null  
> ;
> -- end query 87 in stream 0 using template query87.tpl
> {noformat}






[jira] [Assigned] (SPARK-13815) Provide better Exception messages in Pipeline load methods

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13815:


Assignee: (was: Apache Spark)

> Provide better Exception messages in Pipeline load methods
> --
>
> Key: SPARK-13815
> URL: https://issues.apache.org/jira/browse/SPARK-13815
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.0
> Environment: today's build of 2.0.0-SNAPSHOT
>Reporter: Jacek Laskowski
>Priority: Minor
>
> The following code, which loads a {{Pipeline}} from an empty {{metadata}} file, 
> throws an exception (expected) that says nothing about the real cause.
> {code}
> $ ls -l hello-pipeline/metadata
> -rw-r--r--  1 jacek  staff  0 11 mar 09:00 hello-pipeline/metadata
> scala> Pipeline.read.load("hello-pipeline")
> ...
> java.lang.UnsupportedOperationException: empty collection
> at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1344)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
> at org.apache.spark.rdd.RDD.first(RDD.scala:1341)
> at 
> org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:285)
> at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:253)
> at 
> org.apache.spark.ml.Pipeline$PipelineReader.load(Pipeline.scala:203)
> at 
> org.apache.spark.ml.Pipeline$PipelineReader.load(Pipeline.scala:197)
> {code}
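
A minimal sketch of the kind of guard the ticket asks for (an assumed helper, not the actual DefaultParamsReader code): check the metadata file before calling first(), and raise a descriptive error instead of the opaque "empty collection":

{code}
import org.apache.spark.SparkContext

// Hedged sketch: fail fast with a useful message when metadata is empty or missing.
def loadMetadataJson(sc: SparkContext, path: String): String = {
  val lines = sc.textFile(s"$path/metadata")
  if (lines.isEmpty()) {
    throw new IllegalArgumentException(
      s"Cannot load pipeline from '$path': the metadata file is empty or missing")
  }
  lines.first()
}
{code}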






[jira] [Commented] (SPARK-13118) Support for classes defined in package objects

2016-03-19 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15198738#comment-15198738
 ] 

Reynold Xin commented on SPARK-13118:
-

[~jodersky] if you find other problems, please create new tickets for them. 
Thanks.


> Support for classes defined in package objects
> --
>
> Key: SPARK-13118
> URL: https://issues.apache.org/jira/browse/SPARK-13118
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>Assignee: Jakob Odersky
> Fix For: 2.0.0
>
>
> When you define a class inside of a package object, the name ends up being 
> something like {{org.mycompany.project.package$MyClass}}.  However, when we 
> reflect on this we try to load {{org.mycompany.project.MyClass}}.
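
A minimal sketch of the naming behaviour described (hypothetical package and class names):

{code}
// Classes declared inside a package object are compiled as members of the
// synthetic `package` object, which is where the package$MyClass name comes from.
package org.mycompany.project

package object models {
  case class MyClass(id: Int)
}

// At runtime:
//   models.MyClass(1).getClass.getName
//   == "org.mycompany.project.models.package$MyClass"
// so reflection that builds "org.mycompany.project.models.MyClass" will not find it.
{code}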






[jira] [Created] (SPARK-13941) kafka.cluster.BrokerEndPoint cannot be cast to kafka.cluster.Broker

2016-03-19 Thread Hurshal Patel (JIRA)
Hurshal Patel created SPARK-13941:
-

 Summary: kafka.cluster.BrokerEndPoint cannot be cast to 
kafka.cluster.Broker
 Key: SPARK-13941
 URL: https://issues.apache.org/jira/browse/SPARK-13941
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.5.1
Reporter: Hurshal Patel


I am connecting to a Kafka cluster with the following (anonymized) code:

{code:scala}
  var stream = KafkaUtils.createDirectStreamFromZookeeper[String, Array[Byte], 
StringDecoder, DefaultDecoder](
  ssc, kafkaParams, topics)
  stream.foreachRDD { rdd =>
val df = sqlContext.createDataFrame(rdd.map(bytesToString), stringSchema)
df.foreachPartition { partition => 
  val targetNode = chooseTarget(TaskContext.partitionId)
  loadPartition(targetNode, partition)
}
  }
{code}

I am using Kafka 0.8.2.0-1.kafka1.2.0.p0.2 (Cloudera CDH 5.3.1) and Spark 1.4.1 
and this works fine.

After upgrading to Spark 1.5.1, my tasks are failing (stacktrace is below). Are 
there any notable changes to the KafkaDirectStream or KafkaRDD that would cause 
this or does Cloudera's Kafka distribution have known issues with 1.5.1?

{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in 
stage 12407.0 failed 4 times, most recent failure: Lost task 5.3 in stage 
12407.0 (TID 55638, 172.18.203.25): org.apache.spark.SparkException: Couldn't 
connect to leader for topic XXX: java.lang.ClassCastException: 
kafka.cluster.BrokerEndPoint cannot be cast to kafka.cluster.Broker
at 
org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator$$anonfun$connectLeader$1.apply(KafkaRDD.scala:164)
at 
org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator$$anonfun$connectLeader$1.apply(KafkaRDD.scala:164)
at scala.util.Either.fold(Either.scala:97)
at 
org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.connectLeader(KafkaRDD.scala:163)
at 
org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.(KafkaRDD.scala:155)
at org.apache.spark.streaming.kafka.KafkaRDD.compute(KafkaRDD.scala:135)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at 

[jira] [Updated] (SPARK-12789) Support order by position in SQL

2016-03-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12789:

Description: 
This is to support order by position in SQL, e.g.

{noformat}
select c1, c2, c3 from tbl order by 1, 3
{noformat}

should be equivalent to

{noformat}
select c1, c2, c3 from tbl order by c1, c3
{noformat}

We should make sure this also works with select *.


  was:
This is to support order by position in SQL, e.g.

{noformat}
select c1, c2, c3 from tbl order by 1, 3
{noformat}

should be equivalent to

{noformat}
select c1, c2, c3 from tbl order by c1, c3
{noformat}





> Support order by position in SQL
> 
>
> Key: SPARK-12789
> URL: https://issues.apache.org/jira/browse/SPARK-12789
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: zhichao-li
>Priority: Minor
>
> This is to support order by position in SQL, e.g.
> {noformat}
> select c1, c2, c3 from tbl order by 1, 3
> {noformat}
> should be equivalent to
> {noformat}
> select c1, c2, c3 from tbl order by c1, c3
> {noformat}
> We should make sure this also works with select *.
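
A toy sketch of the ordinal resolution described above (this is not Spark's actual analyzer rule; handling `select *` would additionally require the expanded output schema):

{code}
import scala.util.Try

// Map ORDER BY items that are integer literals onto the SELECT list by position.
def resolveOrderByOrdinals(selectList: Seq[String], orderBy: Seq[String]): Seq[String] =
  orderBy.map { item =>
    Try(item.trim.toInt).toOption match {
      case Some(pos) if pos >= 1 && pos <= selectList.length => selectList(pos - 1)
      case _ => item  // already a column reference
    }
  }

// resolveOrderByOrdinals(Seq("c1", "c2", "c3"), Seq("1", "3")) == Seq("c1", "c3")
{code}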






[jira] [Commented] (SPARK-13938) word2phrase feature created in ML

2016-03-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197890#comment-15197890
 ] 

Apache Spark commented on SPARK-13938:
--

User 's4weng' has created a pull request for this issue:
https://github.com/apache/spark/pull/11766

> word2phrase feature created in ML
> -
>
> Key: SPARK-13938
> URL: https://issues.apache.org/jira/browse/SPARK-13938
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Steve Weng
>Priority: Critical
>   Original Estimate: 840h
>  Remaining Estimate: 840h
>
> I implemented word2phrase (see http://arxiv.org/pdf/1310.4546.pdf), which 
> transforms a sentence of words into one where certain consecutive words are 
> concatenated, using a trained model/estimator (e.g. "I went to New York" 
> becomes "I went to new_york").






[jira] [Updated] (SPARK-13978) [GSoC 2016] Build monitoring UI and infrastructure for Spark SQL and structured streaming

2016-03-19 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-13978:
-
Labels: GSOC2016 mentor  (was: GSOC2016)

> [GSoC 2016] Build monitoring UI and infrastructure for Spark SQL and 
> structured streaming
> -
>
> Key: SPARK-13978
> URL: https://issues.apache.org/jira/browse/SPARK-13978
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Reporter: Yin Huai
>  Labels: GSOC2016, mentor
>
> Will provide more details later.






[jira] [Commented] (SPARK-13997) Use Hadoop 2.0 default value for compression in data sources

2016-03-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200984#comment-15200984
 ] 

Apache Spark commented on SPARK-13997:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/11806

> Use Hadoop 2.0 default value for compression in data sources
> 
>
> Key: SPARK-13997
> URL: https://issues.apache.org/jira/browse/SPARK-13997
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> Currently, the JSON, TEXT and CSV data sources use the {{CompressionCodecs}} class 
> to set compression configurations via {{option("compress", "codec")}}.
> I made this use the Hadoop 1.x default value (block-level compression). However, 
> the default value in Hadoop 2.x is record-level compression, as described in 
> [mapred-site.xml|https://hadoop.apache.org/docs/r2.7.1/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml].
> Since Spark drops Hadoop 1.x support, it makes sense to use the Hadoop 2.x default 
> values.
> According to [Hadoop: The Definitive Guide, 3rd 
> edition|https://www.safaribooksonline.com/library/view/hadoop-the-definitive/9781449328917/ch04.html],
>  these look like configurations for the unit of compression (record or block).






[jira] [Resolved] (SPARK-11011) UserDefinedType serialization should be strongly typed

2016-03-19 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11011.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11379
[https://github.com/apache/spark/pull/11379]

> UserDefinedType serialization should be strongly typed
> --
>
> Key: SPARK-11011
> URL: https://issues.apache.org/jira/browse/SPARK-11011
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: John Muller
>Priority: Minor
>  Labels: UDT
> Fix For: 2.0.0
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> A UDT's serialize method takes an Any rather than the actual type parameter.  
> The issue lies in the CatalystTypeConverters convertToCatalyst(a: Any): Any 
> method, which pattern matches against a hardcoded list of built-in SQL types.
> The planned fix is to allow the UDT to supply the CatalystTypeConverter to use 
> via a new public method on the abstract class UserDefinedType that lets the 
> implementer strongly type those conversions.
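
A toy sketch of the strongly-typed direction described (illustrative signatures only, not the actual UserDefinedType API change):

{code}
// The point of the ticket: serialize should accept UserType, not Any.
case class Point(x: Double, y: Double)

trait TypedUdtSketch[UserType] {
  def serialize(obj: UserType): Any        // typed parameter instead of Any
  def deserialize(datum: Any): UserType
}

object PointUdtSketch extends TypedUdtSketch[Point] {
  def serialize(p: Point): Any = Array(p.x, p.y)
  def deserialize(datum: Any): Point = datum match {
    case a: Array[Double] => Point(a(0), a(1))
    case other => throw new IllegalArgumentException(s"Unexpected datum: $other")
  }
}
{code}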






[jira] [Commented] (SPARK-13831) TPC-DS Query 35 fails with the following compile error

2016-03-19 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197944#comment-15197944
 ] 

kevin yu commented on SPARK-13831:
--

The same query also fails on Spark SQL 2.0, and the failure can be reduced to

select c_customer_sk from customer where exists (select cr_refunded_customer_sk 
from catalog_returns)

or 

select c_customer_sk from customer where exists (select cr_refunded_customer_sk 
from catalog_returns where cr_refunded_customer_sk = customer.c_customer_sk)

In Hive, both pass the syntax check. 
[~davies] can you confirm that Spark SQL does not support subqueries with EXISTS 
yet? 

> TPC-DS Query 35 fails with the following compile error
> --
>
> Key: SPARK-13831
> URL: https://issues.apache.org/jira/browse/SPARK-13831
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Roy Cecil
>
> TPC-DS Query 35 fails with the following compile error.
> Scala.NotImplementedError: 
> scala.NotImplementedError: No parse rules for ASTNode type: 864, text: 
> TOK_SUBQUERY_EXPR :
> TOK_SUBQUERY_EXPR 1, 439,797, 1370
>   TOK_SUBQUERY_OP 1, 439,439, 1370
> exists 1, 439,439, 1370
>   TOK_QUERY 1, 441,797, 1508
> Pasting Query 35 for easy reference.
> select
>   ca_state,
>   cd_gender,
>   cd_marital_status,
>   cd_dep_count,
>   count(*) cnt1,
>   min(cd_dep_count) cd_dep_count1,
>   max(cd_dep_count) cd_dep_count2,
>   avg(cd_dep_count) cd_dep_count3,
>   cd_dep_employed_count,
>   count(*) cnt2,
>   min(cd_dep_employed_count) cd_dep_employed_count1,
>   max(cd_dep_employed_count) cd_dep_employed_count2,
>   avg(cd_dep_employed_count) cd_dep_employed_count3,
>   cd_dep_college_count,
>   count(*) cnt3,
>   min(cd_dep_college_count) cd_dep_college_count1,
>   max(cd_dep_college_count) cd_dep_college_count2,
>   avg(cd_dep_college_count) cd_dep_college_count3
>  from
>   customer c
>   JOIN customer_address ca ON c.c_current_addr_sk = ca.ca_address_sk
>   JOIN customer_demographics ON cd_demo_sk = c.c_current_cdemo_sk
>   LEFT SEMI JOIN
>   (select ss_customer_sk
>   from store_sales
>JOIN date_dim ON ss_sold_date_sk = d_date_sk
>   where
> d_year = 2002 and
> d_qoy < 4) ss_wh1
>   ON c.c_customer_sk = ss_wh1.ss_customer_sk
>  where
>exists (
> select tmp.customer_sk from (
> select ws_bill_customer_sk  as customer_sk
> from web_sales,date_dim
> where
>   ws_sold_date_sk = d_date_sk and
>   d_year = 2002 and
>   d_qoy < 4
>UNION ALL
> select cs_ship_customer_sk  as customer_sk
> from catalog_sales,date_dim
> where
>   cs_sold_date_sk = d_date_sk and
>   d_year = 2002 and
>   d_qoy < 4
>   ) tmp where c.c_customer_sk = tmp.customer_sk
> )
>  group by ca_state,
>   cd_gender,
>   cd_marital_status,
>   cd_dep_count,
>   cd_dep_employed_count,
>   cd_dep_college_count
>  order by ca_state,
>   cd_gender,
>   cd_marital_status,
>   cd_dep_count,
>   cd_dep_employed_count,
>   cd_dep_college_count
>  limit 100;






[jira] [Created] (SPARK-13987) Build fails due to scala version mismatch between

2016-03-19 Thread JIRA
Jean-Baptiste Onofré created SPARK-13987:


 Summary: Build fails due to scala version mismatch between 
 Key: SPARK-13987
 URL: https://issues.apache.org/jira/browse/SPARK-13987
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Jean-Baptiste Onofré


Build fails on master due to a test failure in the launcher:

{code}
Tests run: 8, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.046 sec <<< 
FAILURE! - in org.apache.spark.launcher.SparkSubmitCommandBuilderSuite
testExamplesRunner(org.apache.spark.launcher.SparkSubmitCommandBuilderSuite)  
Time elapsed: 0.01 sec  <<< ERROR!
java.lang.IllegalStateException: Examples jars directory 
'/home/jbonofre/Workspace/spark/examples/target/scala-2.11/jars' does not exist.
at 
org.apache.spark.launcher.SparkSubmitCommandBuilderSuite.buildCommand(SparkSubmitCommandBuilderSuite.java:307)
at 
org.apache.spark.launcher.SparkSubmitCommandBuilderSuite.testExamplesRunner(SparkSubmitCommandBuilderSuite.java:164)
{code}

The reason is that the default Scala version in the examples module is still 2.10, 
so the file names don't match:

{code}
spark/examples/target$ ls -l|grep -i 2.10
drwxrwxr-x 4 jbonofre jbonofre4096 Feb  3 16:39 scala-2.10
-rw-rw-r-- 1 jbonofre jbonofre 1899057 Feb  3 16:39 
spark-examples_2.10-1.6.0-SNAPSHOT.jar
-rw-rw-r-- 1 jbonofre jbonofre 1320517 Feb  3 16:40 
spark-examples_2.10-1.6.0-SNAPSHOT-javadoc.jar
-rw-rw-r-- 1 jbonofre jbonofre  390527 Feb  3 16:40 
spark-examples_2.10-1.6.0-SNAPSHOT-sources.jar
-rw-rw-r-- 1 jbonofre jbonofre   12333 Feb  3 16:39 
spark-examples_2.10-1.6.0-SNAPSHOT-tests.jar
-rw-rw-r-- 1 jbonofre jbonofre8875 Feb  3 16:40 
spark-examples_2.10-1.6.0-SNAPSHOT-test-sources.jar
{code}

I will submit a PR fixing that.






[jira] [Assigned] (SPARK-13980) Incrementally serialize blocks while unrolling them in MemoryStore

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13980:


Assignee: Apache Spark  (was: Josh Rosen)

> Incrementally serialize blocks while unrolling them in MemoryStore
> --
>
> Key: SPARK-13980
> URL: https://issues.apache.org/jira/browse/SPARK-13980
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> When a block is persisted in the MemoryStore at a serialized storage level, 
> the current MemoryStore.putIterator() code will unroll the entire iterator as 
> Java objects in memory, then will turn around and serialize an iterator 
> obtained from the unrolled array. This is inefficient and doubles our peak 
> memory requirements. Instead, I think that we should incrementally serialize 
> blocks while unrolling them. A downside to incremental serialization is the 
> fact that we will need to deserialize the partially-unrolled data in case 
> there is not enough space to unroll the block and the block cannot be dropped 
> to disk. However, I'm hoping that the memory efficiency improvements will 
> outweigh any performance losses as a result of extra serialization in that 
> hopefully-rare case.
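
A minimal sketch of the idea (plain java.io serialization, with no memory accounting or drop-to-disk fallback, which the real change would need):

{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Serialize each value as soon as it is produced, instead of first unrolling the
// whole iterator into an in-memory array and serializing afterwards.
def unrollAndSerialize[T](values: Iterator[T]): Array[Byte] = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  while (values.hasNext) {
    out.writeObject(values.next())  // only the serialized form accumulates
  }
  out.close()
  bytes.toByteArray
}
{code}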






[jira] [Assigned] (SPARK-14014) Replace existing analysis.Catalog with SessionCatalog

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14014:


Assignee: Andrew Or  (was: Apache Spark)

> Replace existing analysis.Catalog with SessionCatalog
> -
>
> Key: SPARK-14014
> URL: https://issues.apache.org/jira/browse/SPARK-14014
> Project: Spark
>  Issue Type: Bug
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> As of this moment, there exist many catalogs in Spark. For Spark 2.0, we will 
> have two high level catalogs only: SessionCatalog and ExternalCatalog. 
> SessionCatalog (implemented in SPARK-13923) keeps track of temporary 
> functions and tables and delegates other operations to ExternalCatalog.
> At the same time, there's this legacy catalog called `analysis.Catalog` that 
> also tracks temporary functions and tables. The goal is to get rid of this 
> legacy catalog and replace it with SessionCatalog, which is the new thing.
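
A hedged sketch of the delegation described above; the names and signatures here are 
illustrative only, not the real SessionCatalog/ExternalCatalog APIs.

{code}
import scala.collection.mutable

// Stand-in for ExternalCatalog: the persistent, metastore-backed catalog.
trait ExternalCatalogLike {
  def tableExists(db: String, table: String): Boolean
}

// Stand-in for SessionCatalog: owns temporary tables, delegates the rest.
class SessionCatalogSketch(external: ExternalCatalogLike) {
  private val tempTables = mutable.Set[String]()

  def createTempTable(name: String): Unit = tempTables += name

  // Temporary tables are resolved locally; anything else goes to the external catalog.
  def tableExists(db: String, table: String): Boolean =
    tempTables.contains(table) || external.tableExists(db, table)
}
{code}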



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13664) Simplify and Speedup HadoopFSRelation

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13664:


Assignee: Apache Spark  (was: Michael Armbrust)

> Simplify and Speedup HadoopFSRelation
> -
>
> Key: SPARK-13664
> URL: https://issues.apache.org/jira/browse/SPARK-13664
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>Priority: Blocker
> Fix For: 2.0.0
>
>
> A majority of Spark SQL queries likely run through {{HadoopFSRelation}}; 
> however, there are currently several complexity and performance problems with 
> this code path:
>  - The class mixes the concerns of file management, schema reconciliation, 
> scan building, bucketing, partitioning, and writing data.
>  - For very large tables, we are broadcasting the entire list of files to 
> every executor. [SPARK-11441]
>  - For partitioned tables, we always do an extra projection. This not only 
> results in a copy, but also undoes much of the performance gains that we 
> expect from vectorized reads.
> This is an umbrella ticket to track a set of improvements to this codepath.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13575) Remove streaming backends' assemblies

2016-03-19 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-13575.

Resolution: Won't Fix

Most streaming backends have been removed from Spark, so I'll just close this 
instead.

> Remove streaming backends' assemblies
> -
>
> Key: SPARK-13575
> URL: https://issues.apache.org/jira/browse/SPARK-13575
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Streaming
>Reporter: Marcelo Vanzin
>
> See parent bug for details. This task covers removing assemblies for 
> streaming backends.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14006) Builds of 1.6 branch fail R style check

2016-03-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202015#comment-15202015
 ] 

Apache Spark commented on SPARK-14006:
--

User 'rekhajoshm' has created a pull request for this issue:
https://github.com/apache/spark/pull/11827

> Builds of 1.6 branch fail R style check
> ---
>
> Key: SPARK-14006
> URL: https://issues.apache.org/jira/browse/SPARK-14006
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, Tests
>Reporter: Yin Huai
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-1.6-test-sbt-hadoop-2.2/152/console



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13908) Limit not pushed down

2016-03-19 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202604#comment-15202604
 ] 

Liang-Chi Hsieh commented on SPARK-13908:
-

Rethinking this issue, I don't think it is related to pushdown of the limit. Because 
the latest CollectLimit only takes a few rows (here, only 1 row) from the iterator 
of data, it should not scan all the data.

> Limit not pushed down
> -
>
> Key: SPARK-13908
> URL: https://issues.apache.org/jira/browse/SPARK-13908
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Spark compiled from git with commit 53ba6d6
>Reporter: Luca Bruno
>  Labels: performance
>
> Hello,
> I'm doing a simple query like this on a single parquet file:
> {noformat}
> SELECT *
> FROM someparquet
> LIMIT 1
> {noformat}
> The someparquet table is just a parquet file read and registered as a temporary 
> table.
> The query takes as much time (minutes) as it would take to scan all the 
> records, instead of just returning the first record.
> Using parquet-tools head is instead very fast (seconds), hence I guess this is a 
> missing optimization opportunity in Spark.
> The physical plan is the following:
> {noformat}
> == Physical Plan ==   
>   
> CollectLimit 1
> +- WholeStageCodegen
>:  +- Scan ParquetFormat part: struct<>, data: struct<>[...] 
> InputPaths: hdfs://...
> {noformat}
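
For reference, a hedged spark-shell reproduction sketch based on the description 
above; the parquet path is a placeholder, not from the original report.

{code}
// Placeholder path; substitute a real single parquet file.
val someparquet = sqlContext.read.parquet("hdfs://namenode/path/to/file.parquet")
someparquet.registerTempTable("someparquet")

// Expected to return almost immediately if the limit is honored;
// reportedly takes minutes, as if all records were scanned.
sqlContext.sql("SELECT * FROM someparquet LIMIT 1").show()
{code}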



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14018) BenchmarkWholeStageCodegen should accept 64-bit num records

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14018:


Assignee: Reynold Xin  (was: Apache Spark)

> BenchmarkWholeStageCodegen should accept 64-bit num records
> ---
>
> Key: SPARK-14018
> URL: https://issues.apache.org/jira/browse/SPARK-14018
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> 500L << 20 is actually pretty close to the 32-bit int limit. I was trying to 
> increase this to 500L << 23 and got negative numbers instead.
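
A quick spark-shell illustration of the overflow, assuming the record count ends up 
stored in a 32-bit Int somewhere along the way:

{code}
val small = 500L << 20   // 524288000, comfortably below Int.MaxValue (2147483647)
val large = 500L << 23   // 4194304000, above Int.MaxValue

small.toInt              // 524288000
large.toInt              // -100663296, wraps around to a negative number
{code}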



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13897) GroupedData vs GroupedDataset naming is confusing

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13897:


Assignee: Apache Spark  (was: Reynold Xin)

> GroupedData vs GroupedDataset naming is confusing
> -
>
> Key: SPARK-13897
> URL: https://issues.apache.org/jira/browse/SPARK-13897
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>Priority: Blocker
>
> A placeholder to figure out a better naming scheme for the two. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13897) GroupedData vs GroupedDataset naming is confusing

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13897:


Assignee: Reynold Xin  (was: Apache Spark)

> GroupedData vs GroupedDataset naming is confusing
> -
>
> Key: SPARK-13897
> URL: https://issues.apache.org/jira/browse/SPARK-13897
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
>
> A placeholder to figure out a better naming scheme for the two. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13897) GroupedData vs GroupedDataset naming is confusing

2016-03-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202603#comment-15202603
 ] 

Apache Spark commented on SPARK-13897:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/11841

> GroupedData vs GroupedDataset naming is confusing
> -
>
> Key: SPARK-13897
> URL: https://issues.apache.org/jira/browse/SPARK-13897
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
>
> A placeholder to figure out a better naming scheme for the two. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14019) Remove noop SortOrder in Sort

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14019:


Assignee: Apache Spark

> Remove noop SortOrder in Sort
> -
>
> Key: SPARK-14019
> URL: https://issues.apache.org/jira/browse/SPARK-14019
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> When SortOrder does not contain any reference, it has no effect on the 
> sorting. Remove the noop SortOrder in Optimizer. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14019) Remove noop SortOrder in Sort

2016-03-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202599#comment-15202599
 ] 

Apache Spark commented on SPARK-14019:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/11840

> Remove noop SortOrder in Sort
> -
>
> Key: SPARK-14019
> URL: https://issues.apache.org/jira/browse/SPARK-14019
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> When SortOrder does not contain any reference, it has no effect on the 
> sorting. Remove the noop SortOrder in Optimizer. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14019) Remove noop SortOrder in Sort

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14019:


Assignee: (was: Apache Spark)

> Remove noop SortOrder in Sort
> -
>
> Key: SPARK-14019
> URL: https://issues.apache.org/jira/browse/SPARK-14019
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> When SortOrder does not contain any reference, it has no effect on the 
> sorting. Remove the noop SortOrder in Optimizer. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13958) Executor OOM due to unbounded growth of pointer array in Sorter

2016-03-19 Thread Sital Kedia (JIRA)
Sital Kedia created SPARK-13958:
---

 Summary: Executor OOM due to unbounded growth of pointer array in 
Sorter
 Key: SPARK-13958
 URL: https://issues.apache.org/jira/browse/SPARK-13958
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.1
Reporter: Sital Kedia


While running a job, we saw that the executors were OOMing because 
UnsafeExternalSorter's growPointerArrayIfNecessary function just keeps growing the 
pointer array indefinitely.

https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L292

This is a regression introduced in PR 
https://github.com/apache/spark/pull/11095
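
As a rough illustration of the grow-versus-spill distinction (a self-contained 
sketch, not the UnsafeExternalSorter code), the fix direction is to stop growing 
once a memory budget is exhausted and spill instead:

{code}
object GrowVsSpillSketch {
  // Hypothetical budget: maximum number of pointer slots allowed in memory.
  val maxSlots = 1 << 20

  def insertAll(records: Iterator[Long]): Unit = {
    var array = new Array[Long](1024)
    var used = 0
    for (record <- records) {
      if (used == array.length) {
        if (array.length * 2 <= maxSlots) {
          // Grow while under the budget.
          val bigger = new Array[Long](array.length * 2)
          Array.copy(array, 0, bigger, 0, used)
          array = bigger
        } else {
          // Instead of growing indefinitely, spill what we have and start over.
          spill(array, used)
          used = 0
        }
      }
      array(used) = record
      used += 1
    }
    if (used > 0) spill(array, used)
  }

  def spill(array: Array[Long], used: Int): Unit = {
    // Placeholder: a real sorter would sort these entries and write them to disk.
    println(s"spilling $used entries")
  }
}
{code}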





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13403) HiveConf used for SparkSQL is not based on the Hadoop configuration

2016-03-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13403.
-
   Resolution: Fixed
 Assignee: Ryan Blue
Fix Version/s: 2.0.0

> HiveConf used for SparkSQL is not based on the Hadoop configuration
> ---
>
> Key: SPARK-13403
> URL: https://issues.apache.org/jira/browse/SPARK-13403
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Fix For: 2.0.0
>
>
> The HiveConf instances used by HiveContext are not instantiated by passing in 
> the SparkContext's Hadoop conf and are instead based only on the config files 
> in the environment. Hadoop best practice is to instantiate just one 
> Configuration from the environment and then pass that conf when instantiating 
> others so that modifications aren't lost.
> Spark will set configuration variables that start with "spark.hadoop." from 
> spark-defaults.conf when creating {{sc.hadoopConfiguration}}, which are not 
> correctly passed to the HiveConf because of this.
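
A hedged spark-shell sketch of the best practice described above, using only the 
standard org.apache.hadoop.conf.Configuration API:

{code}
import org.apache.hadoop.conf.Configuration

// sc.hadoopConfiguration already contains the spark.hadoop.* overrides
// applied from spark-defaults.conf.
val base = sc.hadoopConfiguration

// Deriving new configurations from it preserves those overrides...
val derived = new Configuration(base)

// ...whereas constructing one from scratch only picks up the XML files
// on the classpath and silently drops them.
val fromEnvironmentOnly = new Configuration()
{code}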



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14006) Builds of 1.6 branch fail R style check

2016-03-19 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201859#comment-15201859
 ] 

Yin Huai commented on SPARK-14006:
--

cc [~shivaram]

> Builds of 1.6 branch fail R style check
> ---
>
> Key: SPARK-14006
> URL: https://issues.apache.org/jira/browse/SPARK-14006
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, Tests
>Reporter: Yin Huai
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-1.6-test-sbt-hadoop-2.2/152/console



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13821) TPC-DS Query 20 fails to compile

2016-03-19 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201828#comment-15201828
 ] 

Dilip Biswal commented on SPARK-13821:
--

[~roycecil] Thanks, Roy!

> TPC-DS Query 20 fails to compile
> 
>
> Key: SPARK-13821
> URL: https://issues.apache.org/jira/browse/SPARK-13821
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS Query 20 Fails to compile with the follwing Error Message
> {noformat}
> Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( 
> tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( 
> expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA 
> identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) 
> );])
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
> at org.antlr.runtime.DFA.predict(DFA.java:80)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)
> Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( 
> tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( 
> expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA 
> identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) 
> );])
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
> at org.antlr.runtime.DFA.predict(DFA.java:80)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13579) Stop building assemblies for Spark

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13579:


Assignee: Apache Spark

> Stop building assemblies for Spark
> --
>
> Key: SPARK-13579
> URL: https://issues.apache.org/jira/browse/SPARK-13579
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>
> See parent bug for more details. This change needs to wait for the other 
> sub-tasks to be finished, so that the code knows what to do when there's only 
> a bunch of jars to work with.
> This should cover both maven and sbt builds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13941) kafka.cluster.BrokerEndPoint cannot be cast to kafka.cluster.Broker

2016-03-19 Thread Hurshal Patel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hurshal Patel updated SPARK-13941:
--
Description: 
I am connecting to a Kafka cluster with the following (anonymized) code:

{code}
  var stream = KafkaUtils.createDirectStreamFromZookeeper[String, Array[Byte], 
StringDecoder, DefaultDecoder](
  ssc, kafkaParams, topics)
  stream.foreachRDD { rdd =>
val df = sqlContext.createDataFrame(rdd.map(bytesToString), stringSchema)
df.foreachPartition { partition => 
  val targetNode = chooseTarget(TaskContext.partitionId)
  loadPartition(targetNode, partition)
}
  }
{code}

I am using Kafka 0.8.2.0-1.kafka1.2.0.p0.2 (Cloudera CDH 5.3.1) and Spark 1.4.1 
and this works fine.

After upgrading to Spark 1.5.1, my tasks are failing (stacktrace is below). Are 
there any notable changes to the KafkaDirectStream or KafkaRDD that would cause 
this, or does Cloudera's Kafka distribution have known issues with 1.5.1?

{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in 
stage 12407.0 failed 4 times, most recent failure: Lost task 5.3 in stage 
12407.0 (TID 55638, 172.18.203.25): org.apache.spark.SparkException: Couldn't 
connect to leader for topic XXX: java.lang.ClassCastException: 
kafka.cluster.BrokerEndPoint cannot be cast to kafka.cluster.Broker
at 
org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator$$anonfun$connectLeader$1.apply(KafkaRDD.scala:164)
at 
org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator$$anonfun$connectLeader$1.apply(KafkaRDD.scala:164)
at scala.util.Either.fold(Either.scala:97)
at 
org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.connectLeader(KafkaRDD.scala:163)
at 
org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.<init>(KafkaRDD.scala:155)
at org.apache.spark.streaming.kafka.KafkaRDD.compute(KafkaRDD.scala:135)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
at 

[jira] [Commented] (SPARK-13968) Use MurmurHash3 for hashing String features

2016-03-19 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202590#comment-15202590
 ] 

Nick Pentreath commented on SPARK-13968:


Ah I didn't pick up the old ticket, thanks.



> Use MurmurHash3 for hashing String features
> ---
>
> Key: SPARK-13968
> URL: https://issues.apache.org/jira/browse/SPARK-13968
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Nick Pentreath
>Assignee: Yanbo Liang
>Priority: Minor
>
> Typically feature hashing is done on strings, i.e. feature names (or in the 
> case of raw feature indexes, either the string representation of the 
> numerical index can be used, or the index used "as-is" and not hashed).
> It is common to use a well-distributed hash function such as MurmurHash3. 
> This is the case in e.g. 
> [Scikit-learn|http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html#sklearn.feature_extraction.FeatureHasher].
> Currently Spark's {{HashingTF}} uses the object's hash code. Look at using 
> MurmurHash3 (at least for {{String}} which is the common case).
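
A minimal sketch of the proposal, using Scala's built-in 
scala.util.hashing.MurmurHash3 as one possible well-distributed hash for String 
feature names; the helper below is hypothetical, not the HashingTF API.

{code}
import scala.util.hashing.MurmurHash3

// Map a string feature name to a bucket in [0, numFeatures),
// using MurmurHash3 instead of String.hashCode.
def featureIndex(feature: String, numFeatures: Int): Int = {
  val hash = MurmurHash3.stringHash(feature)
  ((hash % numFeatures) + numFeatures) % numFeatures  // keep the index non-negative
}

featureIndex("word:spark", 1 << 18)
{code}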



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14019) Remove noop SortOrder in Sort

2016-03-19 Thread Xiao Li (JIRA)
Xiao Li created SPARK-14019:
---

 Summary: Remove noop SortOrder in Sort
 Key: SPARK-14019
 URL: https://issues.apache.org/jira/browse/SPARK-14019
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li


When SortOrder does not contain any reference, it has no effect on the sorting. 
Remove the noop SortOrder in Optimizer. 
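
A hedged spark-shell example of a sort order with no attribute references (assuming 
sqlContext from the shell); ordering by the constant expression has no effect and 
could be dropped by the optimizer:

{code}
import org.apache.spark.sql.functions.{col, expr}

val df = sqlContext.range(10).selectExpr("id", "id % 3 AS key")

// "1 + 1" is a foldable expression with no attribute references, so this
// ordering is equivalent to sorting by "key" alone.
df.sort(expr("1 + 1"), col("key")).show()
{code}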



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


