[jira] [Assigned] (SPARK-14019) Remove noop SortOrder in Sort

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14019:


Assignee: (was: Apache Spark)

> Remove noop SortOrder in Sort
> -
>
> Key: SPARK-14019
> URL: https://issues.apache.org/jira/browse/SPARK-14019
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> When SortOrder does not contain any reference, it has no effect on the 
> sorting. Remove the noop SortOrder in Optimizer. 
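
For context, a minimal sketch of what such an optimizer rule could look like, assuming Catalyst's `Rule`, `Sort`, and `SortOrder` APIs (an illustration only, not the patch from the linked pull request):

{code}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Sort}
import org.apache.spark.sql.catalyst.rules.Rule

// Sketch only: drop SortOrder entries whose expression references no attributes,
// since ordering by a constant cannot change the row order.
object RemoveNoopSortOrders extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case s @ Sort(order, _, child) =>
      val effective = order.filter(_.child.references.nonEmpty)
      if (effective.isEmpty) child else s.copy(order = effective)
  }
}
{code}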






[jira] [Commented] (SPARK-14019) Remove noop SortOrder in Sort

2016-03-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202599#comment-15202599
 ] 

Apache Spark commented on SPARK-14019:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/11840

> Remove noop SortOrder in Sort
> -
>
> Key: SPARK-14019
> URL: https://issues.apache.org/jira/browse/SPARK-14019
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> When SortOrder does not contain any reference, it has no effect on the 
> sorting. Remove the noop SortOrder in Optimizer. 






[jira] [Assigned] (SPARK-14019) Remove noop SortOrder in Sort

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14019:


Assignee: Apache Spark

> Remove noop SortOrder in Sort
> -
>
> Key: SPARK-14019
> URL: https://issues.apache.org/jira/browse/SPARK-14019
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> When SortOrder does not contain any reference, it has no effect on the 
> sorting. Remove the noop SortOrder in Optimizer. 






[jira] [Assigned] (SPARK-13897) GroupedData vs GroupedDataset naming is confusing

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13897:


Assignee: Reynold Xin  (was: Apache Spark)

> GroupedData vs GroupedDataset naming is confusing
> -
>
> Key: SPARK-13897
> URL: https://issues.apache.org/jira/browse/SPARK-13897
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
>
> A placeholder to figure out a better naming scheme for the two. 






[jira] [Commented] (SPARK-13897) GroupedData vs GroupedDataset naming is confusing

2016-03-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202603#comment-15202603
 ] 

Apache Spark commented on SPARK-13897:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/11841

> GroupedData vs GroupedDataset naming is confusing
> -
>
> Key: SPARK-13897
> URL: https://issues.apache.org/jira/browse/SPARK-13897
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
>
> A placeholder to figure out a better naming scheme for the two. 






[jira] [Assigned] (SPARK-13897) GroupedData vs GroupedDataset naming is confusing

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13897:


Assignee: Apache Spark  (was: Reynold Xin)

> GroupedData vs GroupedDataset naming is confusing
> -
>
> Key: SPARK-13897
> URL: https://issues.apache.org/jira/browse/SPARK-13897
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>Priority: Blocker
>
> A placeholder to figure out a better naming scheme for the two. 






[jira] [Assigned] (SPARK-14018) BenchmarkWholeStageCodegen should accept 64-bit num records

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14018:


Assignee: Reynold Xin  (was: Apache Spark)

> BenchmarkWholeStageCodegen should accept 64-bit num records
> ---
>
> Key: SPARK-14018
> URL: https://issues.apache.org/jira/browse/SPARK-14018
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> 500L << 20 is actually pretty close to the 32-bit int limit. I was trying to 
> increase this to 500L << 23 and got negative numbers instead.
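
To spell out the overflow being described (plain Scala arithmetic; the only assumption is that the record count was held in a 32-bit int):

{code}
// Int.MaxValue is 2,147,483,647
val fits      = 500L << 20   // 524,288,000  -- still representable as an Int
val overflows = 500  << 23   // Int arithmetic: 4,194,304,000 wraps to -100,663,296
val fixed     = 500L << 23   // Long arithmetic: 4,194,304,000 stays positive
{code}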






[jira] [Commented] (SPARK-13908) Limit not pushed down

2016-03-19 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202604#comment-15202604
 ] 

Liang-Chi Hsieh commented on SPARK-13908:
-

Rethinking this issue, I think it should not be related to pushdown of the limit. 
Because the latest CollectLimit only takes a few rows (here, only 1 row) from the 
iterator of data, it should not scan all the data.

> Limit not pushed down
> -
>
> Key: SPARK-13908
> URL: https://issues.apache.org/jira/browse/SPARK-13908
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Spark compiled from git with commit 53ba6d6
>Reporter: Luca Bruno
>  Labels: performance
>
> Hello,
> I'm doing a simple query like this on a single parquet file:
> {noformat}
> SELECT *
> FROM someparquet
> LIMIT 1
> {noformat}
> The someparquet table is just a parquet read and registered as temporary 
> table.
> The query takes as much time (minutes) as it would by scanning all the 
> records, instead of just taking the first record.
> Using parquet-tools head is instead very fast (seconds), hence I guess it's a 
> missing optimization opportunity from spark.
> The physical plan is the following:
> {noformat}
> == Physical Plan ==   
>   
> CollectLimit 1
> +- WholeStageCodegen
>:  +- Scan ParquetFormat part: struct<>, data: struct<>[...] 
> InputPaths: hdfs://...
> {noformat}
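
A minimal way to reproduce the setup described above (the path and table name here are placeholders, not taken from the report):

{code}
// Register a single Parquet file as a temporary table, then take one row.
val df = sqlContext.read.parquet("hdfs://host/path/to/someparquet")
df.registerTempTable("someparquet")
sqlContext.sql("SELECT * FROM someparquet LIMIT 1").show()
{code}

Per the comment above, CollectLimit should only pull a single row from the iterator, so this should not require a full scan.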






[jira] [Created] (SPARK-14019) Remove noop SortOrder in Sort

2016-03-19 Thread Xiao Li (JIRA)
Xiao Li created SPARK-14019:
---

 Summary: Remove noop SortOrder in Sort
 Key: SPARK-14019
 URL: https://issues.apache.org/jira/browse/SPARK-14019
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li


When SortOrder does not contain any reference, it has no effect on the sorting. 
Remove the noop SortOrder in Optimizer. 






[jira] [Commented] (SPARK-13968) Use MurmurHash3 for hashing String features

2016-03-19 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202590#comment-15202590
 ] 

Nick Pentreath commented on SPARK-13968:


Ah I didn't pick up the old ticket, thanks.



> Use MurmurHash3 for hashing String features
> ---
>
> Key: SPARK-13968
> URL: https://issues.apache.org/jira/browse/SPARK-13968
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Nick Pentreath
>Assignee: Yanbo Liang
>Priority: Minor
>
> Typically feature hashing is done on strings, i.e. feature names (or in the 
> case of raw feature indexes, either the string representation of the 
> numerical index can be used, or the index used "as-is" and not hashed).
> It is common to use a well-distributed hash function such as MurmurHash3. 
> This is the case in e.g. 
> [Scikit-learn|http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html#sklearn.feature_extraction.FeatureHasher].
> Currently Spark's {{HashingTF}} uses the object's hash code. Look at using 
> MurmurHash3 (at least for {{String}} which is the common case).
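
A small illustration of the difference being discussed, using Scala's built-in MurmurHash3 (the feature name, seed, and bucket count below are made up for the example):

{code}
import scala.util.hashing.MurmurHash3

val numFeatures = 1 << 18
def bucket(hash: Int): Int = ((hash % numFeatures) + numFeatures) % numFeatures

val feature = "user_age"
val nativeIdx = bucket(feature.hashCode)                      // object hash code, as HashingTF uses today
val murmurIdx = bucket(MurmurHash3.stringHash(feature, 42))   // seeded, well-distributed alternative
{code}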






[jira] [Updated] (SPARK-13941) kafka.cluster.BrokerEndPoint cannot be cast to kafka.cluster.Broker

2016-03-19 Thread Hurshal Patel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hurshal Patel updated SPARK-13941:
--
Description: 
I am connecting to a Kafka cluster with the following (anonymized) code:

{code}
  var stream = KafkaUtils.createDirectStreamFromZookeeper[String, Array[Byte],
    StringDecoder, DefaultDecoder](ssc, kafkaParams, topics)
  stream.foreachRDD { rdd =>
    val df = sqlContext.createDataFrame(rdd.map(bytesToString), stringSchema)
    df.foreachPartition { partition =>
      val targetNode = chooseTarget(TaskContext.partitionId)
      loadPartition(targetNode, partition)
    }
  }
{code}

I am using Kafka 0.8.2.0-1.kafka1.2.0.p0.2 (Cloudera CDH 5.3.1) and Spark 1.4.1 
and this works fine.

After upgrading to Spark 1.5.1, my tasks are failing (stacktrace is below). Are 
there any notable changes to the KafkaDirectStream or KafkaRDD that would cause 
this or does Cloudera's Kafka distribution have known issues with 1.5.1?

{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in 
stage 12407.0 failed 4 times, most recent failure: Lost task 5.3 in stage 
12407.0 (TID 55638, 172.18.203.25): org.apache.spark.SparkException: Couldn't 
connect to leader for topic XXX: java.lang.ClassCastException: 
kafka.cluster.BrokerEndPoint cannot be cast to kafka.cluster.Broker
at 
org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator$$anonfun$connectLeader$1.apply(KafkaRDD.scala:164)
at 
org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator$$anonfun$connectLeader$1.apply(KafkaRDD.scala:164)
at scala.util.Either.fold(Either.scala:97)
at 
org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.connectLeader(KafkaRDD.scala:163)
at 
org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.<init>(KafkaRDD.scala:155)
at org.apache.spark.streaming.kafka.KafkaRDD.compute(KafkaRDD.scala:135)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
at 

[jira] [Assigned] (SPARK-13579) Stop building assemblies for Spark

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13579:


Assignee: Apache Spark

> Stop building assemblies for Spark
> --
>
> Key: SPARK-13579
> URL: https://issues.apache.org/jira/browse/SPARK-13579
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>
> See parent bug for more details. This change needs to wait for the other 
> sub-tasks to be finished, so that the code knows what to do when there's only 
> a bunch of jars to work with.
> This should cover both maven and sbt builds.






[jira] [Commented] (SPARK-13821) TPC-DS Query 20 fails to compile

2016-03-19 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201828#comment-15201828
 ] 

Dilip Biswal commented on SPARK-13821:
--

[~roycecil] Thanks Roy !!

> TPC-DS Query 20 fails to compile
> 
>
> Key: SPARK-13821
> URL: https://issues.apache.org/jira/browse/SPARK-13821
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS Query 20 Fails to compile with the follwing Error Message
> {noformat}
> Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( 
> tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( 
> expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA 
> identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) 
> );])
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
> at org.antlr.runtime.DFA.predict(DFA.java:80)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)
> Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( 
> tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( 
> expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA 
> identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) 
> );])
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
> at org.antlr.runtime.DFA.predict(DFA.java:80)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)
> {noformat}






[jira] [Commented] (SPARK-14006) Builds of 1.6 branch fail R style check

2016-03-19 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201859#comment-15201859
 ] 

Yin Huai commented on SPARK-14006:
--

cc [~shivaram]

> Builds of 1.6 branch fail R style check
> ---
>
> Key: SPARK-14006
> URL: https://issues.apache.org/jira/browse/SPARK-14006
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, Tests
>Reporter: Yin Huai
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-1.6-test-sbt-hadoop-2.2/152/console






[jira] [Resolved] (SPARK-13403) HiveConf used for SparkSQL is not based on the Hadoop configuration

2016-03-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13403.
-
   Resolution: Fixed
 Assignee: Ryan Blue
Fix Version/s: 2.0.0

> HiveConf used for SparkSQL is not based on the Hadoop configuration
> ---
>
> Key: SPARK-13403
> URL: https://issues.apache.org/jira/browse/SPARK-13403
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Fix For: 2.0.0
>
>
> The HiveConf instances used by HiveContext are not instantiated by passing in 
> the SparkContext's Hadoop conf and are instead based only on the config files 
> in the environment. Hadoop best practice is to instantiate just one 
> Configuration from the environment and then pass that conf when instantiating 
> others so that modifications aren't lost.
> Spark will set configuration variables that start with "spark.hadoop." from 
> spark-defaults.conf when creating {{sc.hadoopConfiguration}}, which are not 
> correctly passed to the HiveConf because of this.
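
A sketch of the Hadoop best practice the description refers to (illustrative only; it assumes an existing SparkContext `sc` and uses HiveConf's copy constructor so earlier modifications carry over):

{code}
import org.apache.hadoop.hive.conf.HiveConf

// Derive the HiveConf from the already-populated Hadoop configuration instead of
// re-reading config files, so "spark.hadoop.*" settings from spark-defaults.conf survive.
val hiveConf = new HiveConf(sc.hadoopConfiguration, classOf[HiveConf])
{code}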






[jira] [Created] (SPARK-13958) Executor OOM due to unbounded growth of pointer array in Sorter

2016-03-19 Thread Sital Kedia (JIRA)
Sital Kedia created SPARK-13958:
---

 Summary: Executor OOM due to unbounded growth of pointer array in 
Sorter
 Key: SPARK-13958
 URL: https://issues.apache.org/jira/browse/SPARK-13958
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.1
Reporter: Sital Kedia


While running a job we saw that the executors are OOMing because in 
UnsafeExternalSorter's growPointerArrayIfNecessary function, we are just 
growing the pointer array indefinitely. 

https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L292

This is a regression introduced by 
https://github.com/apache/spark/pull/11095








[jira] [Commented] (SPARK-13862) TPCDS query 49 returns wrong results compared to TPC official result set

2016-03-19 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15198585#comment-15198585
 ] 

Xiao Li commented on SPARK-13862:
-

I will also take this. Thanks!

> TPCDS query 49 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13862
> URL: https://issues.apache.org/jira/browse/SPARK-13862
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 49 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL has the right answer but in the wrong order (and there is an 'order by' 
> in the query).
> Actual results:
> {noformat}
> [store,9797,0.8000,2,2]
> [store,12641,0.81609195402298850575,3,3]
> [store,6661,0.92207792207792207792,7,7]
> [store,13013,0.94202898550724637681,8,8]
> [store,9029,1.,10,10]
> [web,15597,0.66197183098591549296,3,3]
> [store,14925,0.96470588235294117647,9,9]
> [store,4063,1.,10,10]
> [catalog,8929,0.7625,7,7]
> [store,11589,0.82653061224489795918,6,6]
> [store,1171,0.82417582417582417582,5,5]
> [store,9471,0.7750,1,1]
> [catalog,12577,0.65591397849462365591,3,3]
> [web,97,0.90361445783132530120,9,8]
> [web,85,0.85714285714285714286,8,7]
> [catalog,361,0.74647887323943661972,5,5]
> [web,2915,0.69863013698630136986,4,4]
> [web,117,0.9250,10,9]
> [catalog,9295,0.77894736842105263158,9,9]
> [web,3305,0.7375,6,16]
> [catalog,16215,0.79069767441860465116,10,10]
> [web,7539,0.5900,1,1]
> [catalog,17543,0.57142857142857142857,1,1]
> [catalog,3411,0.71641791044776119403,4,4]
> [web,11933,0.71717171717171717172,5,5]
> [catalog,14513,0.63541667,2,2]
> [store,15839,0.81632653061224489796,4,4]
> [web,3337,0.62650602409638554217,2,2]
> [web,5299,0.92708333,11,10]
> [catalog,8189,0.74698795180722891566,6,6]
> [catalog,14869,0.77173913043478260870,8,8]
> [web,483,0.8000,7,6]
> {noformat}
> Expected results:
> {noformat}
> +-+---++-+---+
> | CHANNEL |  ITEM |   RETURN_RATIO | RETURN_RANK | CURRENCY_RANK |
> +-+---++-+---+
> | catalog | 17543 |  .5714285714285714 |   1 | 1 |
> | catalog | 14513 |  .63541666 |   2 | 2 |
> | catalog | 12577 |  .6559139784946236 |   3 | 3 |
> | catalog |  3411 |  .7164179104477611 |   4 | 4 |
> | catalog |   361 |  .7464788732394366 |   5 | 5 |
> | catalog |  8189 |  .7469879518072289 |   6 | 6 |
> | catalog |  8929 |  .7625 |   7 | 7 |
> | catalog | 14869 |  .7717391304347826 |   8 | 8 |
> | catalog |  9295 |  .7789473684210526 |   9 | 9 |
> | catalog | 16215 |  .7906976744186046 |  10 |10 |
> | store   |  9471 |  .7750 |   1 | 1 |
> | store   |  9797 |  .8000 |   2 | 2 |
> | store   | 12641 |  .8160919540229885 |   3 | 3 |
> | store   | 15839 |  .8163265306122448 |   4 | 4 |
> | store   |  1171 |  .8241758241758241 |   5 | 5 |
> | store   | 11589 |  .8265306122448979 |   6 | 6 |
> | store   |  6661 |  .9220779220779220 |   7 | 7 |
> | store   | 13013 |  .9420289855072463 |   8 | 8 |
> | store   | 14925 |  .9647058823529411 |   9 | 9 |
> | store   |  4063 | 1. |  10 |10 |
> | store   |  9029 | 1. |  10 |10 |
> | web |  7539 |  .5900 |   1 | 1 |
> | web |  3337 |  .6265060240963855 |   2 | 2 |
> | web | 15597 |  .6619718309859154 |   3 | 3 |
> | web |  2915 |  .6986301369863013 |   4 | 4 |
> | web | 11933 |  .7171717171717171 |   5 | 5 |
> | web |  3305 |  .7375 |   6 |16 |
> | web |   483 |  .8000 |   7 | 6 |
> | web |85 |  .8571428571428571 |   8 | 7 |
> | web |97 |  .9036144578313253 |   9 | 8 |
> | web |   117 |  .9250 |  10 | 9 |
> | web |  5299 |  .92708333 |  11 |10 |
> 

[jira] [Commented] (SPARK-13986) Make `DeveloperApi`-annotated things public

2016-03-19 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200525#comment-15200525
 ] 

Timothy Hunter commented on SPARK-13986:


[~dongjoon] how did you find the conflicting annotation? It would be great to 
automate this as part of the style checks

> Make `DeveloperApi`-annotated things public
> ---
>
> Key: SPARK-13986
> URL: https://issues.apache.org/jira/browse/SPARK-13986
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Spark Core
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Spark uses the `@DeveloperApi` annotation, but sometimes it seems to conflict 
> with its visibility. This issue proposes to fix those conflicts. The following 
> is an example.
> {code:title=JobResult.scala|borderStyle=solid}
> @DeveloperApi
> sealed trait JobResult
> @DeveloperApi
> case object JobSucceeded extends JobResult
> @DeveloperApi
> -private[spark] case class JobFailed(exception: Exception) extends JobResult
> +case class JobFailed(exception: Exception) extends JobResult
> {code}






[jira] [Issue Comment Deleted] (SPARK-13041) Add a driver history ui link and a mesos sandbox link on the dispatcher's ui page for each driver

2016-03-19 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-13041:

Comment: was deleted

(was: There is a requirement for: "history server links in the dispatcher ui" 
as well.)

> Add a driver history ui link and a mesos sandbox link on the dispatcher's ui 
> page for each driver
> -
>
> Key: SPARK-13041
> URL: https://issues.apache.org/jira/browse/SPARK-13041
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Stavros Kontopoulos
>Priority: Minor
>
> It would be convenient to have the driver's history uri from the history 
> server and the driver's mesos sandbox uri on the dispatcher's ui.






[jira] [Comment Edited] (SPARK-5594) SparkException: Failed to get broadcast (TorrentBroadcast)

2016-03-19 Thread Hiten Patel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15198556#comment-15198556
 ] 

Hiten Patel edited comment on SPARK-5594 at 3/17/16 1:31 AM:
-

Yes, this was indeed the problem in my case too. 

I had a custom function all in the same Scala object (Spark Streaming job) with 
input (Kafka direct streaming) and output (to Cassandra). 

After moving the function to a separate object that extends Serializable, it 
finally worked! A BIG thanks to Sal Uryasev for reporting the solution. 

It does make sense after realizing the proper solution how it should work in 
Spark. However, it's really vexing to nail down such issues when you hit this the 
first time. I think this is more of a coding practice and needs to be documented 
in the official documentation as one of the coding practices. It's very easy to 
get trapped by such issues.

Let me know the right place where this needs to go in the documentation and I can 
send a PR. 




was (Author: echelon_apache):
Yes, this was indeed the problem in my case too. 

I had a custom function all in the same Scala object (Spark Streaming job) with 
input (Kafka direct streaming) and output (to Cassandra). 

After moving the function to a seperate object that extends Serializable, it 
did work finally !!. A BIG thanks to Sal Uryasev for reporting the solution. 

It does make sense after realizing the proper solution on how indeed it should 
work in Spark. However, it's really vexing to nail down such issues since the 
stacktraces are too generic and could mean lot of things. I think this is more 
of a coding practice and needs to be documented in the official documentation 
as  one of the coding practices. It''s very easy to get trapped into such 
issues for anyone.

Let me know the right place where this needs to go in documentation and I can 
send a PR. 
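
A minimal sketch of the refactoring described in this comment (all names below are hypothetical):

{code}
// Keep the transformation in its own serializable object instead of the streaming
// job's enclosing object, so Spark only has to ship this small helper with the closure.
object RecordTransforms extends Serializable {
  def enrich(line: String): String = line.trim.toLowerCase
}

// Inside the streaming job:
//   stream.map(RecordTransforms.enrich).foreachRDD { rdd => /* write to Cassandra */ }
{code}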



> SparkException: Failed to get broadcast (TorrentBroadcast)
> --
>
> Key: SPARK-5594
> URL: https://issues.apache.org/jira/browse/SPARK-5594
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0, 1.3.0
>Reporter: John Sandiford
>Priority: Critical
>
> I am uncertain whether this is a bug, however I am getting the error below 
> when running on a cluster (works locally), and have no idea what is causing 
> it, or where to look for more information.
> Any help is appreciated.  Others appear to experience the same issue, but I 
> have not found any solutions online.
> Please note that this only happens with certain code and is repeatable, all 
> my other spark jobs work fine.
> {noformat}
> ERROR TaskSetManager: Task 3 in stage 6.0 failed 4 times; aborting job
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 3 in stage 6.0 failed 4 times, most recent failure: 
> Lost task 3.3 in stage 6.0 (TID 24, ): java.io.IOException: 
> org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of 
> broadcast_6
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> Caused by: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 
> of broadcast_6
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
> at scala.Option.getOrElse(Option.scala:120)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136)
> at 
> 

[jira] [Assigned] (SPARK-14021) Support custom context derived from HiveContext for SparkSQLEnv

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14021:


Assignee: (was: Apache Spark)

> Support custom context derived from HiveContext for SparkSQLEnv
> ---
>
> Key: SPARK-14021
> URL: https://issues.apache.org/jira/browse/SPARK-14021
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Adrian Wang
>







[jira] [Resolved] (SPARK-13827) Can't add subquery to an operator with same-name outputs while generate SQL string

2016-03-19 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-13827.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11658
[https://github.com/apache/spark/pull/11658]

> Can't add subquery to an operator with same-name outputs while generate SQL 
> string
> --
>
> Key: SPARK-13827
> URL: https://issues.apache.org/jira/browse/SPARK-13827
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
> Fix For: 2.0.0
>
>







[jira] [Commented] (SPARK-13952) spark.ml GBT algs need to use random seed

2016-03-19 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201703#comment-15201703
 ] 

Seth Hendrickson commented on SPARK-13952:
--

Yes, I can work on it.

> spark.ml GBT algs need to use random seed
> -
>
> Key: SPARK-13952
> URL: https://issues.apache.org/jira/browse/SPARK-13952
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> SPARK-12379 copied the GBT implementation from spark.mllib to spark.ml.  
> There was one bug I found: The random seed is not used.  A reasonable fix 
> will be to use the original seed to generate a new seed for each tree trained.
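
A sketch of the per-tree seeding idea mentioned above (illustrative only; the base seed and tree count are placeholders, not the spark.ml implementation):

{code}
import scala.util.Random

val baseSeed = 42L
val numTrees = 20
// Derive a distinct but reproducible seed for each tree from the user-supplied seed.
val perTreeSeeds: Array[Long] = {
  val rng = new Random(baseSeed)
  Array.fill(numTrees)(rng.nextLong())
}
{code}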






[jira] [Resolved] (SPARK-13954) spark-shell starts with exceptions

2016-03-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13954.
---
Resolution: Not A Problem

Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before 
opening a JIRA. This is not a bug. You need to create the directory you have 
specified as the parent directory for the event log dirs. It's not necessarily 
reasonable for Spark to create it and set ACLs on it.
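
For reference, the fix implied by the resolution above is to create the event log directory before enabling event logging, or to point the logs at a directory that already exists:

{noformat}
mkdir -p /tmp/spark-events
# or, in conf/spark-defaults.conf:
# spark.eventLog.dir  file:///some/existing/directory
{noformat}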

> spark-shell starts with exceptions
> -
>
> Key: SPARK-13954
> URL: https://issues.apache.org/jira/browse/SPARK-13954
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.6.1
> Environment: Mac OS
>Reporter: Pranas Baliuka
>Priority: Minor
>
> To reproduce:
> Download {{spark-1.6.1-bin-hadoop2.6.tgz}}
> Create {{conf/spark-defaults.conf}}
> {code}
> spark.eventLog.enabled   true
> {code}
> {code}
> n:bin s$ bin./spark-shell
> log4j:WARN No appenders could be found for logger 
> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
> info.
> Using Spark's repl log4j profile: 
> org/apache/spark/log4j-defaults-repl.properties
> To adjust logging level use sc.setLogLevel("INFO")
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
>       /_/
> Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_40)
> Type in expressions to have them evaluated.
> Type :help for more information.
> 16/03/17 14:21:16 ERROR SparkContext: Error initializing SparkContext.
> java.io.FileNotFoundException: File file:/tmp/spark-events does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
>   at 
> org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:100)
>   at org.apache.spark.SparkContext.(SparkContext.scala:549)
>   at 
> org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017)
>   at $line3.$read$$iwC$$iwC.<init>(<console>:15)
>   at $line3.$read$$iwC.<init>(<console>:24)
>   at $line3.$read.<init>(<console>:26)
>   at $line3.$read$.<init>(<console>:30)
>   at $line3.$read$.<clinit>()
>   at $line3.$eval$.<init>(<console>:7)
>   at $line3.$eval$.<clinit>()
>   at $line3.$eval.$print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
>   at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
>   at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
>   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
>   at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:125)
>   at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:124)
>   at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
>   at 
> org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:124)
>   at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:974)
>   at 
> org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:159)
>   at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:64)
>   at 
> org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:108)
>   at 
> org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:64)
>   at 
> 

[jira] [Commented] (SPARK-913) log the size of each shuffle block in block manager

2016-03-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201280#comment-15201280
 ] 

Apache Spark commented on SPARK-913:


User 'devaraj-kavali' has created a pull request for this issue:
https://github.com/apache/spark/pull/11819

> log the size of each shuffle block in block manager
> ---
>
> Key: SPARK-913
> URL: https://issues.apache.org/jira/browse/SPARK-913
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Reporter: Reynold Xin
>Priority: Minor
>







[jira] [Commented] (SPARK-13979) Killed executor is respawned without AWS keys in standalone spark cluster

2016-03-19 Thread Mitesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200183#comment-15200183
 ] 

Mitesh commented on SPARK-13979:


I'm seeing this too. It's really annoying because I set the S3 access and secret 
keys in all the places that the docs specify:

{noformat}
sparkConf.hadoopConf.set("fs.s3n.awsAccessKeyId", ..)
sparkConf.set("spark.hadoop.fs.s3n.awsAccessKeyId", ..)
sparkConf.set("spark.hadoop.cloneConf", true)
/core-site.xml   fs.s3n.awsAccessKeyId
/spark-env.sh   export AWS_ACCESS_KEY_ID = ...
{noformat}

None of that seems to work. If I kill a running executor, it comes back up and 
doesn't have the keys anymore.

> Killed executor is respawned without AWS keys in standalone spark cluster
> -
>
> Key: SPARK-13979
> URL: https://issues.apache.org/jira/browse/SPARK-13979
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
> Environment: I'm using Spark 1.5.2 with Hadoop 2.7 and running 
> experiments on a simple standalone cluster:
> 1 master
> 2 workers
> All ubuntu 14.04 with Java 8/Scala 2.10
>Reporter: Allen George
>
> I'm having a problem where respawning a failed executor during a job that 
> reads/writes parquet on S3 causes subsequent tasks to fail because of missing 
> AWS keys.
> h4. Setup:
> I'm using Spark 1.5.2 with Hadoop 2.7 and running experiments on a simple 
> standalone cluster:
> 1 master
> 2 workers
> My application is co-located on the master machine, while the two workers are 
> on two other machines (one worker per machine). All machines are running in 
> EC2. I've configured my setup so that my application executes its task on two 
> executors (one executor per worker).
> h4. Application:
> My application reads and writes parquet files on S3. I set the AWS keys on 
> the SparkContext by doing:
> val sc = new SparkContext()
> val hadoopConf = sc.hadoopConfiguration
> hadoopConf.set("fs.s3n.awsAccessKeyId", "SOME_KEY")
> hadoopConf.set("fs.s3n.awsSecretAccessKey", "SOME_SECRET")
> At this point I'm done, and I go ahead and use "sc".
> h4. Issue:
> I can read and write parquet files without a problem with this setup. *BUT* 
> if an executor dies during a job and is respawned by a worker, tasks fail 
> with the following error:
> "Caused by: java.lang.IllegalArgumentException: AWS Access Key ID and Secret 
> Access Key must be specified as the username or password (respectively) of a 
> s3n URL, or by setting the {{fs.s3n.awsAccessKeyId}} or 
> {{fs.s3n.awsSecretAccessKey}} properties (respectively)."
> h4. Basic analysis
> I think I've traced this down to the following:
> SparkHadoopUtil is initialized with an empty {{SparkConf}}. Later, classes 
> like {{DataSourceStrategy}} simply call {{SparkHadoopUtil.get.conf}} and 
> access the (now invalid; missing various properties) {{HadoopConfiguration}} 
> that's built from this empty {{SparkConf}} object. It's unclear to me why 
> this is done, and it seems that the code as written would cause broken 
> results anytime callers use {{SparkHadoopUtil.get.conf}} directly.






[jira] [Updated] (SPARK-14021) Support custom context derived from HiveContext for SparkSQLEnv

2016-03-19 Thread Adrian Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Wang updated SPARK-14021:

Description: This is to create a custom context for the commands bin/spark-sql 
and sbin/start-thriftserver. Any context that is derived from HiveContext is 
acceptable. Users need to configure the class name of the custom context in the 
config spark.sql.context.class, and make sure the class is on the classpath. This 
is to provide a more elegant way for infrastructure teams to apply custom 
configurations and changes.
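
A sketch of the kind of context subclass this proposal would enable (the class name and settings below are hypothetical; only the constructor shape follows HiveContext's public API):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

class InfraHiveContext(sc: SparkContext) extends HiveContext(sc) {
  // Infrastructure-wide defaults applied once at startup.
  setConf("spark.sql.shuffle.partitions", "400")
}
// Selected via the proposed config, e.g. --conf spark.sql.context.class=com.example.InfraHiveContext
{code}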

> Support custom context derived from HiveContext for SparkSQLEnv
> ---
>
> Key: SPARK-14021
> URL: https://issues.apache.org/jira/browse/SPARK-14021
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Adrian Wang
>
> This is to create a custom context for the commands bin/spark-sql and 
> sbin/start-thriftserver. Any context that is derived from HiveContext is 
> acceptable. Users need to configure the class name of the custom context in the 
> config spark.sql.context.class, and make sure the class is on the classpath. 
> This is to provide a more elegant way for infrastructure teams to apply custom 
> configurations and changes.






[jira] [Resolved] (SPARK-13948) MiMa Check should catch if the visibility change to `private`

2016-03-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13948.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

>   MiMa Check should catch if the visibility change to `private`
> --
>
> Key: SPARK-13948
> URL: https://issues.apache.org/jira/browse/SPARK-13948
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Dongjoon Hyun
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 2.0.0
>
>
> `GenerateMIMAIgnore.scala` builds `.generated-mima-class-excludes` from the 
> `private` classes in the current code. As a result, it misses the case where 
> visibility goes from public to private. 






[jira] [Assigned] (SPARK-14021) Support custom context derived from HiveContext for SparkSQLEnv

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14021:


Assignee: Apache Spark

> Support custom context derived from HiveContext for SparkSQLEnv
> ---
>
> Key: SPARK-14021
> URL: https://issues.apache.org/jira/browse/SPARK-14021
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Adrian Wang
>Assignee: Apache Spark
>







[jira] [Commented] (SPARK-14005) Make RDD more compatible with Scala's collection

2016-03-19 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202658#comment-15202658
 ] 

zhengruifeng commented on SPARK-14005:
--

I think ease of implementation should not be the reason to ignore convenience.

> Make RDD more compatible with Scala's collection 
> -
>
> Key: SPARK-14005
> URL: https://issues.apache.org/jira/browse/SPARK-14005
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Reporter: zhengruifeng
>Priority: Trivial
>
> How about implementing some more methods for RDD to make it more compatible 
> with Scala's collection?
> Such as:
> nonEmpty, slice, takeRight, contains, last, reverse
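
For illustration, a few of the requested helpers can already be expressed with existing RDD operations through an enrichment class (a sketch, not a proposed API):

{code}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

implicit class CollectionLikeRDD[T: ClassTag](rdd: RDD[T]) {
  def nonEmpty: Boolean = !rdd.isEmpty()
  def contains(elem: T): Boolean = rdd.filter(_ == elem).count() > 0
  // last: the element with the highest index; note this is still a distributed job.
  def last: T = rdd.zipWithIndex().max()(Ordering.by[(T, Long), Long](_._2))._1
}
{code}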






[jira] [Commented] (SPARK-3249) Fix links in ScalaDoc that cause warning messages in `sbt/sbt unidoc`

2016-03-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201270#comment-15201270
 ] 

Sean Owen commented on SPARK-3249:
--

It's definitely still an issue. I remember trying to fix this and found it had 
something to do with unidoc not being able to handle cross-module links. But if 
you can fix any of the warnings, or conclude that the rest can't be fixed, yes, 
please do so.

> Fix links in ScalaDoc that cause warning messages in `sbt/sbt unidoc`
> -
>
> Key: SPARK-3249
> URL: https://issues.apache.org/jira/browse/SPARK-3249
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Affects Versions: 1.1.0
>Reporter: Xiangrui Meng
>Priority: Minor
>
> If there are multiple overloaded versions of a method, we should make the 
> links more specific. Otherwise, `sbt/sbt unidoc` generates warning messages 
> like the following:
> {code}
> [warn] 
> mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala:305: The 
> link target "org.apache.spark.mllib.tree.DecisionTree$#trainClassifier" is 
> ambiguous. Several members fit the target:
> [warn] (input: 
> org.apache.spark.api.java.JavaRDD[org.apache.spark.mllib.regression.LabeledPoint],numClassesForClassification:
>  Int,categoricalFeaturesInfo: java.util.Map[Integer,Integer],impurity: 
> String,maxDepth: Int,maxBins: Int): 
> org.apache.spark.mllib.tree.model.DecisionTreeModel in object DecisionTree 
> [chosen]
> [warn] (input: 
> org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint],numClassesForClassification:
>  Int,categoricalFeaturesInfo: Map[Int,Int],impurity: String,maxDepth: 
> Int,maxBins: Int): org.apache.spark.mllib.tree.model.DecisionTreeModel in 
> object DecisionTree
> {code}






[jira] [Assigned] (SPARK-14000) case class with a tuple field can't work in Dataset

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14000:


Assignee: (was: Apache Spark)

> case class with a tuple field can't work in Dataset
> ---
>
> Key: SPARK-14000
> URL: https://issues.apache.org/jira/browse/SPARK-14000
> Project: Spark
>  Issue Type: Bug
>Reporter: Wenchen Fan
>
> For example, with `case class TupleClass(data: (Int, String))` we can create an 
> encoder for it, but when we create a Dataset with it, we fail while validating 
> the encoder.
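
A minimal reproduction sketch based on the description (it assumes a SQLContext named `sqlContext` with its implicits imported):

{code}
import sqlContext.implicits._

case class TupleClass(data: (Int, String))

// Deriving the encoder alone works; creating the Dataset is where the
// reported validation failure occurs.
val ds = sqlContext.createDataset(Seq(TupleClass((1, "a"))))
{code}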






[jira] [Commented] (SPARK-13951) PySpark ml.pipeline support export/import - nested Piplines

2016-03-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202319#comment-15202319
 ] 

Apache Spark commented on SPARK-13951:
--

User 'yinxusen' has created a pull request for this issue:
https://github.com/apache/spark/pull/11835

> PySpark ml.pipeline support export/import - nested Piplines
> ---
>
> Key: SPARK-13951
> URL: https://issues.apache.org/jira/browse/SPARK-13951
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>







[jira] [Assigned] (SPARK-13970) Add Non-Negative Matrix Factorization to MLlib

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13970:


Assignee: (was: Apache Spark)

> Add Non-Negative Matrix Factorization to MLlib
> --
>
> Key: SPARK-13970
> URL: https://issues.apache.org/jira/browse/SPARK-13970
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: zhengruifeng
>Priority: Minor
>
> NMF is to find two non-negative matrices (W, H) whose product W * H.T 
> approximates the non-negative matrix X. This factorization can be used for 
> example for dimensionality reduction, source separation or topic extraction.
> NMF was implemented in several packages:
> Scikit-Learn 
> (http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html#sklearn.decomposition.NMF)
> R-NMF (https://cran.r-project.org/web/packages/NMF/index.html)
> LibNMF (http://www.univie.ac.at/rlcta/software/)
> I have implemented in MLlib according to the following papers:
> Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data 
> Analysis on MapReduce (http://research.microsoft.com/pubs/119077/DNMF.pdf)
> Algorithms for Non-negative Matrix Factorization 
> (http://papers.nips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization.pdf)
> It can be used like this:
> val m = 4
> val n = 3
> val data = sc.parallelize(Seq(
>   (0L, Vectors.dense(0.0, 1.0, 2.0)),
>   (1L, Vectors.dense(3.0, 4.0, 5.0)),
>   (3L, Vectors.dense(9.0, 0.0, 1.0))
> ).map(x => IndexedRow(x._1, x._2)))
> val A = new IndexedRowMatrix(data).toCoordinateMatrix()
> val k = 2
> // run the nmf algo
> val r = NMF.solve(A, k, 10)
> val rW = r.W.toBlockMatrix().toLocalMatrix().asInstanceOf[DenseMatrix]
> >>> org.apache.spark.mllib.linalg.DenseMatrix =
> 1.1349295096806706  1.4423101890626953E-5
> 3.453054133110303   0.46312492493865615
> 0.0 0.0
> 0.3133764134585149  2.70684017255672
> val rH = r.H.toBlockMatrix().toLocalMatrix().asInstanceOf[DenseMatrix]
> >>> org.apache.spark.mllib.linalg.DenseMatrix =
> 0.4184163313845057  3.2719352525149286
> 1.12188012613645    0.002939823716977737
> 1.456499371939653   0.18992996116069297
> val R = rW.multiply(rH.transpose)
> >>> org.apache.spark.mllib.linalg.DenseMatrix =
> 0.4749202332761286  1.273254903877907    1.6530268574248572
> 2.9601290106732367  3.8752743120480346   5.117332475154927
> 0.0 0.0  0.0
> 8.987727592773672   0.35952840319637736  0.9705425982249293
> val AD = A.toBlockMatrix().toLocalMatrix()
> >>> org.apache.spark.mllib.linalg.Matrix =
> 0.0  1.0  2.0
> 3.0  4.0  5.0
> 0.0  0.0  0.0
> 9.0  0.0  1.0
> var loss = 0.0
> for(i <- 0 until AD.numRows; j <- 0 until AD.numCols) {
>val diff = AD(i, j) - R(i, j)
>loss += diff * diff
> }
> loss
> >>> Double = 0.5817999580912183






[jira] [Updated] (SPARK-13948) MiMa Check should catch if the visibility change to `private`

2016-03-19 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-13948:
---
Assignee: (was: Josh Rosen)

>   MiMa Check should catch if the visibility change to `private`
> --
>
> Key: SPARK-13948
> URL: https://issues.apache.org/jira/browse/SPARK-13948
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Dongjoon Hyun
>Priority: Critical
>
> `GenerateMIMAIgnore.scala` builds `.generated-mima-class-excludes` from the 
> `private` classes in the current code. As a result, it misses the case where 
> visibility goes from public to private. 






[jira] [Comment Edited] (SPARK-13986) Make `DeveloperApi`-annotated things public

2016-03-19 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200559#comment-15200559
 ] 

Dongjoon Hyun edited comment on SPARK-13986 at 3/17/16 10:53 PM:
-

Oh, that's a great idea. For this, I found them by just a regular-expression 
search in IntelliJ.

For style checking, we need to add a `ScalaStyle` rule and a Maven CheckStyle 
rule, since I've seen some `@DeveloperApi` in Java code, too.

By the way, Maven CheckStyle checking is not run automatically as of today.


was (Author: dongjoon):
Oh, that's a great idea. For this, I found just regular-expression search in 
IntelliJ.

For style checking, we need to add a `ScalaStyle` rule and Maven CheckStyle 
rule since I've seen some `@DeveloperApi` in Java code, too.

By the way, Maven CheckStyle checking is not ran automatically as of today.

> Make `DeveloperApi`-annotated things public
> ---
>
> Key: SPARK-13986
> URL: https://issues.apache.org/jira/browse/SPARK-13986
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Spark Core
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Spark uses `@DeveloperApi` annotation, but sometimes it seems to conflict 
> with its visibility. This issue proposes to fix those conflict. The following 
> is the example.
> {code:title=JobResult.scala|borderStyle=solid}
> @DeveloperApi
> sealed trait JobResult
> @DeveloperApi
> case object JobSucceeded extends JobResult
> @DeveloperApi
> -private[spark] case class JobFailed(exception: Exception) extends JobResult
> +case class JobFailed(exception: Exception) extends JobResult
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14021) Support custom context derived from HiveContext for SparkSQLEnv

2016-03-19 Thread Adrian Wang (JIRA)
Adrian Wang created SPARK-14021:
---

 Summary: Support custom context derived from HiveContext for 
SparkSQLEnv
 Key: SPARK-14021
 URL: https://issues.apache.org/jira/browse/SPARK-14021
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Adrian Wang






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13838) Clear variable code to prevent it to be re-evaluated in BoundAttribute

2016-03-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13838.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11674
[https://github.com/apache/spark/pull/11674]

> Clear variable code to prevent it to be re-evaluated in BoundAttribute
> --
>
> Key: SPARK-13838
> URL: https://issues.apache.org/jira/browse/SPARK-13838
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Priority: Minor
> Fix For: 2.0.0
>
>
> We should also clear the variable code in BoundReference.genCode to prevent 
> it to be evaluated twice, as we did in evaluateVariables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13969) Extend input format that feature hashing can handle

2016-03-19 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201257#comment-15201257
 ] 

Nick Pentreath commented on SPARK-13969:


What I have in mind is something like the following:

{code}
// class FeatureHasher extends Transformer ...

val df = sqlContext.createDataFrame(Seq(
  (3.5, "foo", Seq("woo", "woo")),
  (5.3, "bar", Seq("baz", "baz"))
)).toDF("real", "categorical", "raw_text")
df.show
// +----+-----------+----------+
// |real|categorical|  raw_text|
// +----+-----------+----------+
// | 3.5|        foo|[woo, woo]|
// | 5.3|        bar|[baz, baz]|
// +----+-----------+----------+

val hasher = new FeatureHasher()
val result = hasher.transform(df)
result.show(false)
// +--------------------------+
// |features                  |
// +--------------------------+
// |(10,[3,5,6],[1.0,2.0,3.5])|
// |(10,[1,6,9],[1.0,5.3,2.0])|
// +--------------------------+

// numerical columns are handled by hashing the column name to get the vector index,
// and using the value as the feature value
// string columns are handled by treating the features as categorical, hashing
// the feature value (or "column_name=feature_value") to get the index and setting
// value = 1.0
// string sequence columns are handled in the same way as HashingTF currently,
// i.e. same as for categorical but allowing for counts
{code}
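
For illustration, a rough sketch of how the (index, value) pairs above could be 
derived. This is not the proposed implementation; `numFeatures` and the use of a 
plain hashCode are assumptions made only for this example:

{code}
// Rough sketch only (not the proposed FeatureHasher): derive an (index, value)
// pair following the rules described above. `numFeatures` is the assumed output
// vector size; a real feature hasher would use a proper hash function.
def nonNegativeMod(x: Int, mod: Int): Int = {
  val r = x % mod
  if (r < 0) r + mod else r
}

def hashedFeature(colName: String, value: Any, numFeatures: Int): (Int, Double) = value match {
  // numeric column: index from the column name, the value itself as the feature value
  case d: Double => (nonNegativeMod(colName.##, numFeatures), d)
  // string column: treated as categorical, index from "column=value", value 1.0
  case s: String => (nonNegativeMod(s"$colName=$s".##, numFeatures), 1.0)
  // string sequences would be handled like HashingTF, accumulating counts per index
}
{code}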

> Extend input format that feature hashing can handle
> ---
>
> Key: SPARK-13969
> URL: https://issues.apache.org/jira/browse/SPARK-13969
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Nick Pentreath
>Priority: Minor
>
> Currently {{HashingTF}} works like {{CountVectorizer}} (the equivalent in 
> scikit-learn is {{HashingVectorizer}}). That is, it works on a sequence of 
> strings and computes term frequencies.
> The use cases for feature hashing extend to arbitrary feature values (binary, 
> count or real-valued). For example, scikit-learn's {{FeatureHasher}} can 
> accept a sequence of (feature_name, value) pairs (e.g. a map, list). In this 
> way, feature hashing can operate as both "one-hot encoder" and "vector 
> assembler" at the same time.
> Investigate adding a more generic feature hasher (that in turn can be used by 
> {{HashingTF}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12719) SQL generation support for generators (including UDTF)

2016-03-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197975#comment-15197975
 ] 

Apache Spark commented on SPARK-12719:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/11768

> SQL generation support for generators (including UDTF)
> --
>
> Key: SPARK-12719
> URL: https://issues.apache.org/jira/browse/SPARK-12719
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. 
> Please refer to SPARK-11012 for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13863) TPCDS query 66 returns wrong results compared to TPC official result set

2016-03-19 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200149#comment-15200149
 ] 

JESSE CHEN commented on SPARK-13863:


Going to validate this also on my cluster. Nice find.

> TPCDS query 66 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13863
> URL: https://issues.apache.org/jira/browse/SPARK-13863
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 66 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> Aggregations are slightly off -- e.g., for the JAN_SALES column of the "Doors canno" row, 
> SparkSQL returns 6355232.185385704, expected 6355232.31
> Actual results:
> {noformat}
> [null,null,Fairview,Williamson County,TN,United 
> States,DHL,BARIAN,2001,9597806.850651741,1.1121820530080795E7,8670867.81564045,8994785.945689201,1.088724806326294E7,1.4187671518377304E7,9732598.460139751,1.9798897020946026E7,2.1007842467959404E7,2.149551364927292E7,3.479566905774999E7,3.3122997954660416E7,null,null,null,null,null,null,null,null,null,null,null,null,2.191359469742E7,3.2518476414670944E7,2.48856624883976E7,2.5698343830046654E7,3.373591080598068E7,3.552703167087555E7,2.5465193481492043E7,5.362323870799959E7,5.1409986978201866E7,5.415917383586836E7,9.222704311805725E7,8.343539111531019E7]
> [Bad cards must make.,621234,Fairview,Williamson County,TN,United 
> States,DHL,BARIAN,2001,9506753.593884468,8008140.429557085,6116769.711647987,1.1973045160133362E7,7756254.925520897,5352978.574095726,1.373399613500309E7,1.6418794411203384E7,1.7212743279764652E7,1.704270732417488E7,3.43049358570323E7,3.532416421229005E7,15.30301560102066,12.890698882477594,9.846160563729589,19.273003667109915,12.485238936569628,8.61668642427125,22.107605403121994,26.429323590150222,27.707342611261865,27.433635834765774,55.22063482847413,56.86128610521969,3.0534943928382874E7,2.4481686250203133E7,2.217871080008793E7,2.569579825610423E7,2.995490355044937E7,1.8084140250833035E7,3.0805576178061485E7,4.7156887432252884E7,5.115858869637826E7,5.5759943171424866E7,8.625354428184557E7,8.345155532035494E7]
> [Conventional childr,977787,Fairview,Williamson County,TN,United 
> States,DHL,BARIAN,2001,8860645.460736752,1.441581376543355E7,6761497.232810497,1.1820654735879421E7,8246260.600341797,6636877.482845306,1.1434492123092413E7,2.5673812070380323E7,2.307420611785E7,2.1834582007320404E7,2.6894900596512794E7,3.357509177109933E7,9.061938296108202,14.743306840276613,6.9151024024767125,12.08919195681618,8.43359606984118,6.787651587559771,11.694256645969329,26.257060147435304,23.598398219562938,22.330611889215547,27.505888906799534,34.337838170377935,2.3836085704864502E7,3.20733132298584E7,2.503790437837982E7,2.2659895963564873E7,2.175740087420273E7,2.4451608012176514E7,2.1933001734852314E7,5.59967034604629E7,5.737188052299309E7,6.208721474336243E7,8.284991027382469E7,8.897031933202875E7]
> [Doors canno,294242,Fairview,Williamson County,TN,United 
> States,DHL,BARIAN,2001,6355232.185385704,1.0198920296742141E7,1.0246200903741479E7,1.2209716492156029E7,8566998.262890816,8806316.75278151,9789405.6993227,1.646658496404171E7,2.6443785668474197E7,2.701604788320923E7,3.366058958298761E7,2.7462468750599384E7,21.59865751791282,34.66167405313361,34.822360178837414,41.495491779406166,29.115484067165177,29.928823053070296,33.26991285854059,55.96272783641258,89.87087386734116,91.81574310672585,114.39763726112386,93.33293258813964,2.2645142994330406E7,2.448725452685547E7,2.4925759290207863E7,3.0503655031727314E7,2.6558160276379585E7,2.0976233452690125E7,2.9895796101181984E7,5.600219855566597E7,5.348815865275085E7,7.628723580410767E7,8.248374754962921E7,8.808826726185608E7]
> [Important issues liv,138504,Fairview,Williamson County,TN,United 
> States,DHL,BARIAN,2001,1.1748784594717264E7,1.435130566355586E7,9896470.867572784,7990874.805492401,8879247.840401173,7362383.04259038,1.0011144724414349E7,1.7741201390372872E7,2.1346976135887742E7,1.8074978020030975E7,2.967512567988676E7,3.2545325348875403E7,84.8263197793368,103.6165429414014,71.45259969078715,57.694180713137534,64.10824120892663,53.156465102743454,72.28054586448297,128.09161750110374,154.12534032149065,130.5014874662896,214.25464737398747,234.97751219369408,2.7204167203903973E7,2.598037822457385E7,1.9943398915802002E7,2.5710421112384796E7,1.948448105346489E7,2.6346611484448195E7,2.5075158296625137E7,5.409477817043829E7,4.106673223178029E7,5.454705814340496E7,7.246596285337901E7,9.277032812079096E7]
> {noformat}
> Expected results:
> {noformat}
> 

[jira] [Commented] (SPARK-13958) Executor OOM due to unbounded growth of pointer array in Sorter

2016-03-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200360#comment-15200360
 ] 

Apache Spark commented on SPARK-13958:
--

User 'sitalkedia' has created a pull request for this issue:
https://github.com/apache/spark/pull/11794

> Executor OOM due to unbounded growth of pointer array in Sorter
> ---
>
> Key: SPARK-13958
> URL: https://issues.apache.org/jira/browse/SPARK-13958
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.6.1
>Reporter: Sital Kedia
>
> While running a job we saw that the executors are OOMing because in 
> UnsafeExternalSorter's growPointerArrayIfNecessary function, we are just 
> growing the pointer array indefinitely. 
> https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L292
> This is a regression introduced in PR- 
> https://github.com/apache/spark/pull/11095
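
For context, a simplified sketch of the kind of guard that avoids the unbounded 
growth; the names (`acquireArray`, `spill`) are placeholders and this is not the 
actual UnsafeExternalSorter API:

{code}
// Illustrative only: grow the pointer array only if memory can actually be
// acquired; otherwise spill the in-memory records instead of growing forever.
def growPointerArrayIfNecessary(used: Long, capacity: Long,
                                acquireArray: Long => Option[Array[Long]],
                                spill: () => Unit): Unit = {
  if (used >= capacity) {
    acquireArray(capacity * 2) match {
      case Some(_) => // copy the existing pointers into the larger array (omitted)
      case None    => spill() // fall back to spilling rather than growing without bound
    }
  }
}
{code}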



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13987) Build fails due to scala version mismatch between

2016-03-19 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200539#comment-15200539
 ] 

Jean-Baptiste Onofré commented on SPARK-13987:
--

Nevermind, it works fine following the README. Sorry about the noise.

> Build fails due to scala version mismatch between 
> --
>
> Key: SPARK-13987
> URL: https://issues.apache.org/jira/browse/SPARK-13987
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Jean-Baptiste Onofré
>
> Build fails on master, due to test fails in launcher:
> {code}
> Tests run: 8, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.046 sec <<< 
> FAILURE! - in org.apache.spark.launcher.SparkSubmitCommandBuilderSuite
> testExamplesRunner(org.apache.spark.launcher.SparkSubmitCommandBuilderSuite)  
> Time elapsed: 0.01 sec  <<< ERROR!
> java.lang.IllegalStateException: Examples jars directory 
> '/home/jbonofre/Workspace/spark/examples/target/scala-2.11/jars' does not 
> exist.
> at 
> org.apache.spark.launcher.SparkSubmitCommandBuilderSuite.buildCommand(SparkSubmitCommandBuilderSuite.java:307)
> at 
> org.apache.spark.launcher.SparkSubmitCommandBuilderSuite.testExamplesRunner(SparkSubmitCommandBuilderSuite.java:164)
> {code}
> The reason is that the scala version by default in example is still 2.10, so 
> the file names don't match:
> {code}
> spark/examples/target$ ls -l|grep -i 2.10
> drwxrwxr-x 4 jbonofre jbonofre    4096 Feb  3 16:39 scala-2.10
> -rw-rw-r-- 1 jbonofre jbonofre 1899057 Feb  3 16:39 
> spark-examples_2.10-1.6.0-SNAPSHOT.jar
> -rw-rw-r-- 1 jbonofre jbonofre 1320517 Feb  3 16:40 
> spark-examples_2.10-1.6.0-SNAPSHOT-javadoc.jar
> -rw-rw-r-- 1 jbonofre jbonofre  390527 Feb  3 16:40 
> spark-examples_2.10-1.6.0-SNAPSHOT-sources.jar
> -rw-rw-r-- 1 jbonofre jbonofre   12333 Feb  3 16:39 
> spark-examples_2.10-1.6.0-SNAPSHOT-tests.jar
> -rw-rw-r-- 1 jbonofre jbonofre    8875 Feb  3 16:40 
> spark-examples_2.10-1.6.0-SNAPSHOT-test-sources.jar
> {code}
> I will submit a PR fixing that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13963) Add binary toggle Param to ml.HashingTF

2016-03-19 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-13963:
--

 Summary: Add binary toggle Param to ml.HashingTF
 Key: SPARK-13963
 URL: https://issues.apache.org/jira/browse/SPARK-13963
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Nick Pentreath
Priority: Trivial






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13965) Driver should kill the other running task attempts if any one task attempt succeeds for the same task

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13965:


Assignee: Apache Spark

> Driver should kill the other running task attempts if any one task attempt 
> succeeds for the same task
> -
>
> Key: SPARK-13965
> URL: https://issues.apache.org/jira/browse/SPARK-13965
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Devaraj K
>Assignee: Apache Spark
>
> When we enable speculation, the driver launches additional attempts for the 
> same task if it finds that an attempt is progressing slowly compared to the 
> other tasks' average progress, so there can be multiple task attempts in the 
> running state.
> At present, if any one attempt succeeds, the others keep running (possibly 
> until job completion) and their slots cannot be given to other tasks in the 
> same stage or in the next stages. 
> We can kill these running task attempts as soon as any attempt succeeds 
> and give the slots to other tasks.
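
As an illustration of the proposal (hypothetical names, not the actual scheduler 
code):

{code}
// Hypothetical sketch: once one attempt of a task succeeds, kill the remaining
// speculative attempts so their slots can be reused by other tasks.
def onTaskSuccess(taskIndex: Int,
                  successfulAttempt: Long,
                  runningAttempts: Map[Int, Set[Long]],
                  kill: Long => Unit): Unit = {
  runningAttempts.getOrElse(taskIndex, Set.empty[Long])
    .filter(_ != successfulAttempt)
    .foreach(kill)
}
{code}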



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13942) Remove Shark-related docs for 2.x

2016-03-19 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-13942:
--
Description: 
`Shark` was merged into `Spark SQL` in [July 
2014|https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html].
 The following seems to be the only remaining legacy. For Spark 2.x, we had better clean 
up those docs.

*Migration Guide*
{code:title=sql-programming-guide.md|borderStyle=solid}
- ## Migration Guide for Shark Users
- ...
- ### Scheduling
- ...
- ### Reducer number
- ...
- ### Caching
{code}


  was:
`Shark` was merged into `Spark SQL` since [July 
2014|https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html].
 

The followings seem to be the only legacy.

*Migration Guide*
{code:title=sql-programming-guide.md|borderStyle=solid}
- ## Migration Guide for Shark Users
- ...
- ### Scheduling
- ...
- ### Reducer number
- ...
- ### Caching
{code}

*SparkEnv visibility and comments*
{code:title=SparkEnv.scala|borderStyle=solid}
- *
- * NOTE: This is not intended for external use. This is exposed for Shark and 
may be made private
- *   in a future release.
  */
 @DeveloperApi
-class SparkEnv (
+private[spark] class SparkEnv (
{code}

For Spark 2.x, we had better clean up those docs and comments in any way. 
However, the visibility of `SparkEnv` class might be controversial. 

At the first attempt, this issue proposes to change both stuff according to the 
note(`This is exposed for Shark`). During review process, the change on 
visibility might be removed.

Component/s: (was: Spark Core)

> Remove Shark-related docs for 2.x
> -
>
> Key: SPARK-13942
> URL: https://issues.apache.org/jira/browse/SPARK-13942
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> `Shark` was merged into `Spark SQL` in [July 
> 2014|https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html].
>  The following seems to be the only remaining legacy. For Spark 2.x, we had better 
> clean up those docs.
> *Migration Guide*
> {code:title=sql-programming-guide.md|borderStyle=solid}
> - ## Migration Guide for Shark Users
> - ...
> - ### Scheduling
> - ...
> - ### Reducer number
> - ...
> - ### Caching
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13862) TPCDS query 49 returns wrong results compared to TPC official result set

2016-03-19 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15198595#comment-15198595
 ] 

Xiao Li commented on SPARK-13862:
-

Nope. The PR is still open. https://github.com/apache/spark/pull/10731

Not sure if this will be merged to 2.0. [~rxin]

> TPCDS query 49 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13862
> URL: https://issues.apache.org/jira/browse/SPARK-13862
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 49 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL has right answer but in wrong order (and there is an 'order by' in 
> the query).
> Actual results:
> {noformat}
> [store,9797,0.8000,2,2]
> [store,12641,0.81609195402298850575,3,3]
> [store,6661,0.92207792207792207792,7,7]
> [store,13013,0.94202898550724637681,8,8]
> [store,9029,1.,10,10]
> [web,15597,0.66197183098591549296,3,3]
> [store,14925,0.96470588235294117647,9,9]
> [store,4063,1.,10,10]
> [catalog,8929,0.7625,7,7]
> [store,11589,0.82653061224489795918,6,6]
> [store,1171,0.82417582417582417582,5,5]
> [store,9471,0.7750,1,1]
> [catalog,12577,0.65591397849462365591,3,3]
> [web,97,0.90361445783132530120,9,8]
> [web,85,0.85714285714285714286,8,7]
> [catalog,361,0.74647887323943661972,5,5]
> [web,2915,0.69863013698630136986,4,4]
> [web,117,0.9250,10,9]
> [catalog,9295,0.77894736842105263158,9,9]
> [web,3305,0.7375,6,16]
> [catalog,16215,0.79069767441860465116,10,10]
> [web,7539,0.5900,1,1]
> [catalog,17543,0.57142857142857142857,1,1]
> [catalog,3411,0.71641791044776119403,4,4]
> [web,11933,0.71717171717171717172,5,5]
> [catalog,14513,0.63541667,2,2]
> [store,15839,0.81632653061224489796,4,4]
> [web,3337,0.62650602409638554217,2,2]
> [web,5299,0.92708333,11,10]
> [catalog,8189,0.74698795180722891566,6,6]
> [catalog,14869,0.77173913043478260870,8,8]
> [web,483,0.8000,7,6]
> {noformat}
> Expected results:
> {noformat}
> +---------+-------+--------------------+-------------+---------------+
> | CHANNEL |  ITEM |   RETURN_RATIO | RETURN_RANK | CURRENCY_RANK |
> +---------+-------+--------------------+-------------+---------------+
> | catalog | 17543 |  .5714285714285714 |   1 | 1 |
> | catalog | 14513 |  .63541666 |   2 | 2 |
> | catalog | 12577 |  .6559139784946236 |   3 | 3 |
> | catalog |  3411 |  .7164179104477611 |   4 | 4 |
> | catalog |   361 |  .7464788732394366 |   5 | 5 |
> | catalog |  8189 |  .7469879518072289 |   6 | 6 |
> | catalog |  8929 |  .7625 |   7 | 7 |
> | catalog | 14869 |  .7717391304347826 |   8 | 8 |
> | catalog |  9295 |  .7789473684210526 |   9 | 9 |
> | catalog | 16215 |  .7906976744186046 |  10 |10 |
> | store   |  9471 |  .7750 |   1 | 1 |
> | store   |  9797 |  .8000 |   2 | 2 |
> | store   | 12641 |  .8160919540229885 |   3 | 3 |
> | store   | 15839 |  .8163265306122448 |   4 | 4 |
> | store   |  1171 |  .8241758241758241 |   5 | 5 |
> | store   | 11589 |  .8265306122448979 |   6 | 6 |
> | store   |  6661 |  .9220779220779220 |   7 | 7 |
> | store   | 13013 |  .9420289855072463 |   8 | 8 |
> | store   | 14925 |  .9647058823529411 |   9 | 9 |
> | store   |  4063 | 1. |  10 |10 |
> | store   |  9029 | 1. |  10 |10 |
> | web |  7539 |  .5900 |   1 | 1 |
> | web |  3337 |  .6265060240963855 |   2 | 2 |
> | web | 15597 |  .6619718309859154 |   3 | 3 |
> | web |  2915 |  .6986301369863013 |   4 | 4 |
> | web | 11933 |  .7171717171717171 |   5 | 5 |
> | web |  3305 |  .7375 |   6 |16 |
> | web |   483 |  .8000 |   7 | 6 |
> | web |85 |  .8571428571428571 |   8 | 7 |
> | web |97 |  .9036144578313253 |   9 | 8 |
> | web |   117 |  .9250 |  10 | 9 |
> | web |  5299 |  .92708333 |   

[jira] [Resolved] (SPARK-12719) SQL generation support for generators (including UDTF)

2016-03-19 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-12719.
-
Resolution: Fixed

> SQL generation support for generators (including UDTF)
> --
>
> Key: SPARK-12719
> URL: https://issues.apache.org/jira/browse/SPARK-12719
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. 
> Please refer to SPARK-11012 for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13984) Schema verification always fail when using remote Hive metastore

2016-03-19 Thread Rekha Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202191#comment-15202191
 ] 

Rekha Joshi commented on SPARK-13984:
-

Hi [~Jianfeng Hu], implicitly modifying the schema is disabled by default in Hive. As 
you have set schema.verification to true, it first checks whether the Hive binaries are 
compatible with the metastore schema version. If not, it raises this error. You can 
set hive.metastore.schema.verification to false to suppress this check and 
let the metastore implicitly modify the schema. Anyhow, this is related to Hive itself.

It is not a Spark issue. If you and [~srowen] agree, this can be closed. Thanks.

> Schema verification always fail when using remote Hive metastore
> 
>
> Key: SPARK-13984
> URL: https://issues.apache.org/jira/browse/SPARK-13984
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Jianfeng Hu
>
> Launch a Hive metastore Thrift server, and then in hive-site.xml:
> - set hive.metastore.uris
> - set hive.metastore.schema.verification to true
> Run spark-shell, it will fail with:
> java.lang.RuntimeException: Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
> Caused by: MetaException(message:Version information not found in metastore.
> It might be using the wrong metastore (could be possibly the local derby 
> metastore) when doing the verification? Not exactly sure on this but maybe 
> could someone more familiar with the HiveContext code help investigating?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13977) Bring back ShuffledHashJoin

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13977:


Assignee: Davies Liu  (was: Apache Spark)

> Bring back ShuffledHashJoin
> ---
>
> Key: SPARK-13977
> URL: https://issues.apache.org/jira/browse/SPARK-13977
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> ShuffledHashJoin is still useful when:
> 1) any partition of the build side can fit in memory
> 2) the build side is much smaller than the stream side, so building a hash table 
> on the smaller side should be faster than sorting the bigger side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13934) SqlParser.parseTableIdentifier cannot recognize table name start with scientific notation

2016-03-19 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197285#comment-15197285
 ] 

Herman van Hovell commented on SPARK-13934:
---

We overhauled much of the parser infrastructure (including parsing 
TableIdentifiers). Could you try this on master first?

> SqlParser.parseTableIdentifier cannot recognize table name start with 
> scientific notation
> -
>
> Key: SPARK-13934
> URL: https://issues.apache.org/jira/browse/SPARK-13934
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yang Wang
>
> SqlParser.parseTableIdentifier cannot recognize table names that start with 
> scientific notation, like "1e30abcdedfg".
> This bug can be reproduced by code following:
> val conf = new SparkConf().setAppName(s"test").setMaster("local[2]")
> val sc = new SparkContext(conf)
> val hc = new HiveContext(sc)
> val tableName = "1e34abcd"
> hc.sql("select 123").registerTempTable(tableName)
> hc.dropTempTable(tableName)
> The last line will throw a RuntimeException (java.lang.RuntimeException: 
> [1.1] failure: identifier expected).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13951) PySpark ml.pipeline support export/import - nested Pipelines

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13951:


Assignee: (was: Apache Spark)

> PySpark ml.pipeline support export/import - nested Pipelines
> ---
>
> Key: SPARK-13951
> URL: https://issues.apache.org/jira/browse/SPARK-13951
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-14006) Builds of 1.6 branch fail R style check

2016-03-19 Thread Rekha Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rekha Joshi updated SPARK-14006:

Comment: was deleted

(was: pushing a pull request in few mins.thanks!)

> Builds of 1.6 branch fail R style check
> ---
>
> Key: SPARK-14006
> URL: https://issues.apache.org/jira/browse/SPARK-14006
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, Tests
>Reporter: Yin Huai
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-1.6-test-sbt-hadoop-2.2/152/console



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14004) AttributeReference and Alias should only use their first qualifier to build SQL representations

2016-03-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201313#comment-15201313
 ] 

Apache Spark commented on SPARK-14004:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/11820

> AttributeReference and Alias should only use their first qualifier to build 
> SQL representations
> ---
>
> Key: SPARK-14004
> URL: https://issues.apache.org/jira/browse/SPARK-14004
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
>
> Current implementation joins all qualifiers, which is wrong.
> However, this doesn't cause any real SQL generation bugs as there is always 
> at most one qualifier for any given {{AttributeReference}} or {{Alias}}.
> We can probably use {{Option\[String\]}} instead of {{Seq\[String\]}} to 
> represent qualifiers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-13987) Build fails due to scala version mismatch between

2016-03-19 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Baptiste Onofré closed SPARK-13987.

Resolution: Not A Problem

> Build fails due to scala version mismatch between 
> --
>
> Key: SPARK-13987
> URL: https://issues.apache.org/jira/browse/SPARK-13987
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Jean-Baptiste Onofré
>
> Build fails on master, due to test fails in launcher:
> {code}
> Tests run: 8, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.046 sec <<< 
> FAILURE! - in org.apache.spark.launcher.SparkSubmitCommandBuilderSuite
> testExamplesRunner(org.apache.spark.launcher.SparkSubmitCommandBuilderSuite)  
> Time elapsed: 0.01 sec  <<< ERROR!
> java.lang.IllegalStateException: Examples jars directory 
> '/home/jbonofre/Workspace/spark/examples/target/scala-2.11/jars' does not 
> exist.
> at 
> org.apache.spark.launcher.SparkSubmitCommandBuilderSuite.buildCommand(SparkSubmitCommandBuilderSuite.java:307)
> at 
> org.apache.spark.launcher.SparkSubmitCommandBuilderSuite.testExamplesRunner(SparkSubmitCommandBuilderSuite.java:164)
> {code}
> The reason is that the scala version by default in example is still 2.10, so 
> the file names don't match:
> {code}
> spark/examples/target$ ls -l|grep -i 2.10
> drwxrwxr-x 4 jbonofre jbonofre    4096 Feb  3 16:39 scala-2.10
> -rw-rw-r-- 1 jbonofre jbonofre 1899057 Feb  3 16:39 
> spark-examples_2.10-1.6.0-SNAPSHOT.jar
> -rw-rw-r-- 1 jbonofre jbonofre 1320517 Feb  3 16:40 
> spark-examples_2.10-1.6.0-SNAPSHOT-javadoc.jar
> -rw-rw-r-- 1 jbonofre jbonofre  390527 Feb  3 16:40 
> spark-examples_2.10-1.6.0-SNAPSHOT-sources.jar
> -rw-rw-r-- 1 jbonofre jbonofre   12333 Feb  3 16:39 
> spark-examples_2.10-1.6.0-SNAPSHOT-tests.jar
> -rw-rw-r-- 1 jbonofre jbonofre    8875 Feb  3 16:40 
> spark-examples_2.10-1.6.0-SNAPSHOT-test-sources.jar
> {code}
> I will submit a PR fixing that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12183) Remove spark.mllib tree, forest implementations and use spark.ml

2016-03-19 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-12183:
-

Assignee: Joseph K. Bradley

> Remove spark.mllib tree, forest implementations and use spark.ml
> 
>
> Key: SPARK-12183
> URL: https://issues.apache.org/jira/browse/SPARK-12183
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> This JIRA is for replacing the spark.mllib decision tree and random forest 
> implementations with the one from spark.ml.  The spark.ml one should be used 
> as a wrapper.  This should involve moving the implementation, but should 
> probably not require changing the tests (much).
> This blocks on 1 improvement to spark.mllib which needs to be ported to 
> spark.ml: [SPARK-10064]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13935) Other clients' connection hang up when someone do huge load

2016-03-19 Thread Tao Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Wang updated SPARK-13935:
-
Affects Version/s: 1.6.0
   1.6.1

> Other clients' connection hang up when someone do huge load
> ---
>
> Key: SPARK-13935
> URL: https://issues.apache.org/jira/browse/SPARK-13935
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.2, 1.6.0, 1.6.1
>Reporter: Tao Wang
>Priority: Critical
>
> We run a sql like "insert overwrite table store_returns partition 
> (sr_returned_date) select xx" using beeline, and it blocks other 
> beeline connections while it invokes the Hive method via 
> "ClientWrapper.loadDynamicPartitions".
> The reason is that "withHiveState" locks "clientLoader". Sadly, when a new 
> client comes in, it invokes the "setConf" methods, which are also synchronized on 
> "clientLoader".
> So the problem is that if the first sql takes a very long time to run, then no 
> other client can connect to the thrift server successfully.
> We tested on release 1.5.1; not sure if the latest release has the same issue.
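
A stripped-down illustration of the described contention (simplified names, not 
the real ClientWrapper code): both the long-running load and a new client's 
configuration path synchronize on the same lock, so the second blocks until the 
first finishes.

{code}
// Simplified illustration of the reported contention, not the actual Spark code.
object LockContentionSketch {
  private val clientLoader = new Object  // stands in for the shared class-loader lock

  def withHiveState[A](f: => A): A = clientLoader.synchronized {
    f  // a long-running load (e.g. loadDynamicPartitions) holds the lock for its whole duration
  }

  def setConf(key: String, value: String): Unit = clientLoader.synchronized {
    // a newly connecting client blocks here until the load above releases the lock
  }
}
{code}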



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13972) hive tests should fail if SQL generation failed

2016-03-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201820#comment-15201820
 ] 

Apache Spark commented on SPARK-13972:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/11825

> hive tests should fail if SQL generation failed
> ---
>
> Key: SPARK-13972
> URL: https://issues.apache.org/jira/browse/SPARK-13972
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13986) Remove `DeveloperApi`-annotation for non-publics

2016-03-19 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-13986:
--
Summary: Remove `DeveloperApi`-annotation for non-publics  (was: Make 
`DeveloperApi`-annotated things public)

> Remove `DeveloperApi`-annotation for non-publics
> 
>
> Key: SPARK-13986
> URL: https://issues.apache.org/jira/browse/SPARK-13986
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Spark Core
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Spark uses `@DeveloperApi` annotation, but sometimes it seems to conflict 
> with its visibility. This issue proposes to fix those conflict. The following 
> is the example.
> {code:title=JobResult.scala|borderStyle=solid}
> @DeveloperApi
> sealed trait JobResult
> @DeveloperApi
> case object JobSucceeded extends JobResult
> @DeveloperApi
> -private[spark] case class JobFailed(exception: Exception) extends JobResult
> +case class JobFailed(exception: Exception) extends JobResult
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13982) SparkR - KMeans predict: Output column name of features is an unclear, automatically generated text

2016-03-19 Thread Narine Kokhlikyan (JIRA)
Narine Kokhlikyan created SPARK-13982:
-

 Summary: SparkR - KMeans predict: Output column name of features 
is an unclear, automatically generated text
 Key: SPARK-13982
 URL: https://issues.apache.org/jira/browse/SPARK-13982
 Project: Spark
  Issue Type: Bug
Reporter: Narine Kokhlikyan
Priority: Minor


Currently KMeans predict's features' output column name is set to something like 
this: "vecAssembler_522ba59ea239__output", which is the default output column 
name of the "VectorAssembler".
Example: 
showDF(predict(model, training)) shows something like this:

DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, 
Petal_Width:double,**vecAssembler_522ba59ea239__output:**vector, 
prediction:int]

This name is automatically generated and very unclear from the user's perspective.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13955) Spark in yarn mode fails

2016-03-19 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15199927#comment-15199927
 ] 

Marcelo Vanzin edited comment on SPARK-13955 at 3/17/16 4:59 PM:
-

How did you build Spark? Did you do "mvn package" or "sbt assembly"? I see the 
code is trying to upload "file:/Users/jzhang/github/spark/lib/", do you have a 
file called RELEASE on your repo's root? (That triggers a different code path, 
since that file is only expected to exist in a Spark distribution, not in a 
Spark source repo.)


was (Author: vanzin):
How did you build Spark? Did you do "mvn package" or "sbt assembly"? I see the 
code is trying to uploade "file:/Users/jzhang/github/spark/lib/", do you have a 
file called RELEASE on your repo's root? (That triggers a different code path, 
since that file is only expected to exist in a Spark distribution, not in a 
Spark source repo.)

> Spark in yarn mode fails
> 
>
> Key: SPARK-13955
> URL: https://issues.apache.org/jira/browse/SPARK-13955
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>
> I ran spark-shell in yarn client mode, but from the logs it seems the spark assembly 
> jar is not uploaded to HDFS. This may be a known issue in the process of 
> SPARK-11157; creating this ticket to track it. [~vanzin]
> {noformat}
> 16/03/17 17:57:48 INFO Client: Will allocate AM container, with 896 MB memory 
> including 384 MB overhead
> 16/03/17 17:57:48 INFO Client: Setting up container launch context for our AM
> 16/03/17 17:57:48 INFO Client: Setting up the launch environment for our AM 
> container
> 16/03/17 17:57:48 INFO Client: Preparing resources for our AM container
> 16/03/17 17:57:48 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
> is set, falling back to uploading libraries under SPARK_HOME.
> 16/03/17 17:57:48 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.10.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.10.jar
> 16/03/17 17:57:49 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.11.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.11.jar
> 16/03/17 17:57:49 INFO Client: Uploading resource 
> file:/private/var/folders/dp/hmchg5dd3vbcvds26q91spdwgp/T/spark-abed04bf-6ac2-448b-91a9-dcc1c401a18f/__spark_conf__4163776487351314654.zip
>  -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/__spark_conf__4163776487351314654.zip
> 16/03/17 17:57:49 INFO SecurityManager: Changing view acls to: jzhang
> 16/03/17 17:57:49 INFO SecurityManager: Changing modify acls to: jzhang
> 16/03/17 17:57:49 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(jzhang); users 
> with modify permissions: Set(jzhang)
> 16/03/17 17:57:49 INFO Client: Submitting application 6 to ResourceManager
> {noformat}
> message in AM container
> {noformat}
> Error: Could not find or load main class 
> org.apache.spark.deploy.yarn.ExecutorLauncher
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13955) Spark in yarn mode fails

2016-03-19 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15199927#comment-15199927
 ] 

Marcelo Vanzin commented on SPARK-13955:


How did you build Spark? Did you do "mvn package" or "sbt assembly"? I see the 
code is trying to upload "file:/Users/jzhang/github/spark/lib/", do you have a 
file called RELEASE on your repo's root? (That triggers a different code path, 
since that file is only expected to exist in a Spark distribution, not in a 
Spark source repo.)

> Spark in yarn mode fails
> 
>
> Key: SPARK-13955
> URL: https://issues.apache.org/jira/browse/SPARK-13955
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>
> I ran spark-shell in yarn client mode, but from the logs it seems the spark assembly 
> jar is not uploaded to HDFS. This may be a known issue in the process of 
> SPARK-11157; creating this ticket to track it. [~vanzin]
> {noformat}
> 16/03/17 17:57:48 INFO Client: Will allocate AM container, with 896 MB memory 
> including 384 MB overhead
> 16/03/17 17:57:48 INFO Client: Setting up container launch context for our AM
> 16/03/17 17:57:48 INFO Client: Setting up the launch environment for our AM 
> container
> 16/03/17 17:57:48 INFO Client: Preparing resources for our AM container
> 16/03/17 17:57:48 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
> is set, falling back to uploading libraries under SPARK_HOME.
> 16/03/17 17:57:48 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.10.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.10.jar
> 16/03/17 17:57:49 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.11.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.11.jar
> 16/03/17 17:57:49 INFO Client: Uploading resource 
> file:/private/var/folders/dp/hmchg5dd3vbcvds26q91spdwgp/T/spark-abed04bf-6ac2-448b-91a9-dcc1c401a18f/__spark_conf__4163776487351314654.zip
>  -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/__spark_conf__4163776487351314654.zip
> 16/03/17 17:57:49 INFO SecurityManager: Changing view acls to: jzhang
> 16/03/17 17:57:49 INFO SecurityManager: Changing modify acls to: jzhang
> 16/03/17 17:57:49 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(jzhang); users 
> with modify permissions: Set(jzhang)
> 16/03/17 17:57:49 INFO Client: Submitting application 6 to ResourceManager
> {noformat}
> message in AM container
> {noformat}
> Error: Could not find or load main class 
> org.apache.spark.deploy.yarn.ExecutorLauncher
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13970) Add Non-Negative Matrix Factorization to MLlib

2016-03-19 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-13970:


 Summary: Add Non-Negative Matrix Factorization to MLlib
 Key: SPARK-13970
 URL: https://issues.apache.org/jira/browse/SPARK-13970
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: zhengruifeng
Priority: Minor


NMF is to find two non-negative matrices (W, H) whose product W * H.T 
approximates the non-negative matrix X. This factorization can be used for 
example for dimensionality reduction, source separation or topic extraction.

NMF was implemented in several packages:
Scikit-Learn 
(http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html#sklearn.decomposition.NMF)
R-NMF (https://cran.r-project.org/web/packages/NMF/index.html)
LibNMF (http://www.univie.ac.at/rlcta/software/)

I have implemented in MLlib according to the following papers:
Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis 
on MapReduce (http://research.microsoft.com/pubs/119077/DNMF.pdf)
Algorithms for Non-negative Matrix Factorization 
(http://papers.nips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization.pdf)

It can be used like this:

import org.apache.spark.mllib.linalg.{DenseMatrix, Vectors}
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

val m = 4
val n = 3
val data = Seq(
  (0L, Vectors.dense(0.0, 1.0, 2.0)),
  (1L, Vectors.dense(3.0, 4.0, 5.0)),
  (3L, Vectors.dense(9.0, 0.0, 1.0))
).map(x => IndexedRow(x._1, x._2))

val A = new IndexedRowMatrix(sc.parallelize(data)).toCoordinateMatrix()
val k = 2

// run the nmf algo
val r = NMF.solve(A, k, 10)

val rW = r.W.toBlockMatrix().toLocalMatrix().asInstanceOf[DenseMatrix]
>>> org.apache.spark.mllib.linalg.DenseMatrix =
1.1349295096806706  1.4423101890626953E-5
3.453054133110303   0.46312492493865615
0.0 0.0
0.3133764134585149  2.70684017255672

val rH = r.H.toBlockMatrix().toLocalMatrix().asInstanceOf[DenseMatrix]
>>> org.apache.spark.mllib.linalg.DenseMatrix =
0.4184163313845057  3.2719352525149286
1.12188012613645    0.002939823716977737
1.456499371939653   0.18992996116069297


val R = rW.multiply(rH.transpose)
>>> org.apache.spark.mllib.linalg.DenseMatrix =
0.4749202332761286  1.273254903877907    1.6530268574248572
2.9601290106732367  3.8752743120480346   5.117332475154927
0.0 0.0  0.0
8.987727592773672   0.35952840319637736  0.9705425982249293

val AD = A.toBlockMatrix().toLocalMatrix()
>>> org.apache.spark.mllib.linalg.Matrix =
0.0  1.0  2.0
3.0  4.0  5.0
0.0  0.0  0.0
9.0  0.0  1.0

var loss = 0.0
for(i <- 0 until AD.numRows; j <- 0 until AD.numCols) {
   val diff = AD(i, j) - R(i, j)
   loss += diff * diff
}
loss
>>> Double = 0.5817999580912183
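
The loss computed by the loop above is the squared Frobenius norm of the 
reconstruction error, i.e. the usual least-squares NMF objective:

{noformat}
\min_{W \ge 0,\; H \ge 0} \; \lVert X - W H^T \rVert_F^2 \;=\; \sum_{i,j} \bigl(X_{ij} - (W H^T)_{ij}\bigr)^2
{noformat}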





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12148) SparkR: rename DataFrame to SparkDataFrame

2016-03-19 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200940#comment-15200940
 ] 

Sun Rui commented on SPARK-12148:
-

An issue reported in the Spark user list may be related to this naming conflict.
{code}
countData <- matrix(1:100,ncol=4)
condition <- factor(c("A","A","B","B"))
dds <- DESeqDataSetFromMatrix(countData, DataFrame(condition), ~ condition)

Works if i dont initialize the sparkR environment. 
 if I do library(SparkR) and sqlContext <- sparkRSQL.init(sc)  it gives 
following error 

> dds <- DESeqDataSetFromMatrix(countData, as.data.frame(condition), ~ 
> condition)
Error in DataFrame(colData, row.names = rownames(colData)) : 
  cannot coerce class "data.frame" to a DataFrame
{code}

> SparkR: rename DataFrame to SparkDataFrame
> --
>
> Key: SPARK-12148
> URL: https://issues.apache.org/jira/browse/SPARK-12148
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Michael Lawrence
>
> The SparkR package represents a Spark DataFrame with the class "DataFrame". 
> That conflicts with the more general DataFrame class defined in the S4Vectors 
> package. Would it not be more appropriate to use the name "SparkDataFrame" 
> instead?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14001) support multi-children Union in SQLBuilder

2016-03-19 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-14001:
---

 Summary: support multi-children Union in SQLBuilder
 Key: SPARK-14001
 URL: https://issues.apache.org/jira/browse/SPARK-14001
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13458) Datasets cannot be sorted

2016-03-19 Thread Oliver Beattie (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201112#comment-15201112
 ] 

Oliver Beattie commented on SPARK-13458:


Yes, it looks like these methods were added a week ago for inclusion in 2.0.

https://github.com/apache/spark/commit/1d542785b9949e7f92025e6754973a779cc37c52

Thanks a lot.

> Datasets cannot be sorted
> -
>
> Key: SPARK-13458
> URL: https://issues.apache.org/jira/browse/SPARK-13458
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Oliver Beattie
>
> There doesn't appear to be any way to sort a {{Dataset}} at present, without 
> first converting it to a {{DataFrame}}.
> Methods like {{orderBy}}, {{sort}}, and {{sortWithinPartitions}} which are 
> present on {{DataFrame}}, or {{sortBy}} which is present on {{RDD}}, are 
> absent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13427) Support USING clause in JOIN

2016-03-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-13427.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11297
[https://github.com/apache/spark/pull/11297]

> Support USING clause in JOIN
> 
>
> Key: SPARK-13427
> URL: https://issues.apache.org/jira/browse/SPARK-13427
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dilip Biswal
> Fix For: 2.0.0
>
>
> Support queries that JOIN tables with USING clause.
> SELECT * from table1 JOIN table2 USING 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14004) AttributeReference and Alias should only use their first qualifier to build SQL representations

2016-03-19 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-14004:
--

 Summary: AttributeReference and Alias should only use their first 
qualifier to build SQL representations
 Key: SPARK-14004
 URL: https://issues.apache.org/jira/browse/SPARK-14004
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Cheng Lian
Assignee: Cheng Lian


Current implementation joins all qualifiers, which is wrong.

However, this doesn't cause any real SQL generation bugs as there is always at 
most one qualifier for any given {{AttributeReference}} or {{Alias}}.

We can probably use {{Option\[String\]}} instead of {{Seq\[String\]}} to 
represent qualifiers.
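
A minimal sketch of the suggested behaviour (simplified; `qualifiers` and `name` 
stand in for the corresponding fields, this is not the actual Spark code):

{code}
// Simplified sketch: build the SQL string from at most the first qualifier
// instead of joining all of them.
def sqlRepresentation(qualifiers: Seq[String], name: String): String =
  qualifiers.headOption.map(q => s"$q.$name").getOrElse(name)
{code}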



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13996) Add more not null attributes for Filter codegen

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13996:


Assignee: Apache Spark

> Add more not null attributes for Filter codegen
> ---
>
> Key: SPARK-13996
> URL: https://issues.apache.org/jira/browse/SPARK-13996
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> Filter codegen finds the attributes not null by checking IsNotNull(a) 
> expression with a condition {{if child.output.contains(a)}}. However, the 
> current approach to checking it is not comprehensive. We can improve it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set

2016-03-19 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15199136#comment-15199136
 ] 

Dilip Biswal commented on SPARK-13865:
--

[~smilegator] Quick update on this ..

This also seems related to the null safe equal issue. I just put a comment on 
[spark-13859|https://issues.apache.org/jira/browse/SPARK-13859]

Here is the output of the query with the expected count after making a similar 
modification.

{code}
spark-sql> select count(*)
 > from 
 >  (select distinct c_last_name as cln1, c_first_name as cfn1, 
d_date as ddate1, 1 as notnull1
 >from store_sales
 > JOIN date_dim ON store_sales.ss_sold_date_sk <=> 
date_dim.d_date_sk
 > JOIN customer ON store_sales.ss_customer_sk <=> 
customer.c_customer_sk
 >where
 >  d_month_seq between 1200 and 1200+11
 >) tmp1
 >left outer join
 >   (select distinct c_last_name as cln2, c_first_name as cfn2, 
d_date as ddate2, 1 as notnull2
 >from catalog_sales
 > JOIN date_dim ON catalog_sales.cs_sold_date_sk <=> 
date_dim.d_date_sk
 > JOIN customer ON catalog_sales.cs_bill_customer_sk <=> 
customer.c_customer_sk
 >where 
 >  d_month_seq between 1200 and 1200+11
 >) tmp2 
 >   on (tmp1.cln1 <=> tmp2.cln2)
 >   and (tmp1.cfn1 <=> tmp2.cfn2)
 >   and (tmp1.ddate1<=> tmp2.ddate2)
 >left outer join
 >   (select distinct c_last_name as cln3, c_first_name as cfn3 , 
d_date as ddate3, 1 as notnull3
 >from web_sales
 > JOIN date_dim ON web_sales.ws_sold_date_sk <=> 
date_dim.d_date_sk
 > JOIN customer ON web_sales.ws_bill_customer_sk <=> 
customer.c_customer_sk
 >where 
 >  d_month_seq between 1200 and 1200+11
 >) tmp3 
 >   on (tmp1.cln1 <=> tmp3.cln3)
 >   and (tmp1.cfn1 <=> tmp3.cfn3)
 >   and (tmp1.ddate1<=> tmp3.ddate3)
 > where  
 > notnull2 is null and notnull3 is null;
47298   
Time taken: 13.561 seconds, Fetched 1 row(s)

{code}
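For reference, a tiny spark-shell check (illustrative only) of why {{<=>}} changes the row counts here: a plain {{=}} comparison never matches when either side is NULL, while the null-safe operator treats two NULLs as equal.

{code}
// Illustrative only; run in spark-shell with a SQLContext in scope.
// The first column evaluates to NULL, the second to true.
sqlContext.sql("SELECT NULL = NULL, NULL <=> NULL").show()
{code}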

> TPCDS query 87 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13865
> URL: https://issues.apache.org/jira/browse/SPARK-13865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 87 returns wrong results compared 
> to the official result set. This is at 1GB SF (validation run).
> SparkSQL returns a count of 47555; the answer set expects 47298.
> Actual results:
> {noformat}
> [47555]
> {noformat}
> {noformat}
> Expected:
> +---+
> | 1 |
> +---+
> | 47298 |
> +---+
> {noformat}
> Query used:
> {noformat}
> -- start query 87 in stream 0 using template query87.tpl and seed 
> QUALIFICATION
> select count(*) 
> from 
>  (select distinct c_last_name as cln1, c_first_name as cfn1, d_date as 
> ddate1, 1 as notnull1
>from store_sales
> JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk
>where
>  d_month_seq between 1200 and 1200+11
>) tmp1
>left outer join
>   (select distinct c_last_name as cln2, c_first_name as cfn2, d_date as 
> ddate2, 1 as notnull2
>from catalog_sales
> JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON catalog_sales.cs_bill_customer_sk = 
> customer.c_customer_sk
>where 
>  d_month_seq between 1200 and 1200+11
>) tmp2 
>   on (tmp1.cln1 = tmp2.cln2)
>   and (tmp1.cfn1 = tmp2.cfn2)
>   and (tmp1.ddate1= tmp2.ddate2)
>left outer join
>   (select distinct c_last_name as cln3, c_first_name as cfn3 , d_date as 
> ddate3, 1 as notnull3
>from web_sales
> JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON web_sales.ws_bill_customer_sk = 
> customer.c_customer_sk
>where 
>  d_month_seq between 1200 and 1200+11
>) tmp3 
>   on (tmp1.cln1 = tmp3.cln3)
>   and (tmp1.cfn1 = tmp3.cfn3)
>   and (tmp1.ddate1= tmp3.ddate3)
> where  
> notnull2 is null and notnull3 is null  
> ;
> -- end query 87 in stream 0 using template query87.tpl
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (SPARK-13955) Spark in yarn mode fails

2016-03-19 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-13955:
---
Description: 
I ran spark-shell in yarn-client mode, but from the logs it seems the Spark 
assembly jar is not uploaded to HDFS. This may be a known issue from the 
ongoing work on SPARK-11157; creating this ticket to track it. [~vanzin]
{noformat}
16/03/17 11:58:59 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
is set, falling back to uploading libraries under SPARK_HOME.
16/03/17 11:58:59 INFO Client: Uploading resource 
file:/Users/jzhang/github/spark/lib/apache-rat-0.10.jar -> 
hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0004/apache-rat-0.10.jar
16/03/17 11:58:59 INFO Client: Uploading resource 
file:/Users/jzhang/github/spark/lib/apache-rat-0.11.jar -> 
hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0004/apache-rat-0.11.jar
16/03/17 11:59:00 INFO Client: Uploading resource 
file:/private/var/folders/dp/hmchg5dd3vbcvds26q91spdwgp/T/spark-36cacbad-ca5b-482b-8ca8-607499acaaba/__spark_conf__4427292248554277597.zip
 -> 
hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0004/__spark_conf__4427292248554277597.zip
{noformat}

message in AM container
{noformat}
Error: Could not find or load main class 
org.apache.spark.deploy.yarn.ExecutorLauncher
{noformat}

  was:
Seems the spark assembly jar is not uploaded to AM. This may be known issue in 
the process of SPARK-11157, create this ticket to track this issue. [~vanzin]
{noformat}
16/03/17 11:58:59 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
is set, falling back to uploading libraries under SPARK_HOME.
16/03/17 11:58:59 INFO Client: Uploading resource 
file:/Users/jzhang/github/spark/lib/apache-rat-0.10.jar -> 
hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0004/apache-rat-0.10.jar
16/03/17 11:58:59 INFO Client: Uploading resource 
file:/Users/jzhang/github/spark/lib/apache-rat-0.11.jar -> 
hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0004/apache-rat-0.11.jar
16/03/17 11:59:00 INFO Client: Uploading resource 
file:/private/var/folders/dp/hmchg5dd3vbcvds26q91spdwgp/T/spark-36cacbad-ca5b-482b-8ca8-607499acaaba/__spark_conf__4427292248554277597.zip
 -> 
hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0004/__spark_conf__4427292248554277597.zip
{noformat}

message in AM container
{noformat}
Error: Could not find or load main class 
org.apache.spark.deploy.yarn.ExecutorLauncher
{noformat}


> Spark in yarn mode fails
> 
>
> Key: SPARK-13955
> URL: https://issues.apache.org/jira/browse/SPARK-13955
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>
> I ran spark-shell in yarn-client mode, but from the logs it seems the Spark 
> assembly jar is not uploaded to HDFS. This may be a known issue from the 
> ongoing work on SPARK-11157; creating this ticket to track it. [~vanzin]
> {noformat}
> 16/03/17 11:58:59 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
> is set, falling back to uploading libraries under SPARK_HOME.
> 16/03/17 11:58:59 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.10.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0004/apache-rat-0.10.jar
> 16/03/17 11:58:59 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.11.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0004/apache-rat-0.11.jar
> 16/03/17 11:59:00 INFO Client: Uploading resource 
> file:/private/var/folders/dp/hmchg5dd3vbcvds26q91spdwgp/T/spark-36cacbad-ca5b-482b-8ca8-607499acaaba/__spark_conf__4427292248554277597.zip
>  -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0004/__spark_conf__4427292248554277597.zip
> {noformat}
> message in AM container
> {noformat}
> Error: Could not find or load main class 
> org.apache.spark.deploy.yarn.ExecutorLauncher
> {noformat}
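As a workaround sketch (the HDFS path is a placeholder, and the exact behaviour depends on how SPARK-11157 lands), pointing {{spark.yarn.jars}} at a pre-staged location avoids the fallback that uploads everything under SPARK_HOME:

{code}
// Sketch only: set spark.yarn.jars programmatically; the path below is a placeholder.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("yarn-client-example")
  .set("spark.yarn.jars", "hdfs:///spark/jars/*.jar") // jars staged on HDFS beforehand
{code}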



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13877) Consider removing Kafka modules from Spark / Spark Streaming

2016-03-19 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200188#comment-15200188
 ] 

Mark Grover commented on SPARK-13877:
-

Yeah, that totally makes sense. I agree that it's a big change but I also think 
we can't really keep the same package name if this code moves out of Apache 
Spark.

So should we mark this as Won't Fix then?

> Consider removing Kafka modules from Spark / Spark Streaming
> 
>
> Key: SPARK-13877
> URL: https://issues.apache.org/jira/browse/SPARK-13877
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1
>Reporter: Hari Shreedharan
>
> Based on the discussion the PR for SPARK-13843 
> ([here|https://github.com/apache/spark/pull/11672#issuecomment-196553283]), 
> we should consider moving the Kafka modules out of Spark as well. 
> Providing newer functionality (like security) has become painful while 
> maintaining compatibility with older versions of Kafka. Moving this out 
> allows more flexibility, allowing users to mix and match Kafka and Spark 
> versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13915) Allow bin/spark-submit to be called via symbolic link

2016-03-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13915.
---
Resolution: Not A Problem

> Allow bin/spark-submit to be called via symbolic link
> -
>
> Key: SPARK-13915
> URL: https://issues.apache.org/jira/browse/SPARK-13915
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
> Environment: CentOS 6.6
> Tarbal spark distribution and CDH-5.x.x Spark version (both)
>Reporter: Rafael Pecin Ferreira
>Priority: Minor
>
> We have a CDH-5 cluster that comes with spark-1.5.0, and we needed to use 
> spark-1.5.1 for bug fixes.
> When I set up the spark (out of the CDH box) via the system alternatives, it 
> created a sequence of symbolic links to the target spark installation.
> When I tried to run spark-submit, the bash process called the target with "$0" 
> as /usr/bin/spark-submit, but this script uses the "$0" variable to locate its 
> deps, and I was seeing these messages:
> [hdfs@server01 ~]$ env spark-submit
> ls: cannot access /usr/assembly/target/scala-2.10: No such file or directory
> Failed to find Spark assembly in /usr/assembly/target/scala-2.10.
> You need to build Spark before running this program.
> I fixed the spark-submit script by adding these lines:
> if [ -h "$0" ] ; then
> checklink="$0";
> while [ -h "$checklink" ] ; do
> checklink=`readlink $checklink`
> done
> SPARK_HOME="$(cd "`dirname "$checklink"`"/..; pwd)";
> else
> SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)";
> fi
> It would be very nice if this piece of code could be put into the spark-submit 
> script, to allow us to have multiple spark alternatives on the system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13981) Improve Filter generated code to defer variable evaluation within operator

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13981:


Assignee: Apache Spark

> Improve Filter generated code to defer variable evaluation within operator
> --
>
> Key: SPARK-13981
> URL: https://issues.apache.org/jira/browse/SPARK-13981
> Project: Spark
>  Issue Type: Improvement
>Reporter: Nong Li
>Assignee: Apache Spark
>Priority: Minor
>
> We can improve the generated filter code by deferring evaluation of variables 
> until just before they are needed.
> For example, for x > 1 and y > b
> we can do
> {code}
> x = ...
> if (x <= 1) continue
> y = ...
> {code}
> instead of the current approach, where we do
> {code}
> x = ...
> y = ...
> if (x <= 1) continue
> ...
> {code}
> This is helpful if evaluating y has any cost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14006) Builds of 1.6 branch fail R style check

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14006:


Assignee: (was: Apache Spark)

> Builds of 1.6 branch fail R style check
> ---
>
> Key: SPARK-14006
> URL: https://issues.apache.org/jira/browse/SPARK-14006
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, Tests
>Reporter: Yin Huai
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-1.6-test-sbt-hadoop-2.2/152/console



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13989) Remove non-vectorized/unsafe-row parquet record reader

2016-03-19 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-13989:
---
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-14008

> Remove non-vectorized/unsafe-row parquet record reader
> --
>
> Key: SPARK-13989
> URL: https://issues.apache.org/jira/browse/SPARK-13989
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
>Priority: Minor
>
> Clean up the new parquet record reader by removing the non-vectorized parquet 
> reader code from `UnsafeRowParquetRecordReader`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13313) Strongly connected components doesn't find all strongly connected components

2016-03-19 Thread Petar Zecevic (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200028#comment-15200028
 ] 

Petar Zecevic commented on SPARK-13313:
---

Ok, thanks for reporting. I'll look into this. 

> Strongly connected components doesn't find all strongly connected components
> 
>
> Key: SPARK-13313
> URL: https://issues.apache.org/jira/browse/SPARK-13313
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.6.0
>Reporter: Petar Zecevic
>
> The strongly connected components algorithm doesn't find all strongly connected 
> components. I was using the Wikispeedia dataset 
> (http://snap.stanford.edu/data/wikispeedia.html), and the algorithm found 519 
> SCCs, one of which had 4051 vertices that in reality don't have any edges 
> between them. 
> I think the problem could be on line 89 of the StronglyConnectedComponents.scala 
> file, where EdgeDirection.In should be changed to EdgeDirection.Out. I believe 
> the second Pregel call should use the Out edge direction, the same as the first 
> call, because the direction is reversed in the provided sendMsg function 
> (the message is sent to the source vertex, not the destination vertex).
> If that is changed (line 89), the algorithm starts finding many more SCCs, 
> but eventually a stack overflow exception occurs. I believe graph objects that 
> are changed through iterations should not be cached, but checkpointed.
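For context, a minimal GraphX snippet (the edge list path is a placeholder) that reproduces the symptom by counting component sizes; with the reported behaviour, one component comes out implausibly large:

{code}
// Sketch: load a directed graph and inspect strongly connected component sizes.
import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/wikispeedia_edges.txt") // placeholder path
val scc = graph.stronglyConnectedComponents(20)

// Component id -> number of vertices in it, largest components first.
val sizes = scc.vertices
  .map { case (_, compId) => (compId, 1L) }
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)
sizes.take(5).foreach(println)
{code}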



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13289) Word2Vec generate infinite distances when numIterations>5

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13289:


Assignee: (was: Apache Spark)

> Word2Vec generate infinite distances when numIterations>5
> -
>
> Key: SPARK-13289
> URL: https://issues.apache.org/jira/browse/SPARK-13289
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0
> Environment: Linux, Scala
>Reporter: Qi Dai
>  Labels: features
>
> I recently ran some word2vec experiments on a cluster with 50 executors on 
> a large text dataset, but found that when the number of iterations is larger 
> than 5, the distances between words are all infinite. My code looks like 
> this:
> val text = sc.textFile("/project/NLP/1_biliion_words/train").map(_.split(" 
> ").toSeq)
> import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
> val word2vec = new 
> Word2Vec().setMinCount(25).setVectorSize(96).setNumPartitions(99).setNumIterations(10).setWindowSize(5)
> val model = word2vec.fit(text)
> val synonyms = model.findSynonyms("who", 40)
> for((synonym, cosineSimilarity) <- synonyms) {
>   println(s"$synonym $cosineSimilarity")
> }
> The results are: 
> to Infinity
> and Infinity
> that Infinity
> with Infinity
> said Infinity
> it Infinity
> by Infinity
> be Infinity
> have Infinity
> he Infinity
> has Infinity
> his Infinity
> an Infinity
> ) Infinity
> not Infinity
> who Infinity
> I Infinity
> had Infinity
> their Infinity
> were Infinity
> they Infinity
> but Infinity
> been Infinity
> I tried many different datasets and different words for finding synonyms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13932) CUBE Query with filter (HAVING) and condition (IF) raises an AnalysisException

2016-03-19 Thread Tien-Dung LE (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15199689#comment-15199689
 ] 

Tien-Dung LE commented on SPARK-13932:
--

The error is still present in the latest Spark code (version 2.0.0-SNAPSHOT).

> CUBE Query with filter (HAVING) and condition (IF) raises an AnalysisException
> --
>
> Key: SPARK-13932
> URL: https://issues.apache.org/jira/browse/SPARK-13932
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 2.0.0
>Reporter: Tien-Dung LE
>
> A complex aggregate query that uses a condition in an aggregate function 
> together with a GROUP BY ... HAVING clause raises an exception. This issue 
> only happens in Spark 1.6.+, not in Spark 1.5.+.
> Here is a typical error message {code}
> org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: 
> b#55, b#124.; line 1 pos 178
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
> {code}
> Here is a code snippet to re-produce the error in a spark-shell session:
> {code}
> import sqlContext.implicits._
> case class Toto(  a: String = f"${(math.random*1e6).toLong}%06.0f",
>   b: Int = (math.random*1e3).toInt,
>   n: Int = (math.random*1e3).toInt,
>   m: Double = (math.random*1e3))
> val data = sc.parallelize(1 to 1e6.toInt).map(i => Toto())
> val df: org.apache.spark.sql.DataFrame = sqlContext.createDataFrame( data )
> df.registerTempTable( "toto" )
> val sqlSelect1   = "SELECT a, b, COUNT(1) AS k1, COUNT(1) AS k2, SUM(m) AS 
> k3, GROUPING__ID"
> val sqlSelect2   = "SELECT a, b, COUNT(1) AS k1, COUNT(IF(n > 500,1,0)) AS 
> k2, SUM(m) AS k3, GROUPING__ID"
> val sqlGroupBy  = "FROM toto GROUP BY a, b GROUPING SETS ((a,b),(a),(b))"
> val sqlHaving   = "HAVING ((GROUPING__ID & 1) == 1) AND (b > 500)"
> sqlContext.sql( s"$sqlSelect1 $sqlGroupBy $sqlHaving" ) // OK
> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy" ) // OK
> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy $sqlHaving" ) // ERROR
> {code}
> And here is the full log
> {code}
> scala> sqlContext.sql( s"$sqlSelect1 $sqlGroupBy $sqlHaving" )
> res12: org.apache.spark.sql.DataFrame = [a: string, b: int, k1: bigint, k2: 
> bigint, k3: double, GROUPING__ID: int]
> scala> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy" )
> res13: org.apache.spark.sql.DataFrame = [a: string, b: int, k1: bigint, k2: 
> bigint, k3: double, GROUPING__ID: int]
> scala> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy $sqlHaving" ) // ERROR
> org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: 
> b#55, b#124.; line 1 pos 178
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:171)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471)
>   at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:471)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:467)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:316)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:316)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:265)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at 

[jira] [Updated] (SPARK-14003) Multi-session can not work when one session is moving files for "INSERT ... SELECT" clause

2016-03-19 Thread Weizhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weizhong updated SPARK-14003:
-
Summary: Multi-session can not work when one session is moving files for 
"INSERT ... SELECT" clause  (was: Multi-session can not work when run one 
session is running "INSERT ... SELECT" move files step)

> Multi-session can not work when one session is moving files for "INSERT ... 
> SELECT" clause
> --
>
> Key: SPARK-14003
> URL: https://issues.apache.org/jira/browse/SPARK-14003
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Weizhong
>Priority: Critical
>
> 1. Start ThriftServer
> 2. beeline A connects to ThriftServer and runs "INSERT INTO ... SELECT"
> 3. when beeline A's job is finished and it starts moving files, beeline B 
> connects to ThriftServer, and then it hangs (stays pending)
> This is because we run client.loadDynamicPartitions inside withHiveState, so 
> the lock is held, and beeline B can't get the lock until beeline A has 
> finished moving files.
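As a simplified sketch of the locking pattern described above (not the actual Hive client wrapper code): everything run inside withHiveState holds one client-wide lock, so a long file move in one session blocks every other session.

{code}
// Simplified illustration only; the real method lives in Spark's Hive client
// wrapper and does considerably more than this.
class HiveClientSketch {
  private val clientLock = new Object

  def withHiveState[A](f: => A): A = clientLock.synchronized {
    f // while session A moves files in here, session B's calls wait on clientLock
  }

  def loadDynamicPartitions(): Unit = withHiveState {
    // long-running file moves happen here, holding the lock the whole time
  }
}
{code}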



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3308) Ability to read JSON Arrays as tables

2016-03-19 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15198543#comment-15198543
 ] 

Hyukjin Kwon commented on SPARK-3308:
-

I removed the PR link, https://github.com/apache/spark/pull/11752, because a new 
JIRA was linked for it.

> Ability to read JSON Arrays as tables
> -
>
> Key: SPARK-3308
> URL: https://issues.apache.org/jira/browse/SPARK-3308
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Yin Huai
>Priority: Critical
> Fix For: 1.2.0
>
>
> Right now we can only read json where each object is on its own line.  It 
> would be nice to be able to read top level json arrays where each element 
> maps to a row.
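To make the distinction concrete, a small sketch (data inlined for illustration) of the line-delimited format that works today versus the top-level array that does not:

{code}
// Works today: one JSON object per line.
val lineDelimited = sc.parallelize(Seq(
  """{"name": "alice", "age": 30}""",
  """{"name": "bob", "age": 25}"""))
sqlContext.read.json(lineDelimited).show()

// Desired: a single top-level JSON array, each element becoming a row.
// [{"name": "alice", "age": 30}, {"name": "bob", "age": 25}]
{code}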



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13858) TPCDS query 21 returns wrong results compared to TPC official result set

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13858:


Assignee: (was: Apache Spark)

> TPCDS query 21 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13858
> URL: https://issues.apache.org/jira/browse/SPARK-13858
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 21 returns wrong results compared 
> to the official result set. This is at 1GB SF (validation run).
> SparkSQL is missing at least one row (grep for ABDA); I believe 2 
> other rows are missing as well.
> Actual results:
> {noformat}
> [null,AABD,2565,1922]
> [null,AAHD,2956,2052]
> [null,AALA,2042,1793]
> [null,ACGC,2373,1771]
> [null,ACKC,2321,1856]
> [null,ACOB,1504,1397]
> [null,ADKB,1820,2163]
> [null,AEAD,2631,1965]
> [null,AEOC,1659,1798]
> [null,AFAC,1965,1705]
> [null,AFAD,1769,1313]
> [null,AHDE,2700,1985]
> [null,AHHA,1578,1082]
> [null,AIEC,1756,1804]
> [null,AIMC,3603,2951]
> [null,AJAC,2109,1989]
> [null,AJKB,2573,3540]
> [null,ALBE,3458,2992]
> [null,ALCE,1720,1810]
> [null,ALEC,2569,1946]
> [null,ALNB,2552,1750]
> [null,ANFE,2022,2269]
> [null,AOIB,2982,2540]
> [null,APJB,2344,2593]
> [null,BAPD,2182,2787]
> [null,BDCE,2844,2069]
> [null,BDDD,2417,2537]
> [null,BDJA,1584,1666]
> [null,BEOD,2141,2649]
> [null,BFCC,2745,2020]
> [null,BFMB,1642,1364]
> [null,BHPC,1923,1780]
> [null,BIDB,1956,2836]
> [null,BIGB,2023,2344]
> [null,BIJB,1977,2728]
> [null,BJFE,1891,2390]
> [null,BLDE,1983,1797]
> [null,BNID,2485,2324]
> [null,BNLD,2385,2786]
> [null,BOMB,2291,2092]
> [null,CAAA,2233,2560]
> [null,CBCD,1540,2012]
> [null,CBIA,2394,2122]
> [null,CBPB,1790,1661]
> [null,CCMD,2654,2691]
> [null,CDBC,1804,2072]
> [null,CFEA,1941,1567]
> [null,CGFD,2123,2265]
> [null,CHPC,2933,2174]
> [null,CIGD,2618,2399]
> [null,CJCB,2728,2367]
> [null,CJLA,1350,1732]
> [null,CLAE,2578,2329]
> [null,CLGA,1842,1588]
> [null,CLLB,3418,2657]
> [null,CLOB,3115,2560]
> [null,CMAD,1991,2243]
> [null,CMJA,1261,1855]
> [null,CMLA,3288,2753]
> [null,CMPD,1320,1676]
> [null,CNGB,2340,2118]
> [null,CNHD,3519,3348]
> [null,CNPC,2561,1948]
> [null,DCPC,2664,2627]
> [null,DDHA,1313,1926]
> [null,DDND,1109,835]
> [null,DEAA,2141,1847]
> [null,DEJA,3142,2723]
> [null,DFKB,1470,1650]
> [null,DGCC,2113,2331]
> [null,DGFC,2201,2928]
> [null,DHPA,2467,2133]
> [null,DMBA,3085,2087]
> [null,DPAB,3494,3081]
> [null,EAEC,2133,2148]
> [null,EAPA,1560,1275]
> [null,ECGC,2815,3307]
> [null,EDPD,2731,1883]
> [null,EEEC,2024,1902]
> [null,EEMC,2624,2387]
> [null,EFFA,2047,1878]
> [null,EGJA,2403,2633]
> [null,EGMA,2784,2772]
> [null,EGOC,2389,1753]
> [null,EHFD,1940,1420]
> [null,EHLB,2320,2057]
> [null,EHPA,1898,1853]
> [null,EIPB,2930,2326]
> [null,EJAE,2582,1836]
> [null,EJIB,2257,1681]
> [null,EJJA,2791,1941]
> [null,EJJD,3410,2405]
> [null,EJNC,2472,2067]
> [null,EJPD,1219,1229]
> [null,EKEB,2047,1713]
> [null,EMEA,2502,1897]
> [null,EMKC,2362,2042]
> [null,ENAC,2011,1909]
> [null,ENFB,2507,2162]
> [null,ENOD,3371,2709]
> {noformat}
> Expected results:
> {noformat}
> +--+--++---+
> | W_WAREHOUSE_NAME | I_ITEM_ID| INV_BEFORE | INV_AFTER |
> +--+--++---+
> | Bad cards must make. | AACD |   1889 |  2168 |
> | Bad cards must make. | AAHD |   2739 |  2039 |
> | Bad cards 

[jira] [Commented] (SPARK-13963) Add binary toggle Param to ml.HashingTF

2016-03-19 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200047#comment-15200047
 ] 

Nick Pentreath commented on SPARK-13963:


Sure, assigned to you.

> Add binary toggle Param to ml.HashingTF
> ---
>
> Key: SPARK-13963
> URL: https://issues.apache.org/jira/browse/SPARK-13963
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: Bryan Cutler
>Priority: Trivial
>
> It would be handy to add a binary toggle Param to {{HashingTF}}, as in the 
> scikit-learn one: 
> http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html
> If set, then all non-zero counts will be set to 1.
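A sketch of how the proposed Param might be used once added; the {{setBinary}} setter name is hypothetical here, mirroring scikit-learn's {{binary}} option, and the final name may differ.

{code}
import org.apache.spark.ml.feature.HashingTF

val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
  .setNumFeatures(1 << 18)
// .setBinary(true)  // proposed toggle: clamp all non-zero term counts to 1
{code}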



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13969) Extend input format that feature hashing can handle

2016-03-19 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-13969:
---
Description: 
Currently {{HashingTF}} works like {{CountVectorizer}} (the equivalent in 
scikit-learn is {{HashingVectorizer}}). That is, it works on a sequence of 
strings and computes term frequencies.

The use cases for feature hashing extend to arbitrary feature values (binary, 
count or real-valued). For example, scikit-learn's {{FeatureHasher}} can accept 
a sequence of (feature_name, value) pairs (e.g. a map, list). In this way, 
feature hashing can operate as both "one-hot encoder" and "vector assembler" at 
the same time.

Investigate adding a more generic feature hasher (that in turn can be used by 
{{HashingTF}}).

  was:
Currently {{HashingTF}} works like {{CountVectorizer}} (the equivalent in 
scikit-learn is {{HashingVectorizer}}). That is, it works on a sequence of 
strings and computes term frequencies.

The use cases for feature hashing extend to arbitrary feature values (binary, 
count or real-valued). For example, scikit-learn's {{FeatureHasher}} can accept 
a sequence of (feature_name, value) pairs (e.g. a map, list). In this way, 
feature hashing can operate as both "one-hot encoder" and "vector assembler" at 
the same time.

Investigate adding a more generic feature hasher (that in turn can be used by 
{{HashingTF}}.


> Extend input format that feature hashing can handle
> ---
>
> Key: SPARK-13969
> URL: https://issues.apache.org/jira/browse/SPARK-13969
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Nick Pentreath
>Priority: Minor
>
> Currently {{HashingTF}} works like {{CountVectorizer}} (the equivalent in 
> scikit-learn is {{HashingVectorizer}}). That is, it works on a sequence of 
> strings and computes term frequencies.
> The use cases for feature hashing extend to arbitrary feature values (binary, 
> count or real-valued). For example, scikit-learn's {{FeatureHasher}} can 
> accept a sequence of (feature_name, value) pairs (e.g. a map, list). In this 
> way, feature hashing can operate as both "one-hot encoder" and "vector 
> assembler" at the same time.
> Investigate adding a more generic feature hasher (that in turn can be used by 
> {{HashingTF}}).
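For illustration, a minimal sketch of the hashing trick over (feature_name, value) pairs; this is not a proposed API, just the core idea a generic feature hasher would wrap.

{code}
import scala.collection.mutable
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Map (featureName, value) pairs into a fixed-size sparse vector by hashing the
// name to an index and summing values that collide.
def hashFeatures(features: Seq[(String, Double)], numFeatures: Int): Vector = {
  val agg = mutable.Map.empty[Int, Double].withDefaultValue(0.0)
  features.foreach { case (name, value) =>
    val idx = ((name.hashCode % numFeatures) + numFeatures) % numFeatures // a real impl. would use a better hash
    agg(idx) += value
  }
  Vectors.sparse(numFeatures, agg.toSeq)
}

// hashFeatures(Seq("age" -> 30.0, "city=NY" -> 1.0), numFeatures = 16)
{code}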



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13957) Support group by position in SQL

2016-03-19 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-13957:
---

 Summary: Support group by position in SQL
 Key: SPARK-13957
 URL: https://issues.apache.org/jira/browse/SPARK-13957
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin


This is to support group by position in SQL, e.g.

{noformat}
select c1, c2, c3, sum(*) from tbl group by 1, 3, c4
{noformat}

should be equivalent to

{noformat}
select c1, c2, c3, sum(*) from tbl group by c1, c3, c4
{noformat}

We only convert integer literals (not foldable expressions).

For positions that are aggregate functions, an analysis exception should be 
thrown, as in Postgres:

{noformat}

rxin=# select 'one', 'two', count(*) from r1 group by 1, 3;
ERROR:  aggregate functions are not allowed in GROUP BY
LINE 1: select 'one', 'two', count(*) from r1 group by 1, 3;
 ^
{noformat}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13969) Extend input format that feature hashing can handle

2016-03-19 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202007#comment-15202007
 ] 

Joseph K. Bradley edited comment on SPARK-13969 at 3/18/16 7:35 PM:


I think HashingTF could be extended to handle this in two steps:
* Handle more input types [SPARK-11107]
* Accept multiple input columns [SPARK-8418] (This mentions using RFormula 
syntax towards the end, but this linked JIRA is really for the optimized 
implementation, which might use HasInputCols instead.)


was (Author: josephkb):
I think HashingTF could be extended to handle this in two steps:
* Handle more input types [SPARK-11107]
* Accept multiple input columns [SPARK-8418]

> Extend input format that feature hashing can handle
> ---
>
> Key: SPARK-13969
> URL: https://issues.apache.org/jira/browse/SPARK-13969
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Nick Pentreath
>Priority: Minor
>
> Currently {{HashingTF}} works like {{CountVectorizer}} (the equivalent in 
> scikit-learn is {{HashingVectorizer}}). That is, it works on a sequence of 
> strings and computes term frequencies.
> The use cases for feature hashing extend to arbitrary feature values (binary, 
> count or real-valued). For example, scikit-learn's {{FeatureHasher}} can 
> accept a sequence of (feature_name, value) pairs (e.g. a map, list). In this 
> way, feature hashing can operate as both "one-hot encoder" and "vector 
> assembler" at the same time.
> Investigate adding a more generic feature hasher (that in turn can be used by 
> {{HashingTF}}).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13831) TPC-DS Query 35 fails with the following compile error

2016-03-19 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197944#comment-15197944
 ] 

kevin yu commented on SPARK-13831:
--

The same query also fails on Spark SQL 2.0, and the failure can be reduced simply to 
select c_customer_sk from customer where exists (select cr_refunded_customer_sk 
from catalog_returns)

or 

select c_customer_sk from customer where exists (select cr_refunded_customer_sk 
from catalog_returns where cr_refunded_customer_sk = customer.c_customer_sk)

In Hive, the syntax is accepted. 
[~davies] can you confirm that Spark SQL does not support subqueries with EXISTS 
yet? 

> TPC-DS Query 35 fails with the following compile error
> --
>
> Key: SPARK-13831
> URL: https://issues.apache.org/jira/browse/SPARK-13831
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Roy Cecil
>
> TPC-DS Query 35 fails with the following compile error.
> Scala.NotImplementedError: 
> scala.NotImplementedError: No parse rules for ASTNode type: 864, text: 
> TOK_SUBQUERY_EXPR :
> TOK_SUBQUERY_EXPR 1, 439,797, 1370
>   TOK_SUBQUERY_OP 1, 439,439, 1370
> exists 1, 439,439, 1370
>   TOK_QUERY 1, 441,797, 1508
> Pasting Query 35 for easy reference.
> select
>   ca_state,
>   cd_gender,
>   cd_marital_status,
>   cd_dep_count,
>   count(*) cnt1,
>   min(cd_dep_count) cd_dep_count1,
>   max(cd_dep_count) cd_dep_count2,
>   avg(cd_dep_count) cd_dep_count3,
>   cd_dep_employed_count,
>   count(*) cnt2,
>   min(cd_dep_employed_count) cd_dep_employed_count1,
>   max(cd_dep_employed_count) cd_dep_employed_count2,
>   avg(cd_dep_employed_count) cd_dep_employed_count3,
>   cd_dep_college_count,
>   count(*) cnt3,
>   min(cd_dep_college_count) cd_dep_college_count1,
>   max(cd_dep_college_count) cd_dep_college_count2,
>   avg(cd_dep_college_count) cd_dep_college_count3
>  from
>   customer c
>   JOIN customer_address ca ON c.c_current_addr_sk = ca.ca_address_sk
>   JOIN customer_demographics ON cd_demo_sk = c.c_current_cdemo_sk
>   LEFT SEMI JOIN
>   (select ss_customer_sk
>   from store_sales
>JOIN date_dim ON ss_sold_date_sk = d_date_sk
>   where
> d_year = 2002 and
> d_qoy < 4) ss_wh1
>   ON c.c_customer_sk = ss_wh1.ss_customer_sk
>  where
>exists (
> select tmp.customer_sk from (
> select ws_bill_customer_sk  as customer_sk
> from web_sales,date_dim
> where
>   ws_sold_date_sk = d_date_sk and
>   d_year = 2002 and
>   d_qoy < 4
>UNION ALL
> select cs_ship_customer_sk  as customer_sk
> from catalog_sales,date_dim
> where
>   cs_sold_date_sk = d_date_sk and
>   d_year = 2002 and
>   d_qoy < 4
>   ) tmp where c.c_customer_sk = tmp.customer_sk
> )
>  group by ca_state,
>   cd_gender,
>   cd_marital_status,
>   cd_dep_count,
>   cd_dep_employed_count,
>   cd_dep_college_count
>  order by ca_state,
>   cd_gender,
>   cd_marital_status,
>   cd_dep_count,
>   cd_dep_employed_count,
>   cd_dep_college_count
>  limit 100;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11011) UserDefinedType serialization should be strongly typed

2016-03-19 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11011.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11379
[https://github.com/apache/spark/pull/11379]

> UserDefinedType serialization should be strongly typed
> --
>
> Key: SPARK-11011
> URL: https://issues.apache.org/jira/browse/SPARK-11011
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: John Muller
>Priority: Minor
>  Labels: UDT
> Fix For: 2.0.0
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> A UDT's serialize method takes an Any rather than the actual type parameter.  
> The issue lies in the CatalystTypeConverters convertToCatalyst(a: Any): Any 
> method, which pattern matches against a hardcoded list of built-in SQL types.
> The planned fix is to allow the UDT to supply the CatalystTypeConverter to use 
> via a new public method on the abstract class UserDefinedType that allows the 
> implementer to strongly type those conversions.
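To make the proposal concrete, a simplified sketch (types reduced to the essentials, not the real UserDefinedType API) contrasting the current Any-based signature with a strongly typed one:

{code}
// Simplified illustration only; the real class is org.apache.spark.sql.types.UserDefinedType.
// Current shape: the override takes Any and must cast internally.
abstract class LooselyTypedUDT[UserType] {
  def serialize(obj: Any): Any
}

// Proposed direction: let the UDT work with its actual type parameter,
// e.g. by supplying its own strongly typed converter.
abstract class StronglyTypedUDT[UserType] {
  def serialize(obj: UserType): Any
}
{code}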



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13978) [GSoC 2016] Build monitoring UI and infrastructure for Spark SQL and structured streaming

2016-03-19 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-13978:
-
Labels: GSOC2016 mentor  (was: GSOC2016)

> [GSoC 2016] Build monitoring UI and infrastructure for Spark SQL and 
> structured streaming
> -
>
> Key: SPARK-13978
> URL: https://issues.apache.org/jira/browse/SPARK-13978
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Reporter: Yin Huai
>  Labels: GSOC2016, mentor
>
> Will provide more details later.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13938) word2phrase feature created in ML

2016-03-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197890#comment-15197890
 ] 

Apache Spark commented on SPARK-13938:
--

User 's4weng' has created a pull request for this issue:
https://github.com/apache/spark/pull/11766

> word2phrase feature created in ML
> -
>
> Key: SPARK-13938
> URL: https://issues.apache.org/jira/browse/SPARK-13938
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Steve Weng
>Priority: Critical
>   Original Estimate: 840h
>  Remaining Estimate: 840h
>
> I implemented word2phrase (see http://arxiv.org/pdf/1310.4546.pdf), which 
> transforms a sentence of words into one in which certain consecutive 
> words are concatenated, using a training model/estimator (e.g. "I went to 
> New York" becomes "I went to new_york").



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13997) Use Hadoop 2.0 default value for compression in data sources

2016-03-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200984#comment-15200984
 ] 

Apache Spark commented on SPARK-13997:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/11806

> Use Hadoop 2.0 default value for compression in data sources
> 
>
> Key: SPARK-13997
> URL: https://issues.apache.org/jira/browse/SPARK-13997
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> Currently, the JSON, TEXT and CSV data sources use the {{CompressionCodecs}} class 
> to set compression configurations via {{option("compress", "codec")}}.
> I made this use the Hadoop 1.x default value (block-level compression). However, 
> the default value in Hadoop 2.x is record-level compression, as described in 
> [mapred-site.xml|https://hadoop.apache.org/docs/r2.7.1/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml].
> Since Spark drops Hadoop 1.x, it makes sense to use the Hadoop 2.x default 
> values.
> According to [Hadoop: The Definitive Guide, 3rd 
> edition|https://www.safaribooksonline.com/library/view/hadoop-the-definitive/9781449328917/ch04.html],
>  these look like configurations for the unit of compression (record or block).
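For reference, a small sketch of setting a codec through the writer option; the exact option key is an assumption here (the description writes it as "compress", while recent writers expose it as "compression"), and the path is a placeholder.

{code}
// Sketch only: write JSON with an explicit compression codec.
val df = sqlContext.range(0, 100).toDF("id") // any small DataFrame
df.write.option("compression", "gzip").json("/tmp/json-gzip-example") // placeholder path
{code}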



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12789) Support order by position in SQL

2016-03-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12789:

Description: 
This is to support order by position in SQL, e.g.

{noformat}
select c1, c2, c3 from tbl order by 1, 3
{noformat}

should be equivalent to

{noformat}
select c1, c2, c3 from tbl order by c1, c3
{noformat}

We should make sure this also works with select *.


  was:
This is to support order by position in SQL, e.g.

{noformat}
select c1, c2, c3 from tbl order by 1, 3
{noformat}

should be equivalent to

{noformat}
select c1, c2, c3 from tbl order by c1, c3
{noformat}





> Support order by position in SQL
> 
>
> Key: SPARK-12789
> URL: https://issues.apache.org/jira/browse/SPARK-12789
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: zhichao-li
>Priority: Minor
>
> This is to support order by position in SQL, e.g.
> {noformat}
> select c1, c2, c3 from tbl order by 1, 3
> {noformat}
> should be equivalent to
> {noformat}
> select c1, c2, c3 from tbl order by c1, c3
> {noformat}
> We should make sure this also works with select *.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13941) kafka.cluster.BrokerEndPoint cannot be cast to kafka.cluster.Broker

2016-03-19 Thread Hurshal Patel (JIRA)
Hurshal Patel created SPARK-13941:
-

 Summary: kafka.cluster.BrokerEndPoint cannot be cast to 
kafka.cluster.Broker
 Key: SPARK-13941
 URL: https://issues.apache.org/jira/browse/SPARK-13941
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.5.1
Reporter: Hurshal Patel


I am connecting to a Kafka cluster with the following (anonymized) code:

{code:scala}
  var stream = KafkaUtils.createDirectStreamFromZookeeper[String, Array[Byte], 
StringDecoder, DefaultDecoder](
  ssc, kafkaParams, topics)
  stream.foreachRDD { rdd =>
val df = sqlContext.createDataFrame(rdd.map(bytesToString), stringSchema)
df.foreachPartition { partition => 
  val targetNode = chooseTarget(TaskContext.partitionId)
  loadPartition(targetNode, partition)
}
  }
{code}

I am using Kafka 0.8.2.0-1.kafka1.2.0.p0.2 (Cloudera CDH 5.3.1) and Spark 1.4.1 
and this works fine.

After upgrading to Spark 1.5.1, my tasks are failing (stacktrace is below). Are 
there any notable changes to the KafkaDirectStream or KafkaRDD that would cause 
this or does Cloudera's Kafka distribution have known issues with 1.5.1?

{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in 
stage 12407.0 failed 4 times, most recent failure: Lost task 5.3 in stage 
12407.0 (TID 55638, 172.18.203.25): org.apache.spark.SparkException: Couldn't 
connect to leader for topic XXX: java.lang.ClassCastException: 
kafka.cluster.BrokerEndPoint cannot be cast to kafka.cluster.Broker
at 
org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator$$anonfun$connectLeader$1.apply(KafkaRDD.scala:164)
at 
org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator$$anonfun$connectLeader$1.apply(KafkaRDD.scala:164)
at scala.util.Either.fold(Either.scala:97)
at 
org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.connectLeader(KafkaRDD.scala:163)
at 
org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.(KafkaRDD.scala:155)
at org.apache.spark.streaming.kafka.KafkaRDD.compute(KafkaRDD.scala:135)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at 

[jira] [Created] (SPARK-14002) SQLBuilder should add subquery to Aggregate child when necessary

2016-03-19 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-14002:
--

 Summary: SQLBuilder should add subquery to Aggregate child when 
necessary
 Key: SPARK-14002
 URL: https://issues.apache.org/jira/browse/SPARK-14002
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Cheng Lian
Assignee: Cheng Lian


Adding the following test case to {{LogicalPlanToSQLSuite}} to reproduce this 
issue:
{code}
  test("bug") {
checkHiveQl(
  """SELECT COUNT(id)
|FROM
|(
|  SELECT id FROM t0
|) subq
  """.stripMargin
)
  }
{code}
The generated (wrong) SQL is:
{code:sql}
SELECT `gen_attr_46` AS `count(id)`
FROM
(
SELECT count(`gen_attr_45`) AS `gen_attr_46`
FROM
SELECT `gen_attr_45`-- 
FROM--
(   -- A subquery
SELECT `id` AS `gen_attr_45`-- is missing
FROM `default`.`t0` --
) AS gen_subquery_0 --
) AS gen_subquery_1
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13996) Add more not null attributes for Filter codegen

2016-03-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200946#comment-15200946
 ] 

Apache Spark commented on SPARK-13996:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/11810

> Add more not null attributes for Filter codegen
> ---
>
> Key: SPARK-13996
> URL: https://issues.apache.org/jira/browse/SPARK-13996
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> Filter codegen finds the attributes that are not null by checking for an 
> IsNotNull(a) expression with the condition {{if child.output.contains(a)}}. 
> However, the current check is not comprehensive; we can improve it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13923) Implement SessionCatalog to manage temp functions and tables

2016-03-19 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15198529#comment-15198529
 ] 

Yin Huai commented on SPARK-13923:
--

Please note that the temp function part of the PR is a placeholder. We will 
finish it in a follow-up PR.

> Implement SessionCatalog to manage temp functions and tables
> 
>
> Key: SPARK-13923
> URL: https://issues.apache.org/jira/browse/SPARK-13923
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> Today, we have ExternalCatalog, which is dead code. As part of the effort of 
> merging SQLContext/HiveContext we'll parse Hive commands and call the 
> corresponding methods in ExternalCatalog.
> However, this handles only persisted things. We need something in addition to 
> that to handle temporary things. The new catalog is called SessionCatalog and 
> will internally call ExternalCatalog.
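A rough structural sketch of the layering the description calls for (names simplified, not the actual implementation): temporary objects live in the session-level catalog, persistent operations are delegated to ExternalCatalog.

{code}
import scala.collection.mutable

// Simplified sketch; the real SessionCatalog also handles databases, functions,
// name resolution and much more.
class SessionCatalogSketch(external: AnyRef /* stands in for ExternalCatalog */) {
  private val tempTables = mutable.HashMap.empty[String, AnyRef /* stands in for LogicalPlan */]

  def createTempTable(name: String, plan: AnyRef): Unit =
    tempTables(name) = plan

  def lookupTable(name: String): Option[AnyRef] =
    tempTables.get(name) // otherwise fall back to the external (persistent) catalog
}
{code}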



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13934) SqlParser.parseTableIdentifier cannot recognize table name start with scientific notation

2016-03-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197241#comment-15197241
 ] 

Sean Owen commented on SPARK-13934:
---

What does scientific notation have to do with it? It's a table identifier. Is 
it the fact that it starts with a number?

> SqlParser.parseTableIdentifier cannot recognize table name start with 
> scientific notation
> -
>
> Key: SPARK-13934
> URL: https://issues.apache.org/jira/browse/SPARK-13934
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yang Wang
>
> SqlParser.parseTableIdentifier cannot recognize a table name that starts with 
> scientific notation, like "1e30abcdedfg".
> This bug can be reproduced by the following code:
> val conf = new SparkConf().setAppName(s"test").setMaster("local[2]")
> val sc = new SparkContext(conf)
> val hc = new HiveContext(sc)
> val tableName = "1e34abcd"
> hc.sql("select 123").registerTempTable(tableName)
> hc.dropTempTable(tableName)
> The last line will throw a RuntimeException (java.lang.RuntimeException: 
> [1.1] failure: identifier expected).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13973) `ipython notebook` is going away...

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13973:


Assignee: Apache Spark

> `ipython notebook` is going away...
> ---
>
> Key: SPARK-13973
> URL: https://issues.apache.org/jira/browse/SPARK-13973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
> Environment: spark-1.6.1-bin-hadoop2.6
> Anaconda2-2.5.0-Linux-x86_64
>Reporter: Bogdan Pirvu
>Assignee: Apache Spark
>Priority: Trivial
>
> Starting {{pyspark}} with the following environment variables:
> {code:none}
> export IPYTHON=1
> export IPYTHON_OPTS="notebook --no-browser"
> {code}
> yields this warning
> {code:none}
> [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated 
> and will be removed in future versions.
> [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... 
> continue in 5 sec. Press Ctrl-C to quit now.
> {code}
> Changing line 52 from
> {code:none}
> PYSPARK_DRIVER_PYTHON="ipython"
> {code}
> to
> {code:none}
> PYSPARK_DRIVER_PYTHON="jupyter"
> {code}
> in https://github.com/apache/spark/blob/master/bin/pyspark works for me to 
> solve this issue, but I'm not sure if it's sustainable as I'm not familiar 
> with the rest of the code...
> This is the relevant part of my Python environment:
> {code:none}
> ipython   4.1.2py27_0  
> ipython-genutils  0.1.0 
> ipython_genutils  0.1.0py27_0  
> ipywidgets4.1.1py27_0  
> ...
> jupyter   1.0.0py27_1  
> jupyter-client4.2.1 
> jupyter-console   4.1.1 
> jupyter-core  4.1.0 
> jupyter_client4.2.1py27_0  
> jupyter_console   4.1.1py27_0  
> jupyter_core  4.1.0py27_0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13950) Generate code for sort merge left/right outer join

2016-03-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13950:


Assignee: Apache Spark  (was: Davies Liu)

> Generate code for sort merge left/right outer join
> --
>
> Key: SPARK-13950
> URL: https://issues.apache.org/jira/browse/SPARK-13950
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13963) Add binary toggle Param to ml.HashingTF

2016-03-19 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200048#comment-15200048
 ] 

Nick Pentreath commented on SPARK-13963:


Sure, assigned to you.

> Add binary toggle Param to ml.HashingTF
> ---
>
> Key: SPARK-13963
> URL: https://issues.apache.org/jira/browse/SPARK-13963
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: Bryan Cutler
>Priority: Trivial
>
> It would be handy to add a binary toggle Param to {{HashingTF}}, as in the 
> scikit-learn one: 
> http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html
> If set, then all non-zero counts will be set to 1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


