[jira] [Commented] (SPARK-11218) `./sbin/start-slave.sh --help` should print out the help message

2015-11-03 Thread Charles Yeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986884#comment-14986884
 ] 

Charles Yeh commented on SPARK-11218:
-

Created a pull request https://github.com/apache/spark/pull/9432

> `./sbin/start-slave.sh --help` should print out the help message
> 
>
> Key: SPARK-11218
> URL: https://issues.apache.org/jira/browse/SPARK-11218
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Reporter: Jacek Laskowski
>Priority: Minor
>
> Reading the sources showed that the command {{./sbin/start-slave.sh 
> --help}} should print out the help message. It doesn't, really.
> {code}
> ➜  spark git:(master) ✗ ./sbin/start-slave.sh --help
> starting org.apache.spark.deploy.worker.Worker, logging to 
> /Users/jacek/dev/oss/spark/sbin/../logs/spark-jacek-org.apache.spark.deploy.worker.Worker-1-japila.local.out
> failed to launch org.apache.spark.deploy.worker.Worker:
> --properties-file FILE   Path to a custom Spark properties file.
>  Default is conf/spark-defaults.conf.
> full log in 
> /Users/jacek/dev/oss/spark/sbin/../logs/spark-jacek-org.apache.spark.deploy.worker.Worker-1-japila.local.out
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11472) SparkContext creation error after sc.stop() when Spark is compiled for Hive

2015-11-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986939#comment-14986939
 ] 

Sean Owen commented on SPARK-11472:
---

I don't believe you can do this in the shell. You are not intended to stop the 
context or create a new one.

> SparkContext creation error after sc.stop() when Spark is compiled for Hive
> ---
>
> Key: SPARK-11472
> URL: https://issues.apache.org/jira/browse/SPARK-11472
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.1
> Environment: Red Hat ES 6.7 x86_64
> Spark 1.5.1, Scala 2.10.4, Java 1.7.0_85, Hive 1.2.1
> Authentication done through Kerberos
>Reporter: Pierre Beauvois
>
> Spark 1.5.1 has been compiled with the following command:
> {noformat}
> mvn -Pyarn -Phive -Phive-thriftserver -PsparkR -DskipTests -X clean package
> {noformat}
> After its installation, the file "hive-site.xml" has been added to the conf 
> directory (this is not a hard copy, it's a symbolic link). 
> When the spark-shell is started, the SparkContext and the sqlContext are 
> properly created. Nevertheless, when I stop the SparkContext and then try to 
> create a new one, an error appears. The output of this error is the following:
> {code:title=SparkContextCreationError.scala|borderStyle=solid}
> // imports
> scala> import org.apache.spark.SparkConf
> import org.apache.spark.SparkConf
> scala> import org.apache.spark.SparkContext
> import org.apache.spark.SparkContext
> scala> import org.apache.spark.SparkContext._
> import org.apache.spark.SparkContext._
> // simple SparkContext creation
> scala> val sc = new SparkContext(new SparkConf())
> // output error stack
> 15/11/03 09:10:05 ERROR Hive: MetaException(message:Delegation Token can be 
> issued only with kerberos authentication. Current AuthenticationMethod: TOKEN)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_delegation_token_result$get_delegation_token_resultStandardScheme.read(ThriftHiveMetastore.java)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_delegation_token_result$get_delegation_token_resultStandardScheme.read(ThriftHiveMetastore.java)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_delegation_token_result.read(ThriftHiveMetastore.java)
> at 
> org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_delegation_token(ThriftHiveMetastore.java:3715)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_delegation_token(ThriftHiveMetastore.java:3701)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDelegationToken(HiveMetaStoreClient.java:1796)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
> at com.sun.proxy.$Proxy19.getDelegationToken(Unknown Source)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.getDelegationToken(Hive.java:3150)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.deploy.yarn.Client$.org$apache$spark$deploy$yarn$Client$$obtainTokenForHiveMetastore(Client.scala:1260)
> at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:271)
> at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:629)
> at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
> at org.apache.spark.SparkContext.(SparkContext.scala:523)
> at 
> $line23.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:24)
> at 
> $line23.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:29)
> at 
> $line23.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:31)
> at 
> $line23.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
> at 

[jira] [Updated] (SPARK-11472) SparkContext creation error after sc.stop() when Spark is compiled for Hive

2015-11-03 Thread Pierre Beauvois (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Beauvois updated SPARK-11472:

Priority: Major  (was: Critical)

> SparkContext creation error after sc.stop() when Spark is compiled for Hive
> ---
>
> Key: SPARK-11472
> URL: https://issues.apache.org/jira/browse/SPARK-11472
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.1
> Environment: Red Hat ES 6.7 x86_64
> Spark 1.5.1, Scala 2.10.4, Java 1.7.0_85, Hive 1.2.1
> Authentication done through Kerberos
>Reporter: Pierre Beauvois
>
> Spark 1.5.1 has been compiled with the following command:
> {noformat}
> mvn -Pyarn -Phive -Phive-thriftserver -PsparkR -DskipTests -X clean package
> {noformat}
> After its installation, the file "hive-site.xml" has been added to the conf 
> directory (this is not a hard copy, it's a symbolic link). 
> When the spark-shell is started, the SparkContext and the sqlContext are 
> properly created. Nevertheless, when I stop the SparkContext and then try to 
> create a new one, an error appears. The output of this error is the following:
> {code:title=SparkContextCreationError.scala|borderStyle=solid}
> // imports
> scala> import org.apache.spark.SparkConf
> import org.apache.spark.SparkConf
> scala> import org.apache.spark.SparkContext
> import org.apache.spark.SparkContext
> scala> import org.apache.spark.SparkContext._
> import org.apache.spark.SparkContext._
> // simple SparkContext creation
> scala> val sc = new SparkContext(new SparkConf())
> // output error stack
> 15/11/03 09:10:05 ERROR Hive: MetaException(message:Delegation Token can be 
> issued only with kerberos authentication. Current AuthenticationMethod: TOKEN)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_delegation_token_result$get_delegation_token_resultStandardScheme.read(ThriftHiveMetastore.java)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_delegation_token_result$get_delegation_token_resultStandardScheme.read(ThriftHiveMetastore.java)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_delegation_token_result.read(ThriftHiveMetastore.java)
> at 
> org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_delegation_token(ThriftHiveMetastore.java:3715)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_delegation_token(ThriftHiveMetastore.java:3701)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDelegationToken(HiveMetaStoreClient.java:1796)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
> at com.sun.proxy.$Proxy19.getDelegationToken(Unknown Source)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.getDelegationToken(Hive.java:3150)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.deploy.yarn.Client$.org$apache$spark$deploy$yarn$Client$$obtainTokenForHiveMetastore(Client.scala:1260)
> at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:271)
> at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:629)
> at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
> at org.apache.spark.SparkContext.(SparkContext.scala:523)
> at 
> $line23.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:24)
> at 
> $line23.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:29)
> at 
> $line23.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:31)
> at 
> $line23.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
> at $line23.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)
> at $line23.$read$$iwC$$iwC$$iwC$$iwC$$iwC.(:37)
> at 

[jira] [Created] (SPARK-11473) R-like summary statistics with intercept for OLS via normal equation solver

2015-11-03 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-11473:
---

 Summary: R-like summary statistics with intercept for OLS via 
normal equation solver
 Key: SPARK-11473
 URL: https://issues.apache.org/jira/browse/SPARK-11473
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Yanbo Liang


SPARK-9836 provided R-like summary statistics for the coefficients; we should 
also add these statistics for the intercept.
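
For context, a minimal sketch of how the coefficient-level statistics from SPARK-9836 
are surfaced (assuming the spark-shell's sqlContext and the spark.ml normal-equation 
solver); the data below is made up, and the intercept statistics this issue asks for 
would be reported alongside the coefficient ones:

{code}
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.mllib.linalg.Vectors

// A tiny made-up training set with "label" and "features" columns.
val training = sqlContext.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1)),
  (0.0, Vectors.dense(2.0, 1.0)),
  (3.0, Vectors.dense(4.0, 0.5)),
  (2.0, Vectors.dense(3.0, 2.0))
)).toDF("label", "features")

// Fit an OLS model with the normal-equation solver and an intercept.
val model = new LinearRegression().setSolver("normal").setFitIntercept(true).fit(training)

// R-like statistics currently reported per coefficient; this issue proposes
// reporting the same quantities for the intercept as well.
val summary = model.summary
println(summary.coefficientStandardErrors.mkString(", "))
println(summary.tValues.mkString(", "))
println(summary.pValues.mkString(", "))
{code}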






[jira] [Created] (SPARK-11471) Improve the way that we plan shuffled join

2015-11-03 Thread Yin Huai (JIRA)
Yin Huai created SPARK-11471:


 Summary: Improve the way that we plan shuffled join
 Key: SPARK-11471
 URL: https://issues.apache.org/jira/browse/SPARK-11471
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai


Right now, when adaptive query execution is enabled, in most cases we will 
shuffle input tables for every join. However, once we finish our work on 
https://issues.apache.org/jira/browse/SPARK-10665, we will be able to have a 
global view of the input datasets of a stage. Then, we should be able to add 
exchange coordinators after we get the entire physical plan (after the phase 
in which we add Exchanges).

I will try to fill in more information later.
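
For reference, a minimal sketch of the switches this planning work builds on, assuming 
the spark-shell's sqlContext; the property names come from the adaptive-execution work 
in SPARK-9858, and the 64 MB target is only an example value:

{code}
// Enable adaptive query execution so that ExchangeCoordinators can decide the
// number of post-shuffle partitions at runtime.
sqlContext.setConf("spark.sql.adaptive.enabled", "true")
// Target size (in bytes) of a post-shuffle partition used by the coordinator.
sqlContext.setConf("spark.sql.adaptive.shuffle.targetPostShuffleInputSize",
  (64 * 1024 * 1024).toString)
{code}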






[jira] [Commented] (SPARK-11469) Initial implementation

2015-11-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986938#comment-14986938
 ] 

Sean Owen commented on SPARK-11469:
---

Retroactively, initial implementation of what? I get that this is a sub-task 
but a minimally descriptive title is helpful when browsing these things.

> Initial implementation
> --
>
> Key: SPARK-11469
> URL: https://issues.apache.org/jira/browse/SPARK-11469
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 1.6.0
>
>







[jira] [Created] (SPARK-11474) Options to jdbc load are lower cased

2015-11-03 Thread Stephen Samuel (JIRA)
Stephen Samuel created SPARK-11474:
--

 Summary: Options to jdbc load are lower cased
 Key: SPARK-11474
 URL: https://issues.apache.org/jira/browse/SPARK-11474
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 1.5.1
 Environment: Linux & Mac
Reporter: Stephen Samuel
Priority: Minor


We recently upgraded from Spark 1.3.0 to 1.5.1, and one of the features we 
wanted to take advantage of was the fetchSize option added to the JDBC 
DataFrame reader.

In 1.5.1 there appears to be a bug or regression whereby an options map has 
its keys lowercased. This means the existing properties from prior to 1.4, 
such as dbtable, url and driver, are fine, but the newer fetchSize gets 
converted to fetchsize.

To reproduce:

val conf = new SparkConf(true).setMaster("local").setAppName("fetchtest")
val sc = new SparkContext(conf)
val sql = new SQLContext(sc)

val options = Map("url" -> , "driver" -> , "fetchSize" -> )
val df = sql.load("jdbc", options)

Set a breakpoint at line 371 in JDBCRDD and you'll see the options are all 
lowercased, so:
val fetchSize = properties.getProperty("fetchSize", "0").toInt
results in 0.

Now, I know sql.load is deprecated, but this might be occurring on other methods 
too. The workaround is to use the java.util.Properties overload, which keeps 
the case-sensitive keys.
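
A minimal sketch of that workaround, using the {{sql}} context from the snippet above; 
the URL, driver and table name are placeholders, not values from this report:

{code}
import java.util.Properties

// Keys set on a java.util.Properties object keep their case, so "fetchSize"
// reaches the JDBC reader unchanged.
val props = new Properties()
props.setProperty("driver", "org.postgresql.Driver")
props.setProperty("fetchSize", "1000")

// DataFrameReader.jdbc(url, table, properties) is the non-deprecated path in 1.5.x.
val df = sql.read.jdbc("jdbc:postgresql://host:5432/mydb", "mytable", props)
{code}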









[jira] [Updated] (SPARK-11472) SparkContext creation error after sc.stop() when Spark is compiled for Hive

2015-11-03 Thread Pierre Beauvois (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Beauvois updated SPARK-11472:

Description: 
Spark 1.5.1 has been compiled with the following command:

{noformat}
mvn -Pyarn -Phive -Phive-thriftserver -PsparkR -DskipTests -X clean package
{noformat}

After its installation, the file "hive-site.xml" has been added to the conf 
directory (this is not a hard copy, it's a symbolic link). 

When the spark-shell is started, the SparkContext and the sqlContext are 
properly created. Nevertheless, when I stop the SparkContext and then try to 
create a new one, an error appears. The output of this error is the following:
{code:title=SparkContextCreationError.scala|borderStyle=solid}
// imports
scala> import org.apache.spark.SparkConf
import org.apache.spark.SparkConf

scala> import org.apache.spark.SparkContext
import org.apache.spark.SparkContext

scala> import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext._

// simple SparkContext creation
scala> val sc = new SparkContext(new SparkConf())
// output error stack
15/11/03 09:10:05 ERROR Hive: MetaException(message:Delegation Token can be 
issued only with kerberos authentication. Current AuthenticationMethod: TOKEN)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_delegation_token_result$get_delegation_token_resultStandardScheme.read(ThriftHiveMetastore.java)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_delegation_token_result$get_delegation_token_resultStandardScheme.read(ThriftHiveMetastore.java)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_delegation_token_result.read(ThriftHiveMetastore.java)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_delegation_token(ThriftHiveMetastore.java:3715)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_delegation_token(ThriftHiveMetastore.java:3701)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDelegationToken(HiveMetaStoreClient.java:1796)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
at com.sun.proxy.$Proxy19.getDelegationToken(Unknown Source)
at 
org.apache.hadoop.hive.ql.metadata.Hive.getDelegationToken(Hive.java:3150)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.yarn.Client$.org$apache$spark$deploy$yarn$Client$$obtainTokenForHiveMetastore(Client.scala:1260)
at 
org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:271)
at 
org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:629)
at 
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
at org.apache.spark.SparkContext.(SparkContext.scala:523)
at 
$line23.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:24)
at 
$line23.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:29)
at 
$line23.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:31)
at $line23.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
at $line23.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)
at $line23.$read$$iwC$$iwC$$iwC$$iwC$$iwC.(:37)
at $line23.$read$$iwC$$iwC$$iwC$$iwC.(:39)
at $line23.$read$$iwC$$iwC$$iwC.(:41)
at $line23.$read$$iwC$$iwC.(:43)
at $line23.$read$$iwC.(:45)
at $line23.$read.(:47)
at $line23.$read$.(:51)
at $line23.$read$.()
at $line23.$eval$.(:7)
at $line23.$eval$.()
at $line23.$eval.$print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 

[jira] [Created] (SPARK-11472) SparkContext creation error after sc.stop() when Spark is compiled for Hive

2015-11-03 Thread Pierre Beauvois (JIRA)
Pierre Beauvois created SPARK-11472:
---

 Summary: SparkContext creation error after sc.stop() when Spark is 
compiled for Hive
 Key: SPARK-11472
 URL: https://issues.apache.org/jira/browse/SPARK-11472
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.5.1
 Environment: Red Hat ES 6.7 x86_64
Spark 1.5.1, Scala 2.10.4, Java 1.7.0_85, Hive 1.2.1
Authentication done through Kerberos
Reporter: Pierre Beauvois
Priority: Critical


Spark 1.5.1 has been compiled with the following command:

mvn -Pyarn -Phive -Phive-thriftserver -PsparkR -DskipTests -X clean package

After its installation, the file "hive-site.xml" has been added to the conf 
directory (this is not a hard copy, it's a symbolic link). 

When the spark-shell is started, the SparkContext and the sqlContext are 
properly created. Nevertheless, when I stop the SparkContext and then try to 
create a new one, an error appears. The output of this error is the following:

scala> import org.apache.spark.SparkConf
import org.apache.spark.SparkConf

scala> import org.apache.spark.SparkContext
import org.apache.spark.SparkContext

scala> import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext._

scala> val sc = new SparkContext(new SparkConf())
15/11/03 09:10:05 ERROR Hive: MetaException(message:Delegation Token can be 
issued only with kerberos authentication. Current AuthenticationMethod: TOKEN)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_delegation_token_result$get_delegation_token_resultStandardScheme.read(ThriftHiveMetastore.java)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_delegation_token_result$get_delegation_token_resultStandardScheme.read(ThriftHiveMetastore.java)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_delegation_token_result.read(ThriftHiveMetastore.java)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_delegation_token(ThriftHiveMetastore.java:3715)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_delegation_token(ThriftHiveMetastore.java:3701)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDelegationToken(HiveMetaStoreClient.java:1796)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
at com.sun.proxy.$Proxy19.getDelegationToken(Unknown Source)
at 
org.apache.hadoop.hive.ql.metadata.Hive.getDelegationToken(Hive.java:3150)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.yarn.Client$.org$apache$spark$deploy$yarn$Client$$obtainTokenForHiveMetastore(Client.scala:1260)
at 
org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:271)
at 
org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:629)
at 
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
at org.apache.spark.SparkContext.(SparkContext.scala:523)
at 
$line23.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:24)
at 
$line23.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:29)
at 
$line23.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:31)
at $line23.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
at $line23.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)
at $line23.$read$$iwC$$iwC$$iwC$$iwC$$iwC.(:37)
at $line23.$read$$iwC$$iwC$$iwC$$iwC.(:39)
at $line23.$read$$iwC$$iwC$$iwC.(:41)
at $line23.$read$$iwC$$iwC.(:43)
at $line23.$read$$iwC.(:45)
at $line23.$read.(:47)
at $line23.$read$.(:51)
at $line23.$read$.()
at $line23.$eval$.(:7)
at $line23.$eval$.()
at $line23.$eval.$print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 

[jira] [Resolved] (SPARK-9859) Aggregation: Determine the number of reducers at runtime

2015-11-03 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-9859.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9276
[https://github.com/apache/spark/pull/9276]

> Aggregation: Determine the number of reducers at runtime
> 
>
> Key: SPARK-9859
> URL: https://issues.apache.org/jira/browse/SPARK-9859
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 1.6.0
>
>







[jira] [Resolved] (SPARK-9858) Introduce an ExchangeCoordinator to estimate the number of post-shuffle partitions.

2015-11-03 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-9858.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9276
[https://github.com/apache/spark/pull/9276]

> Introduce an ExchangeCoordinator to estimate the number of post-shuffle 
> partitions.
> ---
>
> Key: SPARK-9858
> URL: https://issues.apache.org/jira/browse/SPARK-9858
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 1.6.0
>
>







[jira] [Resolved] (SPARK-9861) Join: Determine the number of reducers used by a shuffle join operator at runtime

2015-11-03 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-9861.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9276
[https://github.com/apache/spark/pull/9276]

> Join: Determine the number of reducers used by a shuffle join operator at 
> runtime
> -
>
> Key: SPARK-9861
> URL: https://issues.apache.org/jira/browse/SPARK-9861
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 1.6.0
>
>







[jira] [Closed] (SPARK-11346) Spark EventLog for completed applications

2015-11-03 Thread Milan Brna (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Milan Brna closed SPARK-11346.
--
Resolution: Fixed

> Spark EventLog for completed applications
> -
>
> Key: SPARK-11346
> URL: https://issues.apache.org/jira/browse/SPARK-11346
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: Centos 6.7
>Reporter: Milan Brna
> Attachments: eventLogTest.scala
>
>
> Environment description: Spark 1.5.1 built the following way:
> ./dev/change-scala-version.sh 2.11
> export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
> ./make-distribution.sh --name custom-spark --tgz -Phadoop-2.6 -Pyarn 
> -Dscala-2.11 -Phive -Phive-thriftserver
> 4 node standalone cluster (node1-node4)
> Master configuration in spark-defaults.conf:
> spark.eventLog.enabled  true
> spark.eventLog.dir  hdfs://node1:38200/user/spark-events
> The same configuration was created during tests of event logging on all 4 
> nodes.
> The cluster is started from node1 (master) by ./start-all.sh; the thrift server 
> and the history server are additionally started.
> A simple application (see the attached Scala file eventLogTest.scala) is executed 
> from a remote laptop, using the IntelliJ GUI.
> When conf.set("spark.eventLog.enabled","true") and 
> conf.set("spark.eventLog.dir","hdfs://node1:38200/user/spark-events")
> are uncommented, the application's event log directory in 
> hdfs://node1:38200/user/spark-events is created and contains data.
> The history server properly sees and presents the content. Everything is all right so far.
> If both parameters are turned off in the application (commented out in the source), 
> however, no event log directory is ever created for the application.
> I'd expect the parameters spark.eventLog.enabled and spark.eventLog.dir from 
> spark-defaults.conf, which is present on all four nodes, to be sufficient for 
> the application (even a remote one) to create an event log.
> Additionally, I have experimented with the following options on all four nodes in 
> spark-env.sh:
> SPARK_MASTER_OPTS="-Dspark.eventLog.enabled=true 
> -Dspark.eventLog.dir=hdfs://node1:38200/user/spark-events"
> SPARK_WORKER_OPTS="-Dspark.eventLog.enabled=true 
> -Dspark.eventLog.dir=hdfs://node1:38200/user/spark-events"
> SPARK_JAVA_OPTS="-Dspark.eventLog.enabled=true 
> -Dspark.eventLog.dir=hdfs://node1:38200/user/spark-events"
> JAVA_OPTS="-Dspark.eventLog.enabled=true 
> -Dspark.eventLog.dir=hdfs://node1:38200/user/spark-events"
> SPARK_CONF_DIR="/u01/com/app/spark-1.5.1-bin-cdma-spark/conf"
> SPARK_HISTORY_OPTS="-Dspark.eventLog.enabled=true 
> -Dspark.eventLog.dir=hdfs://node1:38200/user/spark-events"
> SPARK_SHUFFLE_OPTS="-Dspark.eventLog.enabled=true 
> -Dspark.eventLog.dir=hdfs://node1:38200/user/spark-events"
> SPARK_DAEMON_JAVA_OPTS="-Dspark.eventLog.enabled=true 
> -Dspark.eventLog.dir=hdfs://node1:38200/user/spark-events"
> and I have even tried to set the following configuration option in the 
> application's SparkContext configuration:
> conf.set("spark.submit.deployMode","cluster")
> but none of these settings caused an event log to appear for the completed 
> application.
> An event log is present for applications started from the cluster servers, i.e. 
> pyspark and the thrift server.
> Question: Is it correct behaviour that executing an application from remote 
> IntelliJ produces no event log unless these options are explicitly specified 
> in the application's Scala code, hence ignoring the settings in the 
> spark-defaults.conf file?
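
As a minimal sketch of the in-code configuration the reporter mentions, and assuming 
(this is an assumption, not a confirmed explanation) that a driver launched straight 
from the IDE does not go through spark-submit and therefore never reads the 
cluster-side spark-defaults.conf:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Event-log settings from the report, set directly on the SparkConf so the
// remote driver does not depend on any spark-defaults.conf file being read.
val conf = new SparkConf()
  .setAppName("eventLogTest")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs://node1:38200/user/spark-events")
val sc = new SparkContext(conf)
{code}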






[jira] [Commented] (SPARK-11472) SparkContext creation error after sc.stop() when Spark is compiled for Hive

2015-11-03 Thread Pierre Beauvois (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986984#comment-14986984
 ] 

Pierre Beauvois commented on SPARK-11472:
-

Hi Sean, thanks for your quick reply.

Sorry to say, but your feedback is losing me. 

I thought there were several ways to initialize a SparkContext: 

* during the shell startup (example below)

{noformat}
spark-shell -v --jars 
/opt/application/Spark/current/elastic/jar/elasticsearch-hadoop-2.1.1.jar 
--name elasticsearch-hadoop --master yarn-client --conf spark.es.net.ssl=true 
--conf spark.es.net.http.auth.user=asterix --conf 
spark.es.net.http.auth.pass=obelix --conf spark.es.nodes=potion.magique --conf 
spark.es.port=9200 --conf spark.es.field.read.empty.as.null=true
{noformat}

You can do something similar with spark-submit. 

==> working with Spark 1.5.1 and with or without the hive-site.xml

* from the command-line (example below)

{noformat}
scala> import org.apache.spark.SparkConf
scala> import org.apache.spark.SparkContext
scala> import org.apache.spark.SparkContext._
scala> sc.stop()
scala> val conf = new 
SparkConf().setAppName("elasticsearch-hadoop").setMaster("yarn-client")
conf.set("es.net.ssl", "true")
conf.set("es.net.http.auth.user","asterix")
conf.set("es.net.http.auth.pass","obelix")
conf.set("es.nodes", "potion.magique")
conf.set("es.port", "9200")
scala> val sc = new SparkContext(conf)
{noformat}

This process is described in the Spark documentation: 
[https://spark.apache.org/docs/latest/programming-guide.html#initializing-spark]

This is also explained here: 
[https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html#spark-native-cfg]
{color:red}
==> working with Spark 1.5.1 only without the hive-site.xml
{color}

* from an external file

{noformat}
spark-shell -v --jars 
/opt/application/Spark/current/elastic/jar/elasticsearch-hadoop-2.1.1.jar -i 
elastic-hadoop.scala
{noformat}

And the .scala file contains the commands used in the second point.

==> working with Spark 1.5.1 and with or without the hive-site.xml

If I'm not intended to stop the context or create a new one, why is the option 
still available? Moreover, why does the Spark documentation explain how to stop/create a 
SparkContext? I'm lost...


> SparkContext creation error after sc.stop() when Spark is compiled for Hive
> ---
>
> Key: SPARK-11472
> URL: https://issues.apache.org/jira/browse/SPARK-11472
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.1
> Environment: Red Hat ES 6.7 x86_64
> Spark 1.5.1, Scala 2.10.4, Java 1.7.0_85, Hive 1.2.1
> Authentication done through Kerberos
>Reporter: Pierre Beauvois
>
> Spark 1.5.1 has been compiled with the following command:
> {noformat}
> mvn -Pyarn -Phive -Phive-thriftserver -PsparkR -DskipTests -X clean package
> {noformat}
> After its installation, the file "hive-site.xml" has been added to the conf 
> directory (this is not a hard copy, it's a symbolic link). 
> When the spark-shell is started, the SparkContext and the sqlContext are 
> properly created. Nevertheless, when I stop the SparkContext and then try to 
> create a new one, an error appears. The output of this error is the following:
> {code:title=SparkContextCreationError.scala|borderStyle=solid}
> // imports
> scala> import org.apache.spark.SparkConf
> import org.apache.spark.SparkConf
> scala> import org.apache.spark.SparkContext
> import org.apache.spark.SparkContext
> scala> import org.apache.spark.SparkContext._
> import org.apache.spark.SparkContext._
> // simple SparkContext creation
> scala> val sc = new SparkContext(new SparkConf())
> // output error stack
> 15/11/03 09:10:05 ERROR Hive: MetaException(message:Delegation Token can be 
> issued only with kerberos authentication. Current AuthenticationMethod: TOKEN)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_delegation_token_result$get_delegation_token_resultStandardScheme.read(ThriftHiveMetastore.java)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_delegation_token_result$get_delegation_token_resultStandardScheme.read(ThriftHiveMetastore.java)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_delegation_token_result.read(ThriftHiveMetastore.java)
> at 
> org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_delegation_token(ThriftHiveMetastore.java:3715)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_delegation_token(ThriftHiveMetastore.java:3701)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDelegationToken(HiveMetaStoreClient.java:1796)
>   

[jira] [Commented] (SPARK-11472) SparkContext creation error after sc.stop() when Spark is compiled for Hive

2015-11-03 Thread Pierre Beauvois (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986985#comment-14986985
 ] 

Pierre Beauvois commented on SPARK-11472:
-

Hi Sean, thanks for your quick reply.

Sorry to say, but your feedback is losing me. 

I thought there were several ways to initialize a SparkContext: 

* during the shell startup (example below)

{noformat}
spark-shell -v --jars 
/opt/application/Spark/current/elastic/jar/elasticsearch-hadoop-2.1.1.jar 
--name elasticsearch-hadoop --master yarn-client --conf spark.es.net.ssl=true 
--conf spark.es.net.http.auth.user=asterix --conf 
spark.es.net.http.auth.pass=obelix --conf spark.es.nodes=potion.magique --conf 
spark.es.port=9200 --conf spark.es.field.read.empty.as.null=true
{noformat}

You can do something similar with spark-submit. 

==> working with Spark 1.5.1 and with or without the hive-site.xml

* from the command-line (example below)

{noformat}
scala> import org.apache.spark.SparkConf
scala> import org.apache.spark.SparkContext
scala> import org.apache.spark.SparkContext._
scala> sc.stop()
scala> val conf = new 
SparkConf().setAppName("elasticsearch-hadoop").setMaster("yarn-client")
conf.set("es.net.ssl", "true")
conf.set("es.net.http.auth.user","asterix")
conf.set("es.net.http.auth.pass","obelix")
conf.set("es.nodes", "potion.magique")
conf.set("es.port", "9200")
scala> val sc = new SparkContext(conf)
{noformat}

This process is described in the Spark documentation: 
[https://spark.apache.org/docs/latest/programming-guide.html#initializing-spark]

This is also explained here: 
[https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html#spark-native-cfg]
{color:red}
==> working with Spark 1.5.1 only without the hive-site.xml
{color}

* from an external file

{noformat}
spark-shell -v --jars 
/opt/application/Spark/current/elastic/jar/elasticsearch-hadoop-2.1.1.jar -i 
elastic-hadoop.scala
{noformat}

And the .scala file contains the commands used in the second point.

==> working with Spark 1.5.1 and with or without the hive-site.xml

If I'm not intended to stop the context or create a new one, why is the option 
still available? Moreover, why does the Spark documentation explain how to stop/create a 
SparkContext? I'm lost...


> SparkContext creation error after sc.stop() when Spark is compiled for Hive
> ---
>
> Key: SPARK-11472
> URL: https://issues.apache.org/jira/browse/SPARK-11472
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.1
> Environment: Red Hat ES 6.7 x86_64
> Spark 1.5.1, Scala 2.10.4, Java 1.7.0_85, Hive 1.2.1
> Authentication done through Kerberos
>Reporter: Pierre Beauvois
>
> Spark 1.5.1 has been compiled with the following command:
> {noformat}
> mvn -Pyarn -Phive -Phive-thriftserver -PsparkR -DskipTests -X clean package
> {noformat}
> After its installation, the file "hive-site.xml" has been added to the conf 
> directory (this is not a hard copy, it's a symbolic link). 
> When the spark-shell is started, the SparkContext and the sqlContext are 
> properly created. Nevertheless, when I stop the SparkContext and then try to 
> create a new one, an error appears. The output of this error is the following:
> {code:title=SparkContextCreationError.scala|borderStyle=solid}
> // imports
> scala> import org.apache.spark.SparkConf
> import org.apache.spark.SparkConf
> scala> import org.apache.spark.SparkContext
> import org.apache.spark.SparkContext
> scala> import org.apache.spark.SparkContext._
> import org.apache.spark.SparkContext._
> // simple SparkContext creation
> scala> val sc = new SparkContext(new SparkConf())
> // output error stack
> 15/11/03 09:10:05 ERROR Hive: MetaException(message:Delegation Token can be 
> issued only with kerberos authentication. Current AuthenticationMethod: TOKEN)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_delegation_token_result$get_delegation_token_resultStandardScheme.read(ThriftHiveMetastore.java)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_delegation_token_result$get_delegation_token_resultStandardScheme.read(ThriftHiveMetastore.java)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_delegation_token_result.read(ThriftHiveMetastore.java)
> at 
> org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_delegation_token(ThriftHiveMetastore.java:3715)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_delegation_token(ThriftHiveMetastore.java:3701)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDelegationToken(HiveMetaStoreClient.java:1796)
>   

[jira] [Issue Comment Deleted] (SPARK-11472) SparkContext creation error after sc.stop() when Spark is compiled for Hive

2015-11-03 Thread Pierre Beauvois (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Beauvois updated SPARK-11472:

Comment: was deleted

(was: Hi Sean, thanks for your quick reply.

Sorry to say, but your feedback is losing me. 

I thought there were several ways to initialize a SparkContext: 

* during the shell startup (example below)

{noformat}
spark-shell -v --jars 
/opt/application/Spark/current/elastic/jar/elasticsearch-hadoop-2.1.1.jar 
--name elasticsearch-hadoop --master yarn-client --conf spark.es.net.ssl=true 
--conf spark.es.net.http.auth.user=asterix --conf 
spark.es.net.http.auth.pass=obelix --conf spark.es.nodes=potion.magique --conf 
spark.es.port=9200 --conf spark.es.field.read.empty.as.null=true
{noformat}

You can do something similar with spark-submit. 

==> working with Spark 1.5.1 and with or without the hive-site.xml

* from the command-line (example below)

{noformat}
scala> import org.apache.spark.SparkConf
scala> import org.apache.spark.SparkContext
scala> import org.apache.spark.SparkContext._
scala> sc.stop()
scala> val conf = new 
SparkConf().setAppName("elasticsearch-hadoop").setMaster("yarn-client")
conf.set("es.net.ssl", "true")
conf.set("es.net.http.auth.user","asterix")
conf.set("es.net.http.auth.pass","obelix")
conf.set("es.nodes", "potion.magique")
conf.set("es.port", "9200")
scala> val sc = new SparkContext(conf)
{noformat}

This process is described in the Spark documentation: 
[https://spark.apache.org/docs/latest/programming-guide.html#initializing-spark]

This is also explained here: 
[https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html#spark-native-cfg]
{color:red}
==> working with Spark 1.5.1 only without the hive-site.xml
{color}

* from an external file

{noformat}
spark-shell -v --jars 
/opt/application/Spark/current/elastic/jar/elasticsearch-hadoop-2.1.1.jar -i 
elastic-hadoop.scala
{noformat}

And the .scala file contains the commands used in the second point.

==> working with Spark 1.5.1 and with or without the hive-site.xml

If I'm not intended to stop the context or create a new one, why is the option 
still available? Moreover, why does the Spark documentation explain how to stop/create a 
SparkContext? I'm lost...
)

> SparkContext creation error after sc.stop() when Spark is compiled for Hive
> ---
>
> Key: SPARK-11472
> URL: https://issues.apache.org/jira/browse/SPARK-11472
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.1
> Environment: Red Hat ES 6.7 x86_64
> Spark 1.5.1, Scala 2.10.4, Java 1.7.0_85, Hive 1.2.1
> Authentication done through Kerberos
>Reporter: Pierre Beauvois
>
> Spark 1.5.1 has been compiled with the following command:
> {noformat}
> mvn -Pyarn -Phive -Phive-thriftserver -PsparkR -DskipTests -X clean package
> {noformat}
> After its installation, the file "hive-site.xml" has been added to the conf 
> directory (this is not a hard copy, it's a symbolic link). 
> When the spark-shell is started, the SparkContext and the sqlContext are 
> properly created. Nevertheless, when I stop the SparkContext and then try to 
> create a new one, an error appears. The output of this error is the following:
> {code:title=SparkContextCreationError.scala|borderStyle=solid}
> // imports
> scala> import org.apache.spark.SparkConf
> import org.apache.spark.SparkConf
> scala> import org.apache.spark.SparkContext
> import org.apache.spark.SparkContext
> scala> import org.apache.spark.SparkContext._
> import org.apache.spark.SparkContext._
> // simple SparkContext creation
> scala> val sc = new SparkContext(new SparkConf())
> // output error stack
> 15/11/03 09:10:05 ERROR Hive: MetaException(message:Delegation Token can be 
> issued only with kerberos authentication. Current AuthenticationMethod: TOKEN)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_delegation_token_result$get_delegation_token_resultStandardScheme.read(ThriftHiveMetastore.java)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_delegation_token_result$get_delegation_token_resultStandardScheme.read(ThriftHiveMetastore.java)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_delegation_token_result.read(ThriftHiveMetastore.java)
> at 
> org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_delegation_token(ThriftHiveMetastore.java:3715)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_delegation_token(ThriftHiveMetastore.java:3701)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDelegationToken(HiveMetaStoreClient.java:1796)
> 

[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning

2015-11-03 Thread Disha Shrivastava (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987412#comment-14987412
 ] 

Disha Shrivastava commented on SPARK-5575:
--

I wanted to know if anyone is working on a distributed implementation of RNN/LSTM 
on Spark. I would love to work on this, and I would like to ask for ideas on how it 
can be done, or for suggestions of papers as a starting point. Also, I wanted to 
know whether Spark would be an ideal platform for a distributed implementation of 
RNN/LSTM.

> Artificial neural networks for MLlib deep learning
> --
>
> Key: SPARK-5575
> URL: https://issues.apache.org/jira/browse/SPARK-5575
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>
> Goal: Implement various types of artificial neural networks
> Motivation: deep learning trend
> Requirements: 
> 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward 
> and Backpropagation etc. should be implemented as traits or interfaces, so 
> they can be easily extended or reused
> 2) Implement complex abstractions, such as feed forward and recurrent networks
> 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
> autoencoder (sparse and denoising), stacked autoencoder, restricted  
> boltzmann machines (RBM), deep belief networks (DBN) etc.
> 4) Implement or reuse supporting constructs, such as classifiers, normalizers, 
> poolers, etc.
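
As an illustration of requirement 1), a minimal sketch of what such trait-based 
abstractions could look like; the names and signatures below are made up for this 
example and are not part of any Spark API:

{code}
import org.apache.spark.mllib.linalg.Vector

// Illustrative only: a layer exposes forward and backward passes so that
// feed-forward and recurrent networks can be composed from the same pieces.
trait Layer {
  def forward(input: Vector): Vector
  def backward(input: Vector, outputGradient: Vector): Vector
}

// Illustrative only: an error (loss) function used during backpropagation.
trait Error {
  def loss(output: Vector, target: Vector): Double
  def gradient(output: Vector, target: Vector): Vector
}
{code}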






[jira] [Resolved] (SPARK-11256) Mark all Stage/ResultStage/ShuffleMapStage internal state as private.

2015-11-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-11256.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

> Mark all Stage/ResultStage/ShuffleMapStage internal state as private.
> -
>
> Key: SPARK-11256
> URL: https://issues.apache.org/jira/browse/SPARK-11256
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.6.0
>
>







[jira] [Updated] (SPARK-2973) Use LocalRelation for all ExecutedCommands, avoid job for take/collect()

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2973:

Target Version/s:   (was: 1.6.0)

> Use LocalRelation for all ExecutedCommands, avoid job for take/collect()
> 
>
> Key: SPARK-2973
> URL: https://issues.apache.org/jira/browse/SPARK-2973
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Aaron Davidson
>Assignee: Cheng Lian
>Priority: Critical
>
> Right now, sql("show tables").collect() will start a Spark job which shows up 
> in the UI. There should be a way to get these without this step.






[jira] [Updated] (SPARK-4131) Support "Writing data into the filesystem from queries"

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4131:

Target Version/s:   (was: 1.6.0)

> Support "Writing data into the filesystem from queries"
> ---
>
> Key: SPARK-4131
> URL: https://issues.apache.org/jira/browse/SPARK-4131
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: XiaoJing wang
>Assignee: Fei Wang
>Priority: Critical
>   Original Estimate: 0.05h
>  Remaining Estimate: 0.05h
>
> Writing data into the filesystem from queries is not supported by Spark SQL.
> e.g.:
> {code}insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * 
> from page_views;
> {code}






[jira] [Updated] (SPARK-9988) Create local (external) sort operator

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-9988:

Target Version/s:   (was: 1.6.0)

> Create local (external) sort operator
> -
>
> Key: SPARK-9988
> URL: https://issues.apache.org/jira/browse/SPARK-9988
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> Similar to the TungstenSort.






[jira] [Updated] (SPARK-9989) Create local sort-merge join operator

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-9989:

Target Version/s:   (was: 1.6.0)

> Create local sort-merge join operator
> -
>
> Key: SPARK-9989
> URL: https://issues.apache.org/jira/browse/SPARK-9989
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The local SortMergeJoinNode can assume that both sides of the input are already 
> sorted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9987) Create local aggregate operator

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-9987:

Target Version/s:   (was: 1.6.0)

> Create local aggregate operator
> ---
>
> Key: SPARK-9987
> URL: https://issues.apache.org/jira/browse/SPARK-9987
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>







[jira] [Updated] (SPARK-9689) Cache doesn't refresh for HadoopFsRelation based table

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-9689:

Target Version/s:   (was: 1.6.0)

> Cache doesn't refresh for HadoopFsRelation based table
> --
>
> Key: SPARK-9689
> URL: https://issues.apache.org/jira/browse/SPARK-9689
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Cheng Hao
>Assignee: Cheng Hao
>
> {code:title=example|borderStyle=solid}
> // create a HadoopFsRelation based table
> sql(s"""
> |CREATE TEMPORARY TABLE jsonTable (a int, b string)
> |USING org.apache.spark.sql.json.DefaultSource
> |OPTIONS (
> |  path '${path.toString}'
> |)""".stripMargin)
>   
> // give the value from table jt
> sql(
>   s"""
>   |INSERT OVERWRITE TABLE jsonTable SELECT a, b FROM jt
> """.stripMargin)
> // cache the HadoopFsRelation Table
> sqlContext.cacheTable("jsonTable")
>
> // update the HadoopFsRelation Table
> sql(
>   s"""
> |INSERT OVERWRITE TABLE jsonTable SELECT a * 2, b FROM jt
>   """.stripMargin)
> // Even this will fail
>  sql("SELECT a, b FROM jsonTable").collect()
> // This will fail, as the cache doesn't refresh
> checkAnswer(
>   sql("SELECT a, b FROM jsonTable"),
>   sql("SELECT a * 2, b FROM jt").collect())
> {code}






[jira] [Commented] (SPARK-11447) Null comparison requires type information but type extraction fails for complex types

2015-11-03 Thread Kapil Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987096#comment-14987096
 ] 

Kapil Singh commented on SPARK-11447:
-

On second look, I seem to have identified the issue. Take a look at lines 
283-286 here:
https://github.com/apache/spark/blob/a01cbf5daac148f39cd97299780f542abc41d1e9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala

If one of the types in a BinaryComparison is StringType and the other is NullType, 
then during analyzed-plan computation this forces DoubleType on the StringType side. 
Later, while enforcing this cast (lines 340-343 in 
https://github.com/apache/spark/blob/a01cbf5daac148f39cd97299780f542abc41d1e9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala),
the conversion to double fails if the string is not actually a number, and a default 
null value is assigned to the result. This manifests as the null comparison returning 
true for all string values. 
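
A minimal sketch of the behaviour described above, against a hypothetical single-column 
DataFrame in the spark shell (the data and column name are made up for the example):

{code}
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.Literal

// Three string rows, one of them null.
val df = sqlContext.createDataFrame(Seq(Tuple1("abc"), Tuple1(null: String), Tuple1("42")))
  .toDF("column")

// Comparing a StringType column with a NullType literal: per the analysis above,
// both sides get cast to DoubleType, non-numeric strings become null after the
// cast, and <=> (null-safe equality) then matches every such row.
df.filter(df("column") <=> new Column(Literal(null))).show()
{code}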

> Null comparison requires type information but type extraction fails for 
> complex types
> -
>
> Key: SPARK-11447
> URL: https://issues.apache.org/jira/browse/SPARK-11447
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Kapil Singh
>
> While comparing a Column to a null literal, the comparison works only if the type 
> of the null literal matches the type of the Column it's being compared to. Example 
> Scala code (can be run from the spark shell):
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.catalyst.expressions._
> val inputRowsData = Seq(Seq("abc"),Seq(null),Seq("xyz"))
> val inputRows = for(seq <- inputRowsData) yield Row.fromSeq(seq)
> val dfSchema = StructType(Seq(StructField("column", StringType, true)))
> val df = sqlContext.createDataFrame(sc.makeRDD(inputRows), dfSchema)
> //DOESN'T WORK
> val filteredDF = df.filter(df("column") <=> (new Column(Literal(null
> //WORKS
> val filteredDF = df.filter(df("column") <=> (new Column(Literal.create(null, 
> SparkleFunctions.dataType(df("column"))
> Why should type information be required for a null comparison? If it's 
> required, it's not always possible to extract type information from complex 
> types (e.g. StructType). The following Scala code (can be run from the spark 
> shell) throws org.apache.spark.sql.catalyst.analysis.UnresolvedException:
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.catalyst.expressions._
> val inputRowsData = Seq(Seq(Row.fromSeq(Seq("abc", 
> "def"))),Seq(Row.fromSeq(Seq(null, "123"))),Seq(Row.fromSeq(Seq("ghi", 
> "jkl"
> val inputRows = for(seq <- inputRowsData) yield Row.fromSeq(seq)
> val dfSchema = StructType(Seq(StructField("column", 
> StructType(Seq(StructField("p1", StringType, true), StructField("p2", 
> StringType, true))), true)))
> val filteredDF = df.filter(df("column")("p1") <=> (new 
> Column(Literal.create(null, SparkleFunctions.dataType(df("column")("p1")))))
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> dataType on unresolved object, tree: column#0[p1]
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedExtractValue.dataType(unresolved.scala:243)
>   at 
> org.apache.spark.sql.ArithmeticFunctions$class.dataType(ArithmeticFunctions.scala:76)
>   at 
> org.apache.spark.sql.SparkleFunctions$.dataType(SparkleFunctions.scala:14)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:45)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:55)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:57)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.(:59)
>   at $iwC$$iwC$$iwC$$iwC.(:61)
>   at $iwC$$iwC$$iwC.(:63)
>   at $iwC$$iwC.(:65)
>   at $iwC.(:67)
>   at (:69)
>   at .(:73)
>   at .()
>   at .(:7)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at 
> 

[jira] [Updated] (SPARK-9879) OOM in LIMIT clause with large number

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-9879:

Target Version/s:   (was: 1.6.0)

> OOM in LIMIT clause with large number
> -
>
> Key: SPARK-9879
> URL: https://issues.apache.org/jira/browse/SPARK-9879
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>
> {code}
> create table spark.tablsetest as select * from dpa_ord_bill_tf order by 
> member_id limit 2000;
> {code}
>  
> {code}
> spark-sql --driver-memory 48g --executor-memory 24g --driver-java-options 
> -XX:PermSize=1024M -XX:MaxPermSize=2048M
> Error logs
> 15/07/27 10:22:43 ERROR ActorSystemImpl: Uncaught fatal error from thread 
> [sparkDriver-akka.actor.default-dispatcher-20]shutting down ActorSystem 
> [sparkDriver]
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> at java.util.Arrays.copyOf(Arrays.java:2271)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
> at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1852)
> at java.io.ObjectOutputStream.write(ObjectOutputStream.java:708)
> at org.apache.spark.util.Utils$$anon$2.write(Utils.scala:134)
> at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
> at com.esotericsoftware.kryo.io.Output.close(Output.java:165)
> at 
> org.apache.spark.serializer.KryoSerializationStream.close(KryoSerializer.scala:162)
> at org.apache.spark.util.Utils$.serializeViaNestedStream(Utils.scala:139)
> at 
> org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$writeObject$1.apply$mcV$sp(ParallelCollectionRDD.scala:65)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1239)
> at 
> org.apache.spark.rdd.ParallelCollectionPartition.writeObject(ParallelCollectionRDD.scala:51)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
> at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
> at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
> at org.apache.spark.scheduler.Task$.serializeWithDependencies(Task.scala:168)
> at 
> org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:467)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet$1.apply$mcVI$sp(TaskSchedulerImpl.scala:231)
> 15/07/27 10:22:43 ERROR ErrorMonitor: Uncaught fatal error from thread 
> [sparkDriver-akka.actor.default-dispatcher-20]shutting down ActorSystem 
> [sparkDriver]
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> at java.util.Arrays.copyOf(Arrays.java:2271)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
> at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1852)
> at java.io.ObjectOutputStream.write(ObjectOutputStream.java:708)
> at org.apache.spark.util.Utils$$anon$2.write(Utils.scala:134)
> at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
> at com.esotericsoftware.kryo.io.Output.close(Output.java:165)
> at 
> org.apache.spark.serializer.KryoSerializationStream.close(KryoSerializer.scala:162)
> at org.apache.spark.util.Utils$.serializeViaNestedStream(Utils.scala:139)
> at 
> org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$writeObject$1.apply$mcV$sp(ParallelCollectionRDD.scala:65)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1239)
> at 
> 

[jira] [Updated] (SPARK-7549) Support aggregating over nested fields

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-7549:

Target Version/s:   (was: 1.6.0)

> Support aggregating over nested fields
> --
>
> Key: SPARK-7549
> URL: https://issues.apache.org/jira/browse/SPARK-7549
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> Would be nice to be able to run sum, avg, min, max (and other numeric 
> aggregate expressions) on arrays.
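
Until that exists, a hedged sketch of the usual workaround (the DataFrame {{df}} 
and column name "scores" are illustrative): explode the array, then aggregate 
the exploded column.

{code}
import org.apache.spark.sql.functions.{avg, explode, sum}

// Flatten the array column into one row per element, then aggregate.
val exploded = df.select(explode(df("scores")).as("score"))
exploded.agg(sum("score"), avg("score")).show()
{code}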



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9697) Project Tungsten (Spark 1.6)

2015-11-03 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987094#comment-14987094
 ] 

Michael Armbrust commented on SPARK-9697:
-

[~rxin] can you update this now that we are past code freeze for 1.6?

> Project Tungsten (Spark 1.6)
> 
>
> Key: SPARK-9697
> URL: https://issues.apache.org/jira/browse/SPARK-9697
> Project: Spark
>  Issue Type: Epic
>  Components: Block Manager, Shuffle, Spark Core, SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> This epic tracks the 2nd phase of Project Tungsten, slotted for Spark 1.6 
> release.
> This epic tracks work items for Spark 1.6. More tickets can be found in:
> SPARK-7075: Tungsten-related work in Spark 1.5
> SPARK-9697: Tungsten-related work in Spark 1.6



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9860) Join: Determine the join strategy (broadcast join or shuffle join) at runtime

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-9860:

Target Version/s:   (was: 1.6.0)

> Join: Determine the join strategy (broadcast join or shuffle join) at runtime
> -
>
> Key: SPARK-9860
> URL: https://issues.apache.org/jira/browse/SPARK-9860
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11460) Locality waits should be based on task set creation time, not last launch time

2015-11-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11460:


Assignee: Apache Spark

> Locality waits should be based on task set creation time, not last launch time
> --
>
> Key: SPARK-11460
> URL: https://issues.apache.org/jira/browse/SPARK-11460
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.2.2, 
> 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1
> Environment: YARN
>Reporter: Shengyue Ji
>Assignee: Apache Spark
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Spark waits for the spark.locality.wait period before going from RACK_LOCAL to 
> ANY when selecting an executor for assignment. The timeout is essentially 
> reset each time a new assignment is made.
> We were running Spark Streaming on Kafka with a 10 second batch window on 32 
> Kafka partitions with 16 executors. All executors were in the ANY group. At 
> one point one RACK_LOCAL executor was added and all tasks were assigned to 
> it. Each task took about 0.6 seconds to process, resetting the 
> spark.locality.wait timeout (3000ms) repeatedly. This caused the whole 
> process to underutilize resources and created an increasing backlog.
> spark.locality.wait should be based on the task set creation time, not the 
> last launch time, so that 3000ms after initial creation all executors can get 
> tasks assigned to them.
> We are specifying a zero timeout for now as a workaround to disable locality 
> optimization. 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L556
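
A minimal sketch of that zero-timeout workaround (spark.locality.wait is the 
standard property; the surrounding setup is illustrative):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Disable locality-wait delays so tasks fall through to ANY immediately
// instead of waiting on the (repeatedly reset) 3000ms timer.
val conf = new SparkConf()
  .setAppName("kafka-streaming-job")   // illustrative app name
  .set("spark.locality.wait", "0")
val sc = new SparkContext(conf)
{code}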



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11344) ApplicationDescription should be immutable case class

2015-11-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11344:
--
Assignee: Jacek Lewandowski

> ApplicationDescription should be immutable case class
> -
>
> Key: SPARK-11344
> URL: https://issues.apache.org/jira/browse/SPARK-11344
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Affects Versions: 1.4.1, 1.5.1
>Reporter: Jacek Lewandowski
>Assignee: Jacek Lewandowski
>Priority: Minor
> Fix For: 1.6.0
>
>
> {{ApplicationDescription}} should be a case class. Currently it is not 
> immutable because it has one {{var}} field. This is something which has to be 
> refactored because it causes confusion and bugs - for example, SPARK-1706 
> introduced an additional {{val}} to {{ApplicationDescription}} but it 
> was missed in the {{copy}} method. 
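
A small, purely illustrative sketch of the kind of bug this invites (class and 
field names here are hypothetical, not Spark's actual code):

{code}
// Hand-rolled class: copy() was written before `newField` existed, so it
// silently drops the value whenever the object is copied.
class AppDesc(val name: String, var memoryPerSlave: Int, val newField: String) {
  def copy(name: String = this.name,
           memoryPerSlave: Int = this.memoryPerSlave): AppDesc =
    new AppDesc(name, memoryPerSlave, "")   // newField lost here
}

// Case class: copy() is generated and always stays in sync with the fields.
case class AppDescription(name: String, memoryPerSlave: Int, newField: String)
{code}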



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11344) ApplicationDescription should be immutable case class

2015-11-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11344.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9299
[https://github.com/apache/spark/pull/9299]

> ApplicationDescription should be immutable case class
> -
>
> Key: SPARK-11344
> URL: https://issues.apache.org/jira/browse/SPARK-11344
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Affects Versions: 1.4.1, 1.5.1
>Reporter: Jacek Lewandowski
>Priority: Minor
> Fix For: 1.6.0
>
>
> {{ApplicationDescription}} should be a case class. Currently it is not 
> immutable because it has one {{var}} field. This is something which has to be 
> refactored because it causes confusion and bugs - for example, SPARK-1706 
> introduced an additional {{val}} to {{ApplicationDescription}} but it 
> was missed in the {{copy}} method. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8122) ParquetRelation.enableLogForwarding() may fail to configure loggers

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-8122:

Issue Type: Bug  (was: Sub-task)
Parent: (was: SPARK-5463)

> ParquetRelation.enableLogForwarding() may fail to configure loggers
> ---
>
> Key: SPARK-8122
> URL: https://issues.apache.org/jira/browse/SPARK-8122
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Konstantin Shaposhnikov
>Priority: Minor
>
> _enableLogForwarding()_ doesn't keep references to the loggers it creates, so 
> they can be garbage collected and all configuration changes will be lost. From 
> the 
> https://docs.oracle.com/javase/6/docs/api/java/util/logging/Logger.html 
> javadocs:  _It is important to note that the Logger returned by one of the 
> getLogger factory methods may be garbage collected at any time if a strong 
> reference to the Logger is not kept._
> References to all created loggers need to be kept, e.g. in static variables.
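
A minimal sketch of the fix the last sentence suggests (object and logger 
names are illustrative, not Spark's actual code):

{code}
import java.util.logging.{Level, Logger}

object ParquetLogHolder {
  // Held in a static (object) field for the lifetime of the JVM, so the
  // logger cannot be garbage collected and the configuration sticks.
  val parquetLogger: Logger = Logger.getLogger("parquet")
  parquetLogger.setLevel(Level.WARNING)
}
{code}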



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5463) Improve Parquet support (reliability, performance, and error messages)

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5463.
-
Resolution: Fixed

> Improve Parquet support (reliability, performance, and error messages)
> --
>
> Key: SPARK-5463
> URL: https://issues.apache.org/jira/browse/SPARK-5463
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0, 1.2.1, 1.2.2, 1.3.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Blocker
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5463) Improve Parquet support (reliability, performance, and error messages)

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-5463:

Assignee: Cheng Lian

> Improve Parquet support (reliability, performance, and error messages)
> --
>
> Key: SPARK-5463
> URL: https://issues.apache.org/jira/browse/SPARK-5463
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0, 1.2.1, 1.2.2, 1.3.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Blocker
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9995) Create local Python evaluation operator

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-9995:

Target Version/s:   (was: 1.6.0)

> Create local Python evaluation operator
> ---
>
> Key: SPARK-9995
> URL: https://issues.apache.org/jira/browse/SPARK-9995
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> Similar to BatchPythonEvaluation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8964) Use Exchange in limit operations (per partition limit -> exchange to one partition -> per partition limit)

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-8964:

Target Version/s:   (was: 1.6.0)

> Use Exchange in limit operations (per partition limit -> exchange to one 
> partition -> per partition limit)
> --
>
> Key: SPARK-8964
> URL: https://issues.apache.org/jira/browse/SPARK-8964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Josh Rosen
>
> Spark SQL's physical Limit operator currently performs its own shuffle rather 
> than using Exchange to perform the shuffling.  This is less efficient since 
> this non-exchange shuffle path won't be able to benefit from SQL-specific 
> shuffling optimizations, such as SQLSerializer2.  It also involves additional 
> unnecessary row copying.
> Instead, I think that we should rewrite Limit to expand into three physical 
> operators:
> PerPartitionLimit -> Exchange to one partition -> PerPartitionLimit
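
A hedged RDD-level sketch of those three steps (illustrative only; the actual 
proposal is about Spark SQL physical operators, not this helper function):

{code}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def boundedLimit[T: ClassTag](rdd: RDD[T], n: Int): RDD[T] =
  rdd.mapPartitions(_.take(n))   // per-partition limit
     .repartition(1)             // "exchange" everything to one partition
     .mapPartitions(_.take(n))   // final limit on the single partition
{code}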



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8745) Remove GenerateProjection

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-8745:

Target Version/s:   (was: 1.6.0)

> Remove GenerateProjection
> -
>
> Key: SPARK-8745
> URL: https://issues.apache.org/jira/browse/SPARK-8745
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> Based on discussion offline with [~marmbrus], we should remove 
> GenerateProjection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11196) Support for equality and pushdown of filters on some UDTs

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11196:
-
Target Version/s:   (was: 1.6.0)

> Support for equality and pushdown of filters on some UDTs
> -
>
> Key: SPARK-11196
> URL: https://issues.apache.org/jira/browse/SPARK-11196
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>
> Today if you try to do any comparisons with UDTs it fails due to bad 
> casting.  However, in some cases the UDT is just a thin wrapper around a SQL 
> type (StringType for example).  In these cases we could just convert the UDT 
> to its SQL type.
> Rough prototype: 
> https://github.com/apache/spark/compare/apache:master...marmbrus:uuid-udt



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10818) Query optimization: investigate whether we need a separate optimizer from Spark SQL's

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-10818:
-
Target Version/s:   (was: 1.6.0)

> Query optimization: investigate whether we need a separate optimizer from 
> Spark SQL's
> -
>
> Key: SPARK-10818
> URL: https://issues.apache.org/jira/browse/SPARK-10818
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> It would be great if we could just reuse Spark SQL's query optimizer. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7970) Optimize code for SQL queries fired on Union of RDDs (closure cleaner)

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-7970:

Target Version/s:   (was: 1.6.0)

> Optimize code for SQL queries fired on Union of RDDs (closure cleaner)
> --
>
> Key: SPARK-7970
> URL: https://issues.apache.org/jira/browse/SPARK-7970
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Nitin Goyal
>Assignee: Nitin Goyal
> Attachments: Screen Shot 2015-05-27 at 11.01.03 pm.png, Screen Shot 
> 2015-05-27 at 11.07.02 pm.png
>
>
> Closure cleaner slows down the execution of Spark SQL queries fired on a union 
> of RDDs. The time increases linearly on the driver side with the number of 
> RDDs unioned. Refer to the following thread for more context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html
> As can be seen in the attached JProfiler screenshots, a lot of time is 
> consumed in the "getClassReader" method of ClosureCleaner and the rest in 
> "ensureSerializable" (at least in my case).
> This can be fixed in two ways (as per my current understanding):
> 1. Fix at Spark SQL level - as pointed out by yhuai, we can create 
> MapPartitionsRDD directly instead of doing rdd.mapPartitions, which calls 
> ClosureCleaner's clean method (see PR - 
> https://github.com/apache/spark/pull/6256).
> 2. Fix at Spark core level -
>   (i) Make "checkSerializable" property-driven in SparkContext's clean method
>   (ii) Somehow cache the ClassReader for the last 'n' classes



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7903) PythonUDT shouldn't get serialized on the Scala side

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-7903:

Target Version/s:   (was: 1.6.0)

> PythonUDT shouldn't get serialized on the Scala side
> 
>
> Key: SPARK-7903
> URL: https://issues.apache.org/jira/browse/SPARK-7903
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> A round trip for a pure Python UDT should be: Python UDT -> Python SQL 
> internal types -> Scala/Java SQL internal types -> transformation -> 
> Scala/Java SQL internal types -> Python SQL internal types -> Python UDT. So 
> the serialization shouldn't be invoked on the Scala side if no Scala code is 
> applied to the UDT.
> Code (from [~rams]) to reproduce this bug:
> {code}
> from pyspark.mllib.linalg import SparseVector
> from pyspark.sql.functions import udf
> from pyspark.sql.types import IntegerType
> df = sqlContext.createDataFrame([(SparseVector(2, {0: 0.0}),)], ["features"])
> sz = udf(lambda s: s.size, IntegerType())
> df.select(sz(df.features).alias("sz")).collect()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7245) Spearman correlation for DataFrames

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-7245:

Target Version/s:   (was: 1.6.0)

> Spearman correlation for DataFrames
> ---
>
> Key: SPARK-7245
> URL: https://issues.apache.org/jira/browse/SPARK-7245
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Xiangrui Meng
>
> Spearman correlation is harder than Pearson to compute.
> ~~~
> df.stat.corr(col1, col2, method="spearman"): Double
> ~~~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10815) API design: data sources and sinks

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-10815:
-
Target Version/s:   (was: 1.6.0)

> API design: data sources and sinks
> --
>
> Key: SPARK-10815
> URL: https://issues.apache.org/jira/browse/SPARK-10815
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8144) For PySpark SQL, automatically convert values provided in readwriter options to string

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-8144:

Target Version/s:   (was: 1.6.0)

> For PySpark SQL, automatically convert values provided in readwriter options 
> to string
> --
>
> Key: SPARK-8144
> URL: https://issues.apache.org/jira/browse/SPARK-8144
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>
> Because of typos in lines 81 and 240 of:
> [https://github.com/apache/spark/blob/16fc49617e1dfcbe9122b224f7f63b7bfddb36ce/python/pyspark/sql/readwriter.py]
> (Search for "option(")
> CC: [~yhuai] [~davies]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8108) Build Hive module by default (i.e. remove -Phive profile)

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-8108:

Target Version/s: 2+  (was: 1.6.0)

> Build Hive module by default (i.e. remove -Phive profile)
> -
>
> Key: SPARK-8108
> URL: https://issues.apache.org/jira/browse/SPARK-8108
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Reporter: Reynold Xin
>
> I think this is blocked by a jline conflict between Scala 2.11 and Hive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9372) For a join operator, rows with null equal join key expression can be filtered out early

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-9372:

Target Version/s:   (was: 1.6.0)

> For a join operator, rows with null equal join key expression can be filtered 
> out early
> ---
>
> Key: SPARK-9372
> URL: https://issues.apache.org/jira/browse/SPARK-9372
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> Taking {{select ... from A join B on (A.key = B.key)}} as an example, we can 
> filter out rows that have null values for column A.key/B.key because those 
> rows do not contribute to the output.
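
A hedged DataFrame-level sketch of the same idea (dfA/dfB and the column name 
"key" are illustrative): add IS NOT NULL filters on the equi-join keys before 
the inner join, since null keys can never match.

{code}
val pruned = dfA.filter(dfA("key").isNotNull)
  .join(dfB.filter(dfB("key").isNotNull), dfA("key") === dfB("key"))
{code}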



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data

2015-11-03 Thread melvin mendoza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987190#comment-14987190
 ] 

melvin mendoza edited comment on SPARK-1867 at 11/3/15 12:33 PM:
-

[~srowen] having problem with spark

java.lang.IllegalStateException: unread block data
at 
java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Code snippet: 
  def main(args: Array[String]) {
val LOG = Logger.getLogger(this.getClass().getName() + "Testing")
LOG.info("SAMPLE START")
LOG.info("Testing")

try {
  val conf = new SparkConf()
  val sc = new SparkContext(conf)
  val phoenixSpark = sc.phoenixTableAsRDD(
"SAMPLE_TABLE",
Seq("ID", "NAME"),
zkUrl = Some("r3r31gateway.clustered.com:2181:/hbase-unsecure"))

  val name = phoenixSpark.map(f => f.toString())
  val sample = phoenixSpark.map(f => (f.get("ID") + "," + f.get("NAME")))

  sample.foreach(println)
  LOG.info("SAMPLE TABLE: " + name.toString())
  sc.stop()

} catch {
  case e: Exception => {
e.printStackTrace()
val msg = e.getMessage
LOG.error("Phoenix Testing failure: errorMsg: " + msg)
  }
}
  }

I'm using HDP 2.2


was (Author: mamendoza):
[~srowen] having problem with spark

java.lang.IllegalStateException: unread block data
at 
java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Code snippet: 
  def main(args: Array[String]) {
val LOG = Logger.getLogger(this.getClass().getName() + "Testing")
LOG.info("SAMPLE START")
LOG.info("Testing")

try {
  val conf = new SparkConf()
  val sc = new SparkContext(conf)
  val phoenixSpark = sc.phoenixTableAsRDD(
"SAMPLE_TABLE",
Seq("ID", "NAME"),
zkUrl = Some("r3r31gateway.clustered.com:2181:/hbase-unsecure"))

  val name = phoenixSpark.map(f => f.toString())
  val sample = phoenixSpark.map(f => (f.get("ID") + "," + f.get("NAME")))

  sample.foreach(println)
  LOG.info("SAMPLE TABLE: " + name.toString())
  sc.stop()

} catch {
  case e: Exception => {
e.printStackTrace()
val msg = e.getMessage
LOG.error("Phoenix Testing failure: errorMsg: " + msg)
  }
}
  }


> Spark Documentation Error causes java.lang.IllegalStateException: unread 
> block data
> ---
>
> Key: SPARK-1867
> URL: https://issues.apache.org/jira/browse/SPARK-1867
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: sam
>
> I've employed two System Administrators on a contract basis (for quite a bit 
> of money), and both contractors have independently hit the following 
> exception.  What we are doing is:
> 1. 

[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data

2015-11-03 Thread melvin mendoza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987190#comment-14987190
 ] 

melvin mendoza commented on SPARK-1867:
---

[~srowen] having problem with spark

java.lang.IllegalStateException: unread block data
at 
java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Code snippet: 
  def main(args: Array[String]) {
val LOG = Logger.getLogger(this.getClass().getName() + "Testing")
LOG.info("SAMPLE START")
LOG.info("Testing")

try {
  val conf = new SparkConf()
  val sc = new SparkContext(conf)
  val phoenixSpark = sc.phoenixTableAsRDD(
"SAMPLE_TABLE",
Seq("ID", "NAME"),
zkUrl = Some("r3r31gateway.clustered.com:2181:/hbase-unsecure"))

  val name = phoenixSpark.map(f => f.toString())
  val sample = phoenixSpark.map(f => (f.get("ID") + "," + f.get("NAME")))

  sample.foreach(println)
  LOG.info("SAMPLE TABLE: " + name.toString())
  sc.stop()

} catch {
  case e: Exception => {
e.printStackTrace()
val msg = e.getMessage
LOG.error("Phoenix Testing failure: errorMsg: " + msg)
  }
}
  }


> Spark Documentation Error causes java.lang.IllegalStateException: unread 
> block data
> ---
>
> Key: SPARK-1867
> URL: https://issues.apache.org/jira/browse/SPARK-1867
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: sam
>
> I've employed two System Administrators on a contract basis (for quite a bit 
> of money), and both contractors have independently hit the following 
> exception.  What we are doing is:
> 1. Installing Spark 0.9.1 according to the documentation on the website, 
> along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs.
> 2. Building a fat jar with a Spark app with sbt then trying to run it on the 
> cluster
> I've also included code snippets, and sbt deps at the bottom.
> When I've Googled this, there seems to be two somewhat vague responses:
> a) Mismatching spark versions on nodes/user code
> b) Need to add more jars to the SparkConf
> Now I know that (b) is not the problem having successfully run the same code 
> on other clusters while only including one jar (it's a fat jar).
> But I have no idea how to check for (a) - it appears Spark doesn't have any 
> version checks or anything - it would be nice if it checked versions and 
> threw a "mismatching version exception: you have user code using version X 
> and node Y has version Z".
> I would be very grateful for advice on this.
> The exception:
> Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task 
> 0.0:1 failed 32 times (most recent failure: Exception failure: 
> java.lang.IllegalStateException: unread block data)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
>   at 
> 

[jira] [Updated] (SPARK-10297) When save data to a data source table, we should bound the size of a saved file

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-10297:
-
Target Version/s:   (was: 1.6.0)

> When save data to a data source table, we should bound the size of a saved 
> file
> ---
>
> Key: SPARK-10297
> URL: https://issues.apache.org/jira/browse/SPARK-10297
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> When we save data to a data source table, it is possible that a writer is 
> responsible for writing out a large number of rows, which can make the 
> generated file very large and cause the job to fail if the underlying storage 
> system has a max file size limit (e.g. S3's limit is 5GB). We should bound 
> the size of a file generated by a writer and create new writers for the same 
> partition if necessary. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9182) filter and groupBy on DataFrames are not passed through to jdbc source

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-9182:

Target Version/s:   (was: 1.6.0)

> filter and groupBy on DataFrames are not passed through to jdbc source
> --
>
> Key: SPARK-9182
> URL: https://issues.apache.org/jira/browse/SPARK-9182
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Greg Rahn
>Assignee: Yijie Shen
>Priority: Critical
>
> When running all of these API calls, the only one that passes the filter 
> through to the backend jdbc source is equality.  All filters in these 
> commands should be able to be passed through to the jdbc database source.
> {code}
> val url="jdbc:postgresql:grahn"
> val prop = new java.util.Properties
> val emp = sqlContext.read.jdbc(url, "emp", prop)
> emp.filter(emp("sal") === 5000).show()
> emp.filter(emp("sal") < 5000).show()
> emp.filter("sal = 3000").show()
> emp.filter("sal > 2500").show()
> emp.filter("sal >= 2500").show()
> emp.filter("sal < 2500").show()
> emp.filter("sal <= 2500").show()
> emp.filter("sal != 3000").show()
> emp.filter("sal between 3000 and 5000").show()
> emp.filter("ename in ('SCOTT','BLAKE')").show()
> {code}
> We see from the PostgreSQL query log the following is run, and see that only 
> equality predicates are passed through.
> {code}
> LOG:  execute : SET extra_float_digits = 3
> LOG:  execute : SELECT 
> "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp WHERE 
> sal = 5000
> LOG:  execute : SET extra_float_digits = 3
> LOG:  execute : SELECT 
> "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp
> LOG:  execute : SET extra_float_digits = 3
> LOG:  execute : SELECT 
> "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp WHERE 
> sal = 3000
> LOG:  execute : SET extra_float_digits = 3
> LOG:  execute : SELECT 
> "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp
> LOG:  execute : SET extra_float_digits = 3
> LOG:  execute : SELECT 
> "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp
> LOG:  execute : SET extra_float_digits = 3
> LOG:  execute : SELECT 
> "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp
> LOG:  execute : SET extra_float_digits = 3
> LOG:  execute : SELECT 
> "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp
> LOG:  execute : SET extra_float_digits = 3
> LOG:  execute : SELECT 
> "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp
> LOG:  execute : SET extra_float_digits = 3
> LOG:  execute : SELECT 
> "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp
> LOG:  execute : SET extra_float_digits = 3
> LOG:  execute : SELECT 
> "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10146) Have an easy way to set data source reader/writer specific confs

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-10146:
-
Target Version/s:   (was: 1.6.0)

> Have an easy way to set data source reader/writer specific confs
> 
>
> Key: SPARK-10146
> URL: https://issues.apache.org/jira/browse/SPARK-10146
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> Right now, it is hard to set data source reader/writer specific confs 
> correctly (e.g. parquet's row group size). Users need to set those confs in 
> the hadoop conf before starting the application or through 
> {{org.apache.spark.deploy.SparkHadoopUtil.get.conf}} at runtime. It would be 
> great if we had an easy way to set those confs.
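
A minimal sketch of the runtime path mentioned above (parquet.block.size is the 
standard Parquet row group size property; the value shown is illustrative):

{code}
import org.apache.spark.deploy.SparkHadoopUtil

// Adjust the Parquet row group size for subsequent writes in this application.
SparkHadoopUtil.get.conf.setInt("parquet.block.size", 64 * 1024 * 1024)
{code}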



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9139) Add backwards-compatibility tests for DataType.fromJson()

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-9139:

Target Version/s:   (was: 1.6.0)

> Add backwards-compatibility tests for DataType.fromJson()
> -
>
> Key: SPARK-9139
> URL: https://issues.apache.org/jira/browse/SPARK-9139
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Josh Rosen
>Priority: Critical
>
> SQL's DataType.fromJson is a public API and thus must be 
> backwards-compatible; there are also backwards-compatibility concerns related 
> to persistence of DataType JSON in metastores.
> Unfortunately, we do not have any backwards-compatibility tests which attempt 
> to read old JSON values that were written by earlier versions of Spark.  
> DataTypeSuite has "roundtrip" tests that test fromJson(toJson(foo)), but this 
> doesn't ensure compatibility.
> I think that we should address this by capturing the JSON strings produced in 
> Spark 1.3's DataFrameSuite and adding test cases that try to create DataTypes 
> from those strings.
> This might be a good starter task for someone who wants to contribute to SQL 
> tests.
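
A hedged sketch of what such a test could look like (the JSON literal below 
follows the documented DataType JSON layout but is written here by hand, not 
captured from an actual Spark 1.3 run):

{code}
import org.apache.spark.sql.types._

// JSON as an older release would have produced it; the assertion checks that
// fromJson still reconstructs the expected schema.
val legacyJson =
  """{"type":"struct","fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]}"""
assert(DataType.fromJson(legacyJson) ==
  StructType(Seq(StructField("a", IntegerType, nullable = true))))
{code}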



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10343) Consider nullability of expression in codegen

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-10343:
-
Target Version/s:   (was: 1.6.0)

> Consider nullability of expression in codegen
> -
>
> Key: SPARK-10343
> URL: https://issues.apache.org/jira/browse/SPARK-10343
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Priority: Critical
>
> In codegen, we don't currently consider the nullability of expressions. Once 
> we take it into account, we can avoid lots of null checks (reducing the size 
> of the generated code and also improving performance).
> Before that, we should double-check the correctness of the nullability of all 
> expressions and schemas, or we will hit NPEs or wrong results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9431) TimeIntervalType for time intervals

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-9431:

Target Version/s:   (was: 1.6.0)

> TimeIntervalType for time intervals
> ---
>
> Key: SPARK-9431
> URL: https://issues.apache.org/jira/browse/SPARK-9431
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> Related to the existing CalendarIntervalType, TimeIntervalType internally has 
> only one component: the number of microseconds, represented as a long.
> TimeIntervalType can be used in equality tests and ordering.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9410) Better Multi-User Session Semantics for SQL Context

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-9410.
-
   Resolution: Fixed
 Assignee: Davies Liu
Fix Version/s: 1.6.0

> Better Multi-User Session Semantics for SQL Context
> ---
>
> Key: SPARK-9410
> URL: https://issues.apache.org/jira/browse/SPARK-9410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.6.0
>
>
> SQLContext defines methods to attach and detach sessions.  However, this code 
> is poorly tested and thus currently broken.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7492) Convert LocalDataFrame to LocalMatrix

2015-11-03 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987098#comment-14987098
 ] 

Michael Armbrust commented on SPARK-7492:
-

Are we still trying to get this in for 1.6?

> Convert LocalDataFrame to LocalMatrix
> -
>
> Key: SPARK-7492
> URL: https://issues.apache.org/jira/browse/SPARK-7492
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, SQL
>Reporter: Burak Yavuz
>
> Having a method like, 
> {code:java}
> Matrices.fromDataFrame(df)
> {code}
> would provide users the ability to perform feature selection with DataFrames.
> Users will be able to chain operations like below:
> {code:java}
> import org.apache.spark.mllib.linalg.Matrices
> import org.apache.spark.mllib.stat.Statistics
> import org.apache.spark.sql.DataFrame
> val df = ... // the DataFrame
> val contingencyTable = df.stat.crosstab(col1, col2)
> val ct = Matrices.fromDataFrame(contingencyTable)
> val result: ChiSqTestResult = Statistics.chiSqTest(ct)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11218) `./sbin/start-slave.sh --help` should print out the help message

2015-11-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11218:


Assignee: Apache Spark

> `./sbin/start-slave.sh --help` should print out the help message
> 
>
> Key: SPARK-11218
> URL: https://issues.apache.org/jira/browse/SPARK-11218
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Reporter: Jacek Laskowski
>Assignee: Apache Spark
>Priority: Minor
>
> Reading the sources has showed that the command {{./sbin/start-slave.sh 
> --help}} should print out the help message. It doesn't really.
> {code}
> ➜  spark git:(master) ✗ ./sbin/start-slave.sh --help
> starting org.apache.spark.deploy.worker.Worker, logging to 
> /Users/jacek/dev/oss/spark/sbin/../logs/spark-jacek-org.apache.spark.deploy.worker.Worker-1-japila.local.out
> failed to launch org.apache.spark.deploy.worker.Worker:
> --properties-file FILE   Path to a custom Spark properties file.
>  Default is conf/spark-defaults.conf.
> full log in 
> /Users/jacek/dev/oss/spark/sbin/../logs/spark-jacek-org.apache.spark.deploy.worker.Worker-1-japila.local.out
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11218) `./sbin/start-slave.sh --help` should print out the help message

2015-11-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987100#comment-14987100
 ] 

Apache Spark commented on SPARK-11218:
--

User 'CharlesYeh' has created a pull request for this issue:
https://github.com/apache/spark/pull/9432

> `./sbin/start-slave.sh --help` should print out the help message
> 
>
> Key: SPARK-11218
> URL: https://issues.apache.org/jira/browse/SPARK-11218
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Reporter: Jacek Laskowski
>Priority: Minor
>
> Reading the sources has showed that the command {{./sbin/start-slave.sh 
> --help}} should print out the help message. It doesn't really.
> {code}
> ➜  spark git:(master) ✗ ./sbin/start-slave.sh --help
> starting org.apache.spark.deploy.worker.Worker, logging to 
> /Users/jacek/dev/oss/spark/sbin/../logs/spark-jacek-org.apache.spark.deploy.worker.Worker-1-japila.local.out
> failed to launch org.apache.spark.deploy.worker.Worker:
> --properties-file FILE   Path to a custom Spark properties file.
>  Default is conf/spark-defaults.conf.
> full log in 
> /Users/jacek/dev/oss/spark/sbin/../logs/spark-jacek-org.apache.spark.deploy.worker.Worker-1-japila.local.out
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10823) API design: external state management

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-10823:
-
Target Version/s:   (was: 1.6.0)

> API design: external state management
> -
>
> Key: SPARK-10823
> URL: https://issues.apache.org/jira/browse/SPARK-10823
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10819) Logical plan: determine logical operators needed

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-10819:
-
Target Version/s:   (was: 1.6.0)

> Logical plan: determine logical operators needed
> 
>
> Key: SPARK-10819
> URL: https://issues.apache.org/jira/browse/SPARK-10819
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> Again, it would be great if we could just reuse Spark SQL's existing logical 
> plan. We might need to introduce new logical plans (e.g. windowing, which is 
> different from Spark SQL's).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10820) Physical plan: determine physical operators needed

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-10820:
-
Target Version/s:   (was: 1.6.0)

> Physical plan: determine physical operators needed
> --
>
> Key: SPARK-10820
> URL: https://issues.apache.org/jira/browse/SPARK-10820
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10814) API design: convergence of batch and streaming DataFrame

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-10814:
-
Target Version/s:   (was: 1.6.0)

> API design: convergence of batch and streaming DataFrame
> 
>
> Key: SPARK-10814
> URL: https://issues.apache.org/jira/browse/SPARK-10814
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> This might not be possible since the batch DataFrame has a lot of functions 
> convenient for interactivity that don't make sense in streaming, but it would 
> be great if we could have an abstraction that is common (e.g. a common 
> ancestor) that can be used by other libraries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10813) API design: high level class structuring regarding windowed and non-windowed streams

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-10813:
-
Target Version/s:   (was: 1.6.0)

> API design: high level class structuring regarding windowed and non-windowed 
> streams
> 
>
> Key: SPARK-10813
> URL: https://issues.apache.org/jira/browse/SPARK-10813
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> I can think of 3 high level alternatives for streaming data frames. See
> https://github.com/rxin/spark/pull/17



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10803) Allow users to write and query Parquet user-defined key-value metadata directly

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-10803:
-
Target Version/s:   (was: 1.6.0)

> Allow users to write and query Parquet user-defined key-value metadata 
> directly
> ---
>
> Key: SPARK-10803
> URL: https://issues.apache.org/jira/browse/SPARK-10803
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.0.2, 1.1.1, 1.2.2, 1.3.1, 1.4.1, 1.5.0
>Reporter: Cheng Lian
>
> Currently Spark SQL only allows users to set and get per-column metadata of a 
> DataFrame. This metadata can then be persisted to Parquet as part of the Catalyst 
> schema information contained in the user-defined key-value metadata. It would 
> be nice if we could allow users to write and query Parquet user-defined 
> key-value metadata directly. Or maybe provide a more general way to allow 
> DataFrame-level (rather than column-level) metadata.
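
For reference, a minimal sketch of the column-level route that exists today, assuming {{df}} is some DataFrame with an {{id}} column; what this ticket asks for is a DataFrame-level analogue of the same idea:

{code}
import org.apache.spark.sql.types.MetadataBuilder

// Column-level metadata, which is what Spark SQL exposes today. Writing the
// DataFrame to Parquet persists it as part of the Catalyst schema stored in
// Parquet's user-defined key-value metadata.
val meta   = new MetadataBuilder().putString("comment", "user id column").build()
val tagged = df.select(df("id").as("id", meta))

tagged.schema("id").metadata.getString("comment")  // "user id column"
{code}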



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10816) API design: window and session specification

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-10816:
-
Target Version/s:   (was: 1.6.0)

> API design: window and session specification
> 
>
> Key: SPARK-10816
> URL: https://issues.apache.org/jira/browse/SPARK-10816
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9557) Refactor ParquetFilterSuite and remove old ParquetFilters code

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-9557:

Target Version/s:   (was: 1.6.0)

> Refactor ParquetFilterSuite and remove old ParquetFilters code
> --
>
> Key: SPARK-9557
> URL: https://issues.apache.org/jira/browse/SPARK-9557
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> Currently there are two Parquet filter conversion code paths: one for 
> converting data source {{Filter}}s, the other for converting Catalyst 
> predicate {{Expression}}s, which is used by the removed old Parquet code. We 
> should remove the latter, but {{ParquetFilterSuite}} uses it to test Parquet 
> filter push-down.
> We need to refactor {{ParquetFilterSuite}} to make it test the data source 
> version and then remove the old Parquet filter conversion code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8682) Range Join for Spark SQL

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-8682:

Target Version/s:   (was: 1.6.0)

> Range Join for Spark SQL
> 
>
> Key: SPARK-8682
> URL: https://issues.apache.org/jira/browse/SPARK-8682
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
> Attachments: perf_testing.scala
>
>
> Currently Spark SQL uses a Broadcast Nested Loop join (or a filtered 
> Cartesian Join) when it has to execute the following range query:
> {noformat}
> SELECT A.*,
>B.*
> FROM   tableA A
>JOIN tableB B
> ON A.start <= B.end
>  AND A.end > B.start
> {noformat}
> This is horribly inefficient. The performance of this query can be greatly 
> improved, when one of the tables can be broadcasted, by creating a range 
> index. A range index is basically a sorted map containing the rows of the 
> smaller table, indexed by both the high and low keys. Using this structure, 
> the complexity of the query would go from O(N * M) to O(N * 2 * LOG(M)), N = 
> number of records in the larger table, M = number of records in the smaller 
> (indexed) table.
> I have created a pull request for this. According to the [Spark SQL: 
> Relational Data Processing in 
> Spark|http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf] 
> paper similar work (page 11, section 7.2) has already been done by the ADAM 
> project (cannot locate the code though). 
> Any comments and/or feedback are greatly appreciated.
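
For readers skimming the archive, here is a tiny standalone sketch of the range-index idea described above (plain Scala, not the code in the pull request): keep the broadcast side sorted by its low key, binary-search the probe's end key, and only check the high key on the surviving prefix.

{code}
case class Interval(start: Long, end: Long)

class RangeIndex(rows: Seq[Interval]) {
  private val byStart: Array[Interval] = rows.sortBy(_.start).toArray

  // index of the first element whose start is > bound (binary search, O(log M))
  private def upperBound(bound: Long): Int = {
    var lo = 0; var hi = byStart.length
    while (lo < hi) {
      val mid = (lo + hi) >>> 1
      if (byStart(mid).start <= bound) lo = mid + 1 else hi = mid
    }
    lo
  }

  // all A such that A.start <= b.end && A.end > b.start
  def probe(b: Interval): Seq[Interval] =
    byStart.take(upperBound(b.end)).filter(_.end > b.start)
}

// new RangeIndex(Seq(Interval(0, 10), Interval(5, 20))).probe(Interval(8, 9))
// returns both intervals, matching the join condition in the query above
{code}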



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8641) Native Spark Window Functions

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-8641:

Target Version/s:   (was: 1.6.0)

> Native Spark Window Functions
> -
>
> Key: SPARK-8641
> URL: https://issues.apache.org/jira/browse/SPARK-8641
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Herman van Hovell
>
> *Rationale*
> The window operator currently uses Hive UDAFs for all aggregation operations. 
> This is fine in terms of performance and functionality. However they limit 
> extensibility, and they are quite opaque in terms of processing and memory 
> usage. The latter blocks advanced optimizations such as code generation and 
> tungsten style (advanced) memory management.
> *Requirements*
> We want to address this by replacing the Hive UDAFs with native Spark SQL 
> UDAFs. A redesign of the Spark UDAFs is currently underway, see SPARK-4366. 
> The new window UDAFs should use this new standard, in order to make them as 
> future proof as possible. Although we are replacing the standard Hive UDAFs, 
> other existing Hive UDAFs should still be supported.
> The new window UDAFs should, at least, cover all existing Hive standard 
> window UDAFs:
> # FIRST_VALUE
> # LAST_VALUE
> # LEAD
> # LAG
> # ROW_NUMBER
> # RANK
> # DENSE_RANK
> # PERCENT_RANK
> # NTILE
> # CUME_DIST
> All these functions imply a row order; this means that in order to use these 
> functions properly, an ORDER BY clause must be defined.
> The first and last value UDAFs are already present in Spark SQL. The only 
> thing which needs to be added is skip NULL functionality.
> LEAD and LAG are not aggregates. These expressions return the value of an 
> expression a number of rows before (LAG) or ahead (LEAD) of the current row. 
> These expressions put a constraint on the Window frame in which they are 
> executed: this can only be a Row frame with equal offsets.
> The ROW_NUMBER() function can be seen as a count in a running row frame (ROWS 
> BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW).
> RANK(), DENSE_RANK(), PERCENT_RANK(), NTILE(..) & CUME_DIST() are dependent 
> on the actual values in the ORDER BY clause. The ORDER BY 
> expression(s) must be made available before these functions are evaluated. 
> All these functions will have a fixed frame, but this will be dependent on 
> the implementation (probably a running row frame).
> PERCENT_RANK(), NTILE(..) & CUME_DIST() are also dependent on the size of the 
> partition being evaluated. The partition size must either be made available 
> during evaluation (this is perfectly feasible in the current implementation) 
> or the expression must be divided over two windows and a merging expression, 
> for instance PERCENT_RANK() would look like this:
> {noformat}
> (RANK() OVER (PARTITION BY x ORDER BY y) - 1) / (COUNT(*) OVER (PARTITION BY 
> x) - 1)
> {noformat}
> *Design*
> The old WindowFunction interface will be replaced by the following 
> (initial/very early) design (including sub-classes):
> {noformat}
> /**
>  * A window function is a function that can only be evaluated in the context 
> of a window operator.
>  */
> trait WindowFunction {
>   self: Expression =>
>   /**
>* Define the frame in which the window operator must be executed.
>*/
>   def frame: WindowFrame = UnspecifiedFrame
> }
> /**
>  * Base class for LEAD/LAG offset window functions.
>  *
>  * These are ordinary expressions, the idea is that the Window operator will 
> process these in a
>  * separate (specialized) window frame.
>  */
> abstract class OffsetWindowFunction(val child: Expression, val offset: Int, 
> val default: Expression) {
>   override def deterministic: Boolean = false
>   ...
> }
> case class Lead(child: Expression, offset: Int, default: Expression) extends 
> OffsetWindowFunction(child, offset, default) {
>   override val frame = SpecifiedWindowFrame(RowFrame, ValuePreceding(offset), 
> ValuePreceding(offset))
>   ...
> }
> case class Lag(child: Expression, offset: Int, default: Expression) extends 
> OffsetWindowFunction(child, offset, default) {
>   override val frame = SpecifiedWindowFrame(RowFrame, ValueFollowing(offset), 
> ValueFollowing(offset))
>   ...
> }
> case class RowNumber() extends AlgebraicAggregate with WindowFunction {
>   override def deterministic: Boolean = false
>   override val frame = SpecifiedWindowFrame(RowFrame, UnboundedPreceding, 
> CurrentRow)
>   ...
> }
> abstract class RankLike(val order: Seq[Expression] = Nil) extends 
> AlgebraicAggregate with WindowFunction {
>   override def deterministic: Boolean = true
>   // This can be injected by either the Planner or the Window operator.
>   def withOrderSpec(orderSpec: Seq[Expression]): AggregateWindowFunction
>   // This will be injected by the Window 

[jira] [Updated] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2870:

Target Version/s:   (was: 1.6.0)

> Thorough schema inference directly on RDDs of Python dictionaries
> -
>
> Key: SPARK-2870
> URL: https://issues.apache.org/jira/browse/SPARK-2870
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Reporter: Nicholas Chammas
>
> h4. Background
> I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. 
> They process JSON text directly and infer a schema that covers the entire 
> source data set. 
> This is very important with semi-structured data like JSON since individual 
> elements in the data set are free to have different structures. Matching 
> fields across elements may even have different value types.
> For example:
> {code}
> {"a": 5}
> {"a": "cow"}
> {code}
> To get a queryable schema that covers the whole data set, you need to infer a 
> schema by looking at the whole data set. The aforementioned 
> {{SQLContext.json...()}} methods do this very well. 
> h4. Feature Request
> What we need is for {{SQLContext.inferSchema()}} to do this, too. 
> Alternatively, we need a new {{SQLContext}} method that works on RDDs of 
> Python dictionaries and does something functionally equivalent to this:
> {code}
> SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x)))
> {code}
> As of 1.0.2, 
> [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema]
>  just looks at the first element in the data set. This won't help much when 
> the structure of the elements in the target RDD is variable.
> h4. Example Use Case
> * You have some JSON text data that you want to analyze using Spark SQL. 
> * You would use one of the {{SQLContext.json...()}} methods, but you need to 
> do some filtering on the data first to remove bad elements--basically, some 
> minimal schema validation.
> * You deserialize the JSON objects to Python {{dict}} s and filter out the 
> bad ones. You now have an RDD of dictionaries.
> * From this RDD, you want a SchemaRDD that captures the schema for the whole 
> data set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9850) Adaptive execution in Spark

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-9850:

Target Version/s:   (was: 1.6.0)

> Adaptive execution in Spark
> ---
>
> Key: SPARK-9850
> URL: https://issues.apache.org/jira/browse/SPARK-9850
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core, SQL
>Reporter: Matei Zaharia
>Assignee: Yin Huai
> Attachments: AdaptiveExecutionInSpark.pdf
>
>
> Query planning is one of the main factors in high performance, but the 
> current Spark engine requires the execution DAG for a job to be set in 
> advance. Even with cost-based optimization, it is hard to know the behavior 
> of data and user-defined functions well enough to always get great execution 
> plans. This JIRA proposes to add adaptive query execution, so that the engine 
> can change the plan for each query as it sees what data earlier stages 
> produced.
> We propose adding this to Spark SQL / DataFrames first, using a new API in 
> the Spark engine that lets libraries run DAGs adaptively. In future JIRAs, 
> the functionality could be extended to other libraries or the RDD API, but 
> that is more difficult than adding it in SQL.
> I've attached a design doc by Yin Huai and myself explaining how it would 
> work in more detail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10270) Add/Replace some Java friendly DataFrame API

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust closed SPARK-10270.

Resolution: Won't Fix

> Add/Replace some Java friendly DataFrame API
> 
>
> Key: SPARK-10270
> URL: https://issues.apache.org/jira/browse/SPARK-10270
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Hao
>
> Currently in DataFrame, we have APIs like:
> {code}
> def join(right: DataFrame, usingColumns: Seq[String]): DataFrame
> def dropDuplicates(colNames: Seq[String]): DataFrame
> def dropDuplicates(colNames: Array[String]): DataFrame
> {code}
> Those APIs are not so friendly to Java programmers; change them to:
> {code}
> def join(right: DataFrame, usingColumns: String*): DataFrame
> def dropDuplicates(colNames: String*): DataFrame
> {code}
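
Call-site sketch of the difference (with a hypothetical {{df}} holding "id" and "day" columns); the varargs line is the proposed form and does not compile against the current API:

{code}
// Today's Seq-based overload forces callers, especially from Java, to build a Scala Seq:
val today = df.dropDuplicates(Seq("id", "day"))

// With the proposed varargs signature the same call would read:
// val proposed = df.dropDuplicates("id", "day")
{code}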



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles

2015-11-03 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987046#comment-14987046
 ] 

Daniel Darabos commented on SPARK-1239:
---

I can also add some data. I have a ShuffleMapStage with 82,714 tasks and then a 
ResultStage with 222,609 tasks. The driver cannot serialize this:

{noformat}
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.util.Arrays.copyOf(Arrays.java:2271) ~[na:1.7.0_79]
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) 
~[na:1.7.0_79]
at 
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) 
~[na:1.7.0_79]
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) 
~[na:1.7.0_79]
at 
java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253) 
~[na:1.7.0_79]
at 
java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211) 
~[na:1.7.0_79]
at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:146) 
~[na:1.7.0_79]
at 
java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1893)
 ~[na:1.7.0_79]
at 
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1874)
 ~[na:1.7.0_79]
at 
java.io.ObjectOutputStream$BlockDataOutputStream.flush(ObjectOutputStream.java:1821)
 ~[na:1.7.0_79]
at java.io.ObjectOutputStream.flush(ObjectOutputStream.java:718) 
~[na:1.7.0_79]
at java.io.ObjectOutputStream.close(ObjectOutputStream.java:739) 
~[na:1.7.0_79]
at 
org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$2.apply$mcV$sp(MapOutputTracker.scala:362)
 ~[spark-assembly-1.4.0-hadoop2.4.0.jar:1.4.0]
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1294) 
~[spark-assembly-1.4.0-hadoop2.4.0.jar:1.4.0]
at 
org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:361)
 ~[spark-assembly-1.4.0-hadoop2.4.0.jar:1.4.0]
at 
org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:312)
 ~[spark-assembly-1.4.0-hadoop2.4.0.jar:1.4.0]
at 
org.apache.spark.MapOutputTrackerMasterEndpoint$$anonfun$receiveAndReply$1.applyOrElse(MapOutputTracker.scala:49)
 ~[spark-assembly-1.4.0-hadoop2.4.0.jar:1.4.0]
{noformat}

I see {{getSerializedMapOutputStatuses}} has changed a lot since 1.4.0 but it 
still returns an array sized proportional to _M * R_. How can this be part of a 
scalable system? How is this not a major issue for everyone? Am I doing 
something wrong?

I'm now thinking that maybe if you have an overwhelming majority of empty or 
non-empty blocks, the bitmap will compress very well. But it's possible that I 
am ending up with a relatively even mix of empty and non-empty blocks, killing 
the compression. I have about 40 billion lines, _M * R_ is about 20 billion, so 
this seems plausible.

It's also possible that I should have larger partitions. Due to the processing 
I do it's not possible -- it leads to the executors OOMing. But larger 
partitions would not be a scalable solution anyway. If _M_ and _R_ are 
reasonable now with some number of lines per partition, then when your data 
size doubles they will also double and _M * R_ will quadruple. At some point 
the number of lines per map output will be low enough that compression becomes 
ineffective.

I see https://issues.apache.org/jira/browse/SPARK-11271 has recently decreased 
the map status size by 20%. That means in Spark 1.6 I will be able to process 
1/sqrt(0.8) or 12% more data than now. The way I understand the situation the 
improvement required is orders of magnitude larger than that. I'm currently 
hitting this issue with 5 TB of input. If I tried processing 5 PB, the map 
status would be a million times larger.

I like the premise of this JIRA ticket of not building the map status table in 
the first place. But a colleague of mine asks if perhaps we could even avoid 
tracking this data in the driver. If the driver just provided the reducers with 
the list of mappers they could each just ask the mappers directly for the list 
of blocks they should fetch.
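
A rough back-of-envelope check of the numbers above, assuming roughly one bit per (map, reduce) block survives compression: the serialized statuses already exceed what a single Java byte array can hold, which matches the "Requested array size exceeds VM limit" error.

{code}
val mapTasks    = 82714L
val reduceTasks = 222609L
val blocks      = mapTasks * reduceTasks   // ~1.84e10 (map, reduce) block pairs
val bitmapBytes = blocks / 8               // ~2.3 GB at 1 bit per block

println(f"$blocks%,d blocks -> ~${bitmapBytes / 1e9}%.1f GB of bitmap")
{code}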

> Don't fetch all map output statuses at each reducer during shuffles
> ---
>
> Key: SPARK-1239
> URL: https://issues.apache.org/jira/browse/SPARK-1239
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Patrick Wendell
>
> Instead we should modify the way we fetch map output statuses to take both a 
> mapper and a reducer - or we should just piggyback the statuses on each task. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To 

[jira] [Updated] (SPARK-9701) allow not automatically using HiveContext with spark-shell when hive support built in

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-9701:

Target Version/s:   (was: 1.6.0)

> allow not automatically using HiveContext with spark-shell when hive support 
> built in
> -
>
> Key: SPARK-9701
> URL: https://issues.apache.org/jira/browse/SPARK-9701
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Thomas Graves
>
> I built the Spark jar with Hive support as most of our grids have Hive.  We 
> were bringing up a new YARN cluster that didn't have hive installed on it yet 
> which results in the spark-shell failing to launch:
> java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
> at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:374)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:116)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:163)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:161)
> at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:168)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
> at 
> org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
> It would be nice to have a config or something to tell it not to instantiate 
> a HiveContext.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9783) Use SqlNewHadoopRDD in JSONRelation to eliminate extra refresh() call

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-9783:

Target Version/s:   (was: 1.6.0)

> Use SqlNewHadoopRDD in JSONRelation to eliminate extra refresh() call
> -
>
> Key: SPARK-9783
> URL: https://issues.apache.org/jira/browse/SPARK-9783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> PR #8035 made a quick fix for SPARK-9743 by introducing an extra 
> {{refresh()}} call in {{JSONRelation.buildScan}}. Obviously it hurts 
> performance. To overcome this, we can use {{SqlNewHadoopRDD}} there and 
> override {{listStatus()}} to inject cached {{FileStatus}} instances, similar 
> to what we did in {{ParquetRelation}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9876) Upgrade parquet-mr to 1.8.1

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-9876:

Target Version/s:   (was: 1.6.0)

> Upgrade parquet-mr to 1.8.1
> ---
>
> Key: SPARK-9876
> URL: https://issues.apache.org/jira/browse/SPARK-9876
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>
> {{parquet-mr}} 1.8.1 fixed several issues that affect Spark. For example 
> PARQUET-201 (SPARK-9407).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10621) Audit function names in FunctionRegistry and corresponding method names shown in functions.scala and functions.py

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-10621:
-
Priority: Critical  (was: Major)

> Audit function names in FunctionRegistry and corresponding method names shown 
> in functions.scala and functions.py
> -
>
> Key: SPARK-10621
> URL: https://issues.apache.org/jira/browse/SPARK-10621
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> Right now, there are a few places where we are not very consistent.
> * There are a few functions that are registered in {{FunctionRegistry}}, but 
> not provided in {{functions.scala}} and {{functions.py}}. Examples are 
> {{isnull}} and {{get_json_object}}.
> * There are a few functions that have different names in FunctionRegistry 
> and in the DataFrame API. {{spark_partition_id}} is an example. In 
> FunctionRegistry, it is called {{spark_partition_id}}. But in the DataFrame 
> API, the method is called {{sparkPartitionId}}.
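
A small illustration of the inconsistency, assuming a DataFrame {{df}} and the 1.5-era API:

{code}
import org.apache.spark.sql.functions._

val viaRegistry = df.selectExpr("spark_partition_id()")  // name known to FunctionRegistry
val viaApi      = df.select(sparkPartitionId())          // method name in the DataFrame API
{code}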



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10429) MutableProjection should evaluate all expressions first and then update the mutable row

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-10429.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9422
[https://github.com/apache/spark/pull/9422]

> MutableProjection should evaluate all expressions first and then update the 
> mutable row
> ---
>
> Key: SPARK-10429
> URL: https://issues.apache.org/jira/browse/SPARK-10429
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.6.0
>
>
> Right now, SQL's mutable projection updates every value of the mutable 
> row after it evaluates the corresponding expression. This makes the 
> behavior of MutableProjection confusing and complicates the implementation of 
> common aggregate functions like stddev, because developers need to be aware 
> that when evaluating the {{i+1}}th expression of a mutable projection, the {{i}}th 
> slot of the mutable row has already been updated.
> A better behavior for MutableProjection would be to evaluate all 
> expressions first and then update all values of the mutable row.
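
A plain-Scala sketch of the proposed evaluate-then-update contract (not the actual Catalyst code): all expressions see the unmodified input, and the target is only written afterwards.

{code}
def projectAll[Row](exprs: Seq[Row => Any], input: Row, target: Array[Any]): Unit = {
  val buffered = exprs.map(_.apply(input))  // 1) evaluate every expression first
  var i = 0
  while (i < buffered.length) {             // 2) only then update the target slots
    target(i) = buffered(i)
    i += 1
  }
}
{code}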



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11460) Locality waits should be based on task set creation time, not last launch time

2015-11-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987113#comment-14987113
 ] 

Apache Spark commented on SPARK-11460:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/9433

> Locality waits should be based on task set creation time, not last launch time
> --
>
> Key: SPARK-11460
> URL: https://issues.apache.org/jira/browse/SPARK-11460
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.2.2, 
> 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1
> Environment: YARN
>Reporter: Shengyue Ji
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Spark waits for the spark.locality.wait period before going from RACK_LOCAL to 
> ANY when selecting an executor for assignment. The timeout is essentially 
> reset each time a new assignment is made.
> We were running Spark streaming on Kafka with a 10 second batch window on 32 
> Kafka partitions with 16 executors. All executors were in the ANY group. At 
> one point one RACK_LOCAL executor was added and all tasks were assigned to 
> it. Each task took about 0.6 second to process, resetting the 
> spark.locality.wait timeout (3000ms) repeatedly. This caused the whole 
> process to underutilize resources and created an increasing backlog.
> spark.locality.wait should be based on the task set creation time, not last 
> launch time so that after 3000ms of initial creation, all executors can get 
> tasks assigned to them.
> We are specifying a zero timeout for now as a workaround to disable locality 
> optimization. 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L556
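
The workaround mentioned in the description, expressed as a SparkConf setting (the application name is illustrative):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kafka-streaming-job")   // hypothetical app name
  .set("spark.locality.wait", "0")     // disable the locality-based scheduling delay
{code}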



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11460) Locality waits should be based on task set creation time, not last launch time

2015-11-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11460:


Assignee: (was: Apache Spark)

> Locality waits should be based on task set creation time, not last launch time
> --
>
> Key: SPARK-11460
> URL: https://issues.apache.org/jira/browse/SPARK-11460
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.2.2, 
> 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1
> Environment: YARN
>Reporter: Shengyue Ji
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Spark waits for the spark.locality.wait period before going from RACK_LOCAL to 
> ANY when selecting an executor for assignment. The timeout is essentially 
> reset each time a new assignment is made.
> We were running Spark streaming on Kafka with a 10 second batch window on 32 
> Kafka partitions with 16 executors. All executors were in the ANY group. At 
> one point one RACK_LOCAL executor was added and all tasks were assigned to 
> it. Each task took about 0.6 second to process, resetting the 
> spark.locality.wait timeout (3000ms) repeatedly. This caused the whole 
> process to underutilize resources and created an increasing backlog.
> spark.locality.wait should be based on the task set creation time, not last 
> launch time so that after 3000ms of initial creation, all executors can get 
> tasks assigned to them.
> We are specifying a zero timeout for now as a workaround to disable locality 
> optimization. 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L556



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11412) Support merge schema for ORC

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11412:
-
Summary: Support merge schema for ORC  (was: mergeSchema option not working 
for orc format?)

> Support merge schema for ORC
> 
>
> Key: SPARK-11412
> URL: https://issues.apache.org/jira/browse/SPARK-11412
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dave
>
> When I tried to load partitioned ORC files with a slight difference in a 
> nested column, say the 
> column 
> -- request: struct (nullable = true)
>  ||-- datetime: string (nullable = true)
>  ||-- host: string (nullable = true)
>  ||-- ip: string (nullable = true)
>  ||-- referer: string (nullable = true)
>  ||-- request_uri: string (nullable = true)
>  ||-- uri: string (nullable = true)
>  ||-- useragent: string (nullable = true)
> And then there's a page_url_lists attribute in the later partitions.
> I tried to use
> val s = sqlContext.read.format("orc").option("mergeSchema", 
> "true").load("/data/warehouse/") to load the data.
> But the schema doesn't show request.page_url_lists.
> I am wondering if schema merge doesn't work for orc?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11412) Support merge schema for ORC

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11412:
-
Issue Type: New Feature  (was: Bug)

> Support merge schema for ORC
> 
>
> Key: SPARK-11412
> URL: https://issues.apache.org/jira/browse/SPARK-11412
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Dave
>
> When I tried to load partitioned ORC files with a slight difference in a 
> nested column, say the 
> column 
> -- request: struct (nullable = true)
>  ||-- datetime: string (nullable = true)
>  ||-- host: string (nullable = true)
>  ||-- ip: string (nullable = true)
>  ||-- referer: string (nullable = true)
>  ||-- request_uri: string (nullable = true)
>  ||-- uri: string (nullable = true)
>  ||-- useragent: string (nullable = true)
> And then there's a page_url_lists attribute in the later partitions.
> I tried to use
> val s = sqlContext.read.format("orc").option("mergeSchema", 
> "true").load("/data/warehouse/") to load the data.
> But the schema doesn't show request.page_url_lists.
> I am wondering if schema merge doesn't work for orc?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9357) Remove JoinedRow

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-9357:

Target Version/s:   (was: 1.6.0)

> Remove JoinedRow
> 
>
> Key: SPARK-9357
> URL: https://issues.apache.org/jira/browse/SPARK-9357
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Reporter: Reynold Xin
>
> JoinedRow was introduced to join two rows together, in aggregation (join key 
> and value), joins (left, right), window functions, etc.
> It aims to reduce the amount of data copied, but incurs branches when the row 
> is actually read. Given all the fields will be read almost all the time 
> (otherwise they get pruned out by the optimizer), the branch predictor cannot do 
> anything about those branches.
> I think a better way is just to remove this thing and materialize the row 
> data directly.
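
A toy sketch of why every field read through a joined row branches (this mirrors the idea, not the actual JoinedRow source):

{code}
class ToyJoinedRow(left: Array[Any], right: Array[Any]) {
  def get(i: Int): Any =
    if (i < left.length) left(i)            // branch taken on every single read
    else right(i - left.length)
}
{code}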



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-9487:

Target Version/s:   (was: 1.6.0)

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., a random number 
> generator, this will lead to different results in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.
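
For comparison, pinning the same parallelism in a Scala/Java test is just a matter of the master string (names here are illustrative):

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[4]").setAppName("unit-test")  // match Python's local[4]
val sc   = new SparkContext(conf)
{code}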



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11436) we should rebind right encoder when join 2 datasets

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-11436.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9391
[https://github.com/apache/spark/pull/9391]

> we should rebind right encoder when join 2 datasets
> ---
>
> Key: SPARK-11436
> URL: https://issues.apache.org/jira/browse/SPARK-11436
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11436) we should rebind right encoder when join 2 datasets

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11436:
-
Assignee: Wenchen Fan

> we should rebind right encoder when join 2 datasets
> ---
>
> Key: SPARK-11436
> URL: https://issues.apache.org/jira/browse/SPARK-11436
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10681) DateTimeUtils needs a method to parse string to SQL's timestamp value

2015-11-03 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987067#comment-14987067
 ] 

Michael Armbrust commented on SPARK-10681:
--

Can we bump this from 1.6?

> DateTimeUtils needs a method to parse string to SQL's timestamp value
> -
>
> Key: SPARK-10681
> URL: https://issues.apache.org/jira/browse/SPARK-10681
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> Right now, {{DateTimeUtils.stringToTime}} returns a java.util.Date whose 
> getTime returns milliseconds. It would be great if we had a method to parse a 
> string directly to SQL's timestamp value (microseconds).
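
A minimal sketch of the stop-gap available today, assuming the existing {{stringToTime}} helper: go through java.util.Date's millisecond value and scale to microseconds, losing sub-millisecond precision along the way, which is exactly what a direct parser would avoid.

{code}
import org.apache.spark.sql.catalyst.util.DateTimeUtils

val micros: Long = DateTimeUtils.stringToTime("2015-11-03 12:34:56.789").getTime * 1000L
{code}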



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9604) Unsafe ArrayData and MapData is very very slow

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-9604:

Target Version/s:   (was: 1.6.0)

> Unsafe ArrayData and MapData is very very slow
> --
>
> Key: SPARK-9604
> URL: https://issues.apache.org/jira/browse/SPARK-9604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Wenchen Fan
>Priority: Critical
>
> After the unsafe ArrayData and MapData were merged in, this test became very slow 
> (from less than 1 second to more than 35 seconds).
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/3157/testReport/org.apache.spark.sql.columnar/InMemoryColumnarQuerySuite/test_different_data_types/history/
> I tried to disable the cache; it's still very slow (almost the same). Once 
> ArrayData and ArrayMap are removed, it becomes much faster (still takes about 10 
> seconds).
> Related changes: 
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/3148/changes
> Also the duration of Hive tests increased from 32min to 45min 
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/3154/testReport/junit/org.apache.spark.sql.hive.execution/history/
> cc [~rxin]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5517) Add input types for Java UDFs

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-5517:

Target Version/s:   (was: 1.6.0)

> Add input types for Java UDFs
> -
>
> Key: SPARK-5517
> URL: https://issues.apache.org/jira/browse/SPARK-5517
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Michael Armbrust
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4366) Aggregation Improvement

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4366:

Target Version/s:   (was: 1.6.0)

> Aggregation Improvement
> ---
>
> Key: SPARK-4366
> URL: https://issues.apache.org/jira/browse/SPARK-4366
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Critical
> Attachments: aggregatefunction_v1.pdf
>
>
> This improvement actually includes a couple of sub-tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10129) math function: stddev_samp

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-10129:
-
Target Version/s:   (was: 1.6.0)

> math function: stddev_samp
> --
>
> Key: SPARK-10129
> URL: https://issues.apache.org/jira/browse/SPARK-10129
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>
> Use the STDDEV_SAMP function to return the sample standard deviation (the 
> square root of the sample variance).
> http://www-01.ibm.com/support/knowledgecenter/SSPT3X_3.0.0/com.ibm.swg.im.infosphere.biginsights.bigsql.doc/doc/bsql_stdev_samp.html
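
Usage sketch for the requested function, assuming a registered table {{t}} with a numeric column {{x}}:

{code}
val sampleStddev = sqlContext.sql("SELECT stddev_samp(x) FROM t")
{code}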



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9218) Falls back to getAllPartitions when getPartitionsByFilter fails

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-9218:

Target Version/s:   (was: 1.6.0)

> Falls back to getAllPartitions when getPartitionsByFilter fails
> ---
>
> Key: SPARK-9218
> URL: https://issues.apache.org/jira/browse/SPARK-9218
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> [PR #7492|https://github.com/apache/spark/pull/7492] enables Hive partition 
> predicate push-down by leveraging {{Hive.getPartitionsByFilter}}. Although 
> this optimization is pretty effective, we did observe some failures like this:
> {noformat}
> java.sql.SQLDataException: Invalid character string format for type DECIMAL.
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeStatement(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeQuery(Unknown Source)
>   at 
> com.jolbox.bonecp.PreparedStatementHandle.executeQuery(PreparedStatementHandle.java:174)
>   at 
> org.datanucleus.store.rdbms.ParamLoggingPreparedStatement.executeQuery(ParamLoggingPreparedStatement.java:381)
>   at 
> org.datanucleus.store.rdbms.SQLController.executeStatementQuery(SQLController.java:504)
>   at 
> org.datanucleus.store.rdbms.query.SQLQuery.performExecute(SQLQuery.java:280)
>   at org.datanucleus.store.query.Query.executeQuery(Query.java:1786)
>   at 
> org.datanucleus.store.query.AbstractSQLQuery.executeWithArray(AbstractSQLQuery.java:339)
>   at org.datanucleus.api.jdo.JDOQuery.executeWithArray(JDOQuery.java:312)
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreDirectSql.getPartitionsViaSqlFilterInternal(MetaStoreDirectSql.java:300)
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreDirectSql.getPartitionsViaSqlFilter(MetaStoreDirectSql.java:211)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore$4.getSqlResult(ObjectStore.java:2320)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore$4.getSqlResult(ObjectStore.java:2317)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore$GetHelper.run(ObjectStore.java:2208)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByFilterInternal(ObjectStore.java:2317)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByFilter(ObjectStore.java:2165)
>   at sun.reflect.GeneratedMethodAccessor126.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:108)
>   at com.sun.proxy.$Proxy21.getPartitionsByFilter(Unknown Source)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_partitions_by_filter(HiveMetaStore.java:3760)
>   at sun.reflect.GeneratedMethodAccessor125.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105)
>   at com.sun.proxy.$Proxy23.get_partitions_by_filter(Unknown Source)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsByFilter(HiveMetaStoreClient.java:903)
>   at sun.reflect.GeneratedMethodAccessor124.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
>   at com.sun.proxy.$Proxy24.listPartitionsByFilter(Unknown Source)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.getPartitionsByFilter(Hive.java:1944)
>   at sun.reflect.GeneratedMethodAccessor123.invoke(Unknown Source)
>   at 
> 
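
A plain-Scala sketch of the proposed fallback behaviour (names are illustrative, this is not the actual Hive client code): try the metastore's filtered listing first and degrade to listing everything plus client-side pruning when it throws.

{code}
def partitionsByFilterWithFallback[T](
    byFilter: () => Seq[T],           // stands in for Hive.getPartitionsByFilter
    all: () => Seq[T],                // stands in for Hive.getAllPartitions
    predicate: T => Boolean): Seq[T] =
  try byFilter()
  catch { case _: Exception => all().filter(predicate) }
{code}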

[jira] [Updated] (SPARK-3864) Specialize join for tables with unique integer keys

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3864:

Target Version/s:   (was: 1.6.0)

> Specialize join for tables with unique integer keys
> ---
>
> Key: SPARK-3864
> URL: https://issues.apache.org/jira/browse/SPARK-3864
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> We can create a new operator that uses an array as the underlying storage to 
> avoid hash lookups entirely for dimension tables that have integer keys.
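
A toy sketch of the idea (plain Scala, not an actual operator): for a dimension table with unique, reasonably dense Int keys, a plain array replaces the hash map, so each probe is a bounds check plus an array read.

{code}
case class Dim(key: Int, name: String)

def buildArrayIndex(dim: Seq[Dim], maxKey: Int): Array[Dim] = {
  val index = new Array[Dim](maxKey + 1)
  dim.foreach(d => index(d.key) = d)   // keys are unique, so no collision handling
  index
}

def lookup(index: Array[Dim], key: Int): Option[Dim] =
  if (key >= 0 && key < index.length) Option(index(key)) else None
{code}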



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3863) Cache broadcasted tables and reuse them across queries

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3863:

Target Version/s:   (was: 1.6.0)

> Cache broadcasted tables and reuse them across queries
> --
>
> Key: SPARK-3863
> URL: https://issues.apache.org/jira/browse/SPARK-3863
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> There is no point re-broadcasting the same dataset every time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6377) Set the number of shuffle partitions for Exchange operator automatically based on the size of input tables and the reduce-side operation.

2015-11-03 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987104#comment-14987104
 ] 

Michael Armbrust commented on SPARK-6377:
-

How does this relate to what we have done?  Are we still aiming for Spark 1.6?

> Set the number of shuffle partitions for Exchange operator automatically 
> based on the size of input tables and the reduce-side operation.
> -
>
> Key: SPARK-6377
> URL: https://issues.apache.org/jira/browse/SPARK-6377
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>
> It would be helpful to automatically set the number of shuffle partitions 
> based on the size of input tables and the operation at the reduce side for an 
> Exchange operator.
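
Today the reduce-side parallelism of every Exchange comes from a single static setting; a minimal sketch of what users tune by hand while this remains manual, assuming the {{sqlContext}} from the shell:

{code}
sqlContext.setConf("spark.sql.shuffle.partitions", "400")  // default is 200
{code}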



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3860) Improve dimension joins

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3860:

Target Version/s:   (was: 1.6.0)

> Improve dimension joins
> ---
>
> Key: SPARK-3860
> URL: https://issues.apache.org/jira/browse/SPARK-3860
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> This is an umbrella ticket for improving performance for joining multiple 
> dimension tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11328) Correctly propagate error message in the case of failures when writing parquet

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11328:
-
Target Version/s:   (was: 1.6.0)

> Correctly propagate error message in the case of failures when writing parquet
> --
>
> Key: SPARK-11328
> URL: https://issues.apache.org/jira/browse/SPARK-11328
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>
> When saving data to S3 (e.g. saving to parquet), if there is an error during 
> the query execution, the partial file generated by the failed task will be 
> uploaded to S3 and the retries of this task will throw a file-already-exists 
> error. It is very confusing to users because they may think that the 
> file-already-exists error is the error causing the job failure. They can only 
> find the real error in the Spark UI (in the stage page).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8513) _temporary may be left undeleted when a write job committed with FileOutputCommitter fails due to a race condition

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-8513:

Target Version/s:   (was: 1.6.0)

> _temporary may be left undeleted when a write job committed with 
> FileOutputCommitter fails due to a race condition
> --
>
> Key: SPARK-8513
> URL: https://issues.apache.org/jira/browse/SPARK-8513
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.2.2, 1.3.1, 1.4.0
>Reporter: Cheng Lian
>
> To reproduce this issue, we need a node with relatively more cores, say 32 
> (e.g., Spark Jenkins builder is a good candidate).  With such a node, the 
> following code should be relatively easy to reproduce this issue:
> {code}
> sqlContext.range(0, 10).repartition(32).select('id / 
> 0).write.mode("overwrite").parquet("file:///tmp/foo")
> {code}
> You may observe similar log lines as below:
> {noformat}
> 01:58:27.682 pool-1-thread-1-ScalaTest-running-CommitFailureTestRelationSuite 
> WARN FileUtil: Failed to delete file or dir 
> [/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-a918b285-fa59-4a29-857e-a95e38fa355a/_temporary/0/_temporary]:
>  it still exists.
> {noformat}
> The reason is that, for a Spark job with multiple tasks, when a task fails 
> after multiple retries, the job gets canceled on driver side.  At the same 
> time, all child tasks of this job also get canceled.  However, task 
> cancellation is asynchronous.  This means some tasks may still be running 
> when the job is already killed on driver side.
> With this in mind, the following execution order may cause the log line 
> mentioned above:
> # Job {{A}} spawns 32 tasks to write the Parquet file
>   Since {{ParquetOutputCommitter}} is a subclass of {{FileOutputCommitter}}, a 
> temporary directory {{D1}} is created to hold output files of different task 
> attempts.
> # Task {{a1}} fails after several retries first because of the division by 
> zero error
> # Task {{a1}} aborts the Parquet write task and tries to remove its task 
> attempt output directory {{d1}} (a sub-directory of {{D1}})
> # Job {{A}} gets canceled on driver side, all the other 31 tasks also get 
> canceled *asynchronously*
> # {{ParquetOutputCommitter.abortJob()}} tries to remove {{D1}} by first 
> removing all its child files/directories first
>   Note that when testing with a local directory, {{RawLocalFileSystem}} simply 
> calls {{java.io.File.delete()}} to perform the deletion, and only empty 
> directories can be deleted.
> # Because tasks are canceled asynchronously, some other task, say {{a2}}, may 
> just get scheduled and create its own task attempt directory {{d2}} under 
> {{D1}}
> # Now {{ParquetOutputCommitter.abortJob()}} tries to finally remove {{D1}} 
> itself, but fails because {{d2}} makes {{D1}} non-empty again
> Notice that this bug affects all Spark jobs that write files with 
> {{FileOutputCommitter}} and its subclasses, which create and delete temporary 
> directories.
> One possible way to fix this issue is to make task cancellation 
> synchronous, but this also increases latency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10345) Flaky test: HiveCompatibilitySuite.nonblock_op_deduplicate

2015-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-10345:
-
Target Version/s:   (was: 1.6.0)

> Flaky test: HiveCompatibilitySuite.nonblock_op_deduplicate
> --
>
> Key: SPARK-10345
> URL: https://issues.apache.org/jira/browse/SPARK-10345
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41759/testReport/org.apache.spark.sql.hive.execution/HiveCompatibilitySuite/nonblock_op_deduplicate/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11412) Support merge schema for ORC

2015-11-03 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987127#comment-14987127
 ] 

Michael Armbrust commented on SPARK-11412:
--

This is only currently supported for parquet.
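
For comparison, a minimal sketch of the Parquet path where this option is honored today (the path is illustrative):

{code}
val merged = sqlContext.read.option("mergeSchema", "true").parquet("/data/warehouse/")
{code}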

> Support merge schema for ORC
> 
>
> Key: SPARK-11412
> URL: https://issues.apache.org/jira/browse/SPARK-11412
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Dave
>
> When I tried to load partitioned ORC files with a slight difference in a 
> nested column, say the 
> column 
> -- request: struct (nullable = true)
>  ||-- datetime: string (nullable = true)
>  ||-- host: string (nullable = true)
>  ||-- ip: string (nullable = true)
>  ||-- referer: string (nullable = true)
>  ||-- request_uri: string (nullable = true)
>  ||-- uri: string (nullable = true)
>  ||-- useragent: string (nullable = true)
> And then there's a page_url_lists attribute in the later partitions.
> I tried to use
> val s = sqlContext.read.format("orc").option("mergeSchema", 
> "true").load("/data/warehouse/") to load the data.
> But the schema doesn't show request.page_url_lists.
> I am wondering if schema merge doesn't work for orc?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


