[jira] [Updated] (SPARK-15691) Refactor and improve Hive support
[ https://issues.apache.org/jira/browse/SPARK-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-15691:
--------------------------------
    Summary: Refactor and improve Hive support  (was: Refactor Hive support)

> Refactor and improve Hive support
> ---------------------------------
>
>                 Key: SPARK-15691
>                 URL: https://issues.apache.org/jira/browse/SPARK-15691
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>
> Hive support is important to Spark SQL, as many Spark users use it to read
> from Hive. The current architecture is very difficult to maintain, and this
> ticket tracks progress towards getting us to a sane state.
> A number of things we want to

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14343) Dataframe operations on a partitioned dataset (using partition discovery) return invalid results
[ https://issues.apache.org/jira/browse/SPARK-14343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian reassigned SPARK-14343:
----------------------------------
    Assignee: Cheng Lian

> Dataframe operations on a partitioned dataset (using partition discovery)
> return invalid results
> --------------------------------------------------------------------------
>
>                 Key: SPARK-14343
>                 URL: https://issues.apache.org/jira/browse/SPARK-14343
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1, 2.0.0
>         Environment: Mac OS X 10.11.4 / Ubuntu 16.04 LTS
>            Reporter: Jurriaan Pruis
>            Assignee: Cheng Lian
>            Priority: Critical
>
> When reading a dataset using {{sqlContext.read.text()}} queries on the
> partitioned column return invalid results.
> h2. How to reproduce:
> h3. Generate datasets
> {code:title=repro.sh}
> #!/bin/sh
> mkdir -p dataset/year=2014
> mkdir -p dataset/year=2015
> echo "data from 2014" > dataset/year=2014/part01.txt
> echo "data from 2015" > dataset/year=2015/part01.txt
> {code}
> {code:title=repro2.sh}
> #!/bin/sh
> mkdir -p dataset2/month=june
> mkdir -p dataset2/month=july
> echo "data from june" > dataset2/month=june/part01.txt
> echo "data from july" > dataset2/month=july/part01.txt
> {code}
> h3. using first dataset
> {code:none}
> >>> df = sqlContext.read.text('dataset')
> ...
> >>> df
> DataFrame[value: string, year: int]
> >>> df.show()
> +--------------+----+
> |         value|year|
> +--------------+----+
> |data from 2014|2014|
> |data from 2015|2015|
> +--------------+----+
> >>> df.select('year').show()
> +----+
> |year|
> +----+
> |  14|
> |  14|
> +----+
> {code}
> This is clearly wrong. Seems like it returns the length of the value column?
> h3. using second dataset
> With another dataset it looks like this:
> {code:none}
> >>> df = sqlContext.read.text('dataset2')
> >>> df
> DataFrame[value: string, month: string]
> >>> df.show()
> +--------------+-----+
> |         value|month|
> +--------------+-----+
> |data from june| june|
> |data from july| july|
> +--------------+-----+
> >>> df.select('month').show()
> +--------------+
> |         month|
> +--------------+
> |data from june|
> |data from july|
> +--------------+
> {code}
> Here it returns the value of the value column instead of the month partition.
> h3. Workaround
> When I convert the dataframe to an RDD and back to a DataFrame I get the
> following result (which is the expected behaviour):
> {code:none}
> >>> df.rdd.toDF().select('month').show()
> +-----+
> |month|
> +-----+
> | june|
> | july|
> +-----+
> {code}
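For background on what the buggy queries above should return: partition discovery derives extra columns from `name=value` directory components of each file's path. A minimal sketch of that inference, illustrative only and not Spark's actual implementation:

```python
import os

def discover_partitions(path):
    """Infer partition columns from 'name=value' directory components of a path."""
    partitions = {}
    for component in path.split(os.sep):
        if "=" in component:
            name, _, value = component.partition("=")
            # Spark would additionally try to cast the value (e.g. '2014' -> int);
            # this sketch keeps every value as a string for simplicity.
            partitions[name] = value
    return partitions

# A file under dataset/year=2014/ carries the partition value year=2014,
# which is what df.select('year') should return for rows read from it.
print(discover_partitions("dataset/year=2014/part01.txt"))   # {'year': '2014'}
print(discover_partitions("dataset2/month=june/part01.txt")) # {'month': 'june'}
```

The bug reported here is not in this inference (the schema `DataFrame[value: string, year: int]` is correct); it is in how the projected partition column gets bound to the wrong underlying data.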
[jira] [Updated] (SPARK-15691) Refactor Hive support
[ https://issues.apache.org/jira/browse/SPARK-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-15691:
--------------------------------
    Description:
Hive support is important to Spark SQL, as many Spark users use it to read from Hive. The current architecture is very difficult to maintain, and this ticket tracks progress towards getting us to a sane state.

A number of things we want to

  was:
Hive support is important to Spark SQL, as many Spark users use it to read from Hive. The current architecture is very difficult to maintain, and this ticket tracks progress towards getting us to a sane state.

> Refactor Hive support
> ---------------------
>
>                 Key: SPARK-15691
>                 URL: https://issues.apache.org/jira/browse/SPARK-15691
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>
> Hive support is important to Spark SQL, as many Spark users use it to read
> from Hive. The current architecture is very difficult to maintain, and this
> ticket tracks progress towards getting us to a sane state.
> A number of things we want to
[jira] [Created] (SPARK-15691) Refactor Hive support
Reynold Xin created SPARK-15691:
-----------------------------------
             Summary: Refactor Hive support
                 Key: SPARK-15691
                 URL: https://issues.apache.org/jira/browse/SPARK-15691
             Project: Spark
          Issue Type: New Feature
          Components: SQL
            Reporter: Reynold Xin

Hive support is important to Spark SQL, as many Spark users use it to read from Hive. The current architecture is very difficult to maintain, and this ticket tracks progress towards getting us to a sane state.
[jira] [Resolved] (SPARK-14441) Consolidate DDL tests
[ https://issues.apache.org/jira/browse/SPARK-14441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai resolved SPARK-14441.
------------------------------
    Resolution: Later

Seems we will not do it for 2.0.0. Let's resolve it as "Later".

> Consolidate DDL tests
> ---------------------
>
>                 Key: SPARK-14441
>                 URL: https://issues.apache.org/jira/browse/SPARK-14441
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL, Tests
>    Affects Versions: 2.0.0
>            Reporter: Andrew Or
>            Assignee: Andrew Or
>
> Today we have DDLSuite, DDLCommandSuite, HiveDDLCommandSuite. It's confusing
> whether a test should exist in one or the other. It also makes it less clear
> whether our test coverage is comprehensive. Ideally we should consolidate
> these files as much as possible.
[jira] [Resolved] (SPARK-14118) Implement DDL/DML commands for Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai resolved SPARK-14118.
------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0

> Implement DDL/DML commands for Spark 2.0
> ----------------------------------------
>
>                 Key: SPARK-14118
>                 URL: https://issues.apache.org/jira/browse/SPARK-14118
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Yin Huai
>            Priority: Blocker
>             Fix For: 2.0.0
>
> Right now, we have many DDL/DML commands that are passed to Hive, which may
> cause missing functionality, failures with bad error messages, or
> inconsistent behaviors (e.g. a command that works with some cases but fails
> for other cases). For Spark 2.0, it will be great to not ask Hive to process
> those DDL/DML commands.
> You can find the doc at
> https://issues.apache.org/jira/secure/attachment/12793435/Implementing%20native%20DDL%20and%20DML%20statements%20for%20Spark%202.pdf
> (under SPARK-13879).
> There are mainly two kinds of commands: (1) Native, i.e. we want to have a
> native implementation in Spark; (2) Exception, i.e. we should throw an
> exception. That doc has a few commands that are marked as TBD. We should
> first throw exceptions for them.
> Sub-tasks are created based on the doc. A command is represented by its
> corresponding Token.
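The Native-vs-Exception split described in the ticket above can be pictured as a simple dispatch table. This is purely illustrative: the command names below are hypothetical examples, not taken from the linked design doc or from Spark's parser.

```python
# Hypothetical classification of DDL commands (illustrative only).
NATIVE_COMMANDS = {"CREATE TABLE", "DROP TABLE", "ALTER TABLE RENAME"}
EXCEPTION_COMMANDS = {"CREATE INDEX", "ALTER TABLE ARCHIVE PARTITION"}

def dispatch(command):
    """Route a DDL command: run it natively in Spark, or fail fast with a clear error
    instead of silently handing it to Hive."""
    if command in NATIVE_COMMANDS:
        return f"running {command} natively"
    if command in EXCEPTION_COMMANDS:
        raise NotImplementedError(f"{command} is not supported")
    # TBD commands also throw for now, per the ticket.
    raise NotImplementedError(f"unrecognized DDL command: {command}")

print(dispatch("CREATE TABLE"))  # running CREATE TABLE natively
```

The point of the design is the explicit error path: a command that Spark cannot execute natively produces an immediate, descriptive exception rather than an inconsistent pass-through behavior.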
[jira] [Updated] (SPARK-15690) Fast single-node (single-process) in-memory shuffle
[ https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-15690:
--------------------------------
    Description:
Spark's current shuffle implementation sorts all intermediate data by their partition id, and then writes the data to disk. This is not a big bottleneck because the network throughput on commodity clusters tends to be low.

However, an increasing number of Spark users are using the system to process data on a single node. When operating on a single node against intermediate data that fits in memory, the existing shuffle code path can become a big bottleneck.

The goal of this ticket is to change Spark so it can use in-memory radix sort to do data shuffling on a single node, and still gracefully fall back to disk if the data size does not fit in memory. Given that the number of partitions is usually small (say less than 256), it'd require only a single pass to do the radix sort with pretty decent CPU efficiency.

Note that there have been many in-memory shuffle attempts in the past. This ticket has a smaller scope (single-process), and aims to actually productionize this code.

  was:
An increasing number of Spark users are using the system to process data on a single node. When operating on a single node against intermediate data that fits in memory, the existing shuffle code path can become a big bottleneck. Ideally, Spark should be able to use in-memory radix sort to do data shuffling on a single node

> Fast single-node (single-process) in-memory shuffle
> ---------------------------------------------------
>
>                 Key: SPARK-15690
>                 URL: https://issues.apache.org/jira/browse/SPARK-15690
>             Project: Spark
>          Issue Type: New Feature
>          Components: Shuffle, SQL
>            Reporter: Reynold Xin
>
> Spark's current shuffle implementation sorts all intermediate data by their
> partition id, and then writes the data to disk. This is not a big bottleneck
> because the network throughput on commodity clusters tends to be low.
> However, an increasing number of Spark users are using the system to process
> data on a single node. When operating on a single node against intermediate
> data that fits in memory, the existing shuffle code path can become a big
> bottleneck.
> The goal of this ticket is to change Spark so it can use in-memory radix sort
> to do data shuffling on a single node, and still gracefully fall back to disk
> if the data size does not fit in memory. Given that the number of partitions
> is usually small (say less than 256), it'd require only a single pass to do
> the radix sort with pretty decent CPU efficiency.
> Note that there have been many in-memory shuffle attempts in the past. This
> ticket has a smaller scope (single-process), and aims to actually
> productionize this code.
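To illustrate why a small partition count makes this cheap: with fewer than 256 partitions, one histogram pass plus one scatter pass orders every record by partition id, with no comparison sort. A sketch in Python (the real implementation would operate on binary records in managed memory, not Python tuples):

```python
def sort_by_partition(records, num_partitions):
    """Order (partition_id, payload) records by partition id in two linear passes,
    i.e. a one-digit radix (counting) sort keyed on the partition id."""
    # Pass 1: histogram of records per partition.
    counts = [0] * num_partitions
    for pid, _ in records:
        counts[pid] += 1

    # Exclusive prefix sum gives each partition's start offset in the output.
    offsets = [0] * num_partitions
    for p in range(1, num_partitions):
        offsets[p] = offsets[p - 1] + counts[p - 1]

    # Pass 2: scatter each record into its partition's slot (stable).
    out = [None] * len(records)
    cursors = list(offsets)
    for rec in records:
        pid = rec[0]
        out[cursors[pid]] = rec
        cursors[pid] += 1
    return out

records = [(2, "c"), (0, "a"), (1, "b"), (0, "d")]
print(sort_by_partition(records, 3))
# [(0, 'a'), (0, 'd'), (1, 'b'), (2, 'c')]
```

Because the key space (partition ids) is tiny, the histogram fits in cache and the whole operation is memory-bandwidth bound, which is the CPU-efficiency argument made in the ticket.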
[jira] [Comment Edited] (SPARK-15582) Support for Groovy closures
[ https://issues.apache.org/jira/browse/SPARK-15582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308036#comment-15308036 ]

Catalin Alexandru Zamfir edited comment on SPARK-15582 at 6/1/16 5:45 AM:
--------------------------------------------------------------------------

So here's the work-around; I wish there were a way not to do it this way. In short, we are:
- taking the script by text and using CompilationUnit (from Groovy) to compile the script to bytecode. This must be done on the driver node before evaluating the script;
- for each class from the above CU, we append it to the JAR on a distributed FS. We give each JAR a unique name, later passed on to addJar. This is just for isolation purposes so that each Script gets its own JAR;
- the resulting JAR is added to the Spark context using addJar;
- all closures must be dehydrated, which comes with its limitations, and nested closures don't seem to work;
- however, if you design your code around the problem of Groovy closures, using interfaces/abstract classes, and you extend or implement those classes in Groovy code, it seems there is no limitation on what you can do (eg. an AbstractFilter class/interface);

{noformat}
// Compile it
Long dateOfNow = DateTime.now ().getMillis ();
String nameOfScript = String.format ("ScriptOf%d", dateOfNow);
String pathOfJar = String.format ("distributed/fs/path/to/your/groovy-scripts/%s.jar", String.format ("JarOf%d", dateOfNow));
File archiveFile = new File (pathOfJar);
Files.createParentDirs (archiveFile);

// With resources
List compilationList = compileGroovyScript (nameOfScript, sourceCode);
try (JarArchiveOutputStream oneJar = new JarArchiveOutputStream (new FileOutputStream (new File (pathOfJar)))) {
    // For each compiled class, append its bytecode as a JAR entry
    for (Object compileClass : compilationList) {
        GroovyClass groovyClass = (GroovyClass) compileClass;
        JarArchiveEntry oneJarEntry = new JarArchiveEntry (String.format ("%s.class", groovyClass.getName ()));
        oneJarEntry.setSize (groovyClass.getBytes ().length);
        byte[] bytecodeOfClass = groovyClass.getBytes ();
        oneJar.putArchiveEntry (oneJarEntry);
        oneJar.write (bytecodeOfClass);
        oneJar.closeArchiveEntry ();
    }

    // End it up
    oneJar.finish ();
    oneJar.close ();
} catch (Exception e) {
    // Do something
}

// Append the JAR to the execution environment
sparkService.getSparkContext ().addJar (pathOfJar);

// Now we evaluate the previously compiled script below
// The executor tasks should by now have all *.class files in the JAR available
GroovyShell.evaluate (scriptText, nameOfScript);
{noformat}

Any idea on how this can be improved? (eg. not using the addJar method and the requirement to dehydrate the Groovy closures) For now this works for us, after SPARK-13599 was fixed. However, the extra code required to make it work could be a part of Spark instead, maybe using the broadcast mechanism or having a shared "cache" of classes around the cluster from where all the executor nodes can find/load-up dynamically compiled classes (eg. Groovy byte-code).

was (Author: antauri):
So here's the work-around, wished there was a way not to do it this way. In short, we are:
- taking the script by text and using CompilationUnit (from Groovy) to compile the script to bytecode. This must be done on the driver node before evaluating the script;
- for each class from the above CU, we append it to the JAR on a distributed FS. We give each JAR an unique name, later passed on to addJar. This is just for isolation purposes so that each Script gets its own JAR;
- the resulting JAR is added to the Spark context using addJar;
- all closures must by dehydrated () which comes with its limitations and also nested closures don't seem to work;
- however, if you design your code around the problem of Groovy closures, using interfaces/abstract classes and you extend or implement those classes in Groovy code it seems there is no limitation on what you can do (eg. an AbstractFilter class/interface);
{noformat}
// Compile it
Long dateOfNow = DateTime.now ().getMillis ();
String nameOfScript = String.format ("ScriptOf%d", dateOfNow);
String pathOfJar = String.format ("distributed/fs/path/to/your/groovy-scripts/%s.jar", String.format ("JarOf%d", dateOfNow));
File archiveFile = new File (pathOfJar);
Files.createParentDirs (archiveFile);
// With resources
List compilationList = compileGroovyScript (nameOfScript, sourceCode);
try (JarArchiveOutputStream oneJar = new JarArchiveOutputStream (new FileOutputStream (new File (pathOfJar
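For readers outside the JVM ecosystem, the packaging step in the comment above amounts to writing each class's bytecode into a zip archive under a `path/ClassName.class` entry name (a JAR is a plain zip). A rough sketch of that step in Python; the function name and inputs are hypothetical:

```python
import zipfile

def build_jar(jar_path, compiled_classes):
    """Write compiled class bytecode into a JAR (a plain zip archive).

    compiled_classes: mapping of fully-qualified class name -> bytecode bytes,
    standing in for the GroovyClass objects produced by the CompilationUnit.
    """
    with zipfile.ZipFile(jar_path, "w") as jar:
        for class_name, bytecode in compiled_classes.items():
            # JAR entries use '/' separators and a '.class' suffix.
            entry = class_name.replace(".", "/") + ".class"
            jar.writestr(entry, bytecode)

# Hypothetical usage: one freshly compiled class, bundled for addJar().
build_jar("ScriptOf123.jar", {"scripts.ScriptOf123": b"\xca\xfe\xba\xbe"})
with zipfile.ZipFile("ScriptOf123.jar") as jar:
    print(jar.namelist())  # ['scripts/ScriptOf123.class']
```

The per-script JAR name (here `ScriptOf123.jar`) mirrors the isolation choice in the comment: each script's classes travel in their own archive, so executors never see stale classes from earlier scripts.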
[jira] [Updated] (SPARK-15690) Fast single-node (single-process) in-memory shuffle
[ https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-15690:
--------------------------------
    Summary: Fast single-node (single-process) in-memory shuffle  (was: Fast single-node in-memory shuffle)

> Fast single-node (single-process) in-memory shuffle
> ---------------------------------------------------
>
>                 Key: SPARK-15690
>                 URL: https://issues.apache.org/jira/browse/SPARK-15690
>             Project: Spark
>          Issue Type: New Feature
>          Components: Shuffle, SQL
>            Reporter: Reynold Xin
>
> An increasing number of Spark users are using the system to process data on a
> single node. When operating on a single node against intermediate data that
> fits in memory, the existing shuffle code path can become a big bottleneck.
> Ideally, Spark should be able to use in-memory radix sort to do data
> shuffling on a single node
[jira] [Comment Edited] (SPARK-15582) Support for Groovy closures
[ https://issues.apache.org/jira/browse/SPARK-15582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308036#comment-15308036 ]

Catalin Alexandru Zamfir edited comment on SPARK-15582 at 6/1/16 5:43 AM:
--------------------------------------------------------------------------

So here's the work-around; I wish there were a way not to do it this way. In short, we are:
- taking the script by text and using CompilationUnit (from Groovy) to compile the script to bytecode. This must be done on the driver node before evaluating the script;
- for each class from the above CU, we append it to the JAR on a distributed FS. We give each JAR a unique name, later passed on to addJar. This is just for isolation purposes so that each Script gets its own JAR;
- the resulting JAR is added to the Spark context using addJar;
- all closures must be dehydrated, which comes with its limitations, and nested closures don't seem to work;
- however, if you design your code around the problem of Groovy closures, using interfaces/abstract classes, and you extend or implement those classes in Groovy code, it seems there is no limitation on what you can do (eg. an AbstractFilter class/interface);

{noformat}
// Compile it
Long dateOfNow = DateTime.now ().getMillis ();
String nameOfScript = String.format ("ScriptOf%d", dateOfNow);
String pathOfJar = String.format ("distributed/fs/path/to/your/groovy-scripts/%s.jar", String.format ("JarOf%d", dateOfNow));
File archiveFile = new File (pathOfJar);
Files.createParentDirs (archiveFile);

// With resources
List compilationList = compileGroovyScript (nameOfScript, sourceCode);
try (JarArchiveOutputStream oneJar = new JarArchiveOutputStream (new FileOutputStream (new File (pathOfJar)))) {
    // For each compiled class, append its bytecode as a JAR entry
    for (Object compileClass : compilationList) {
        GroovyClass groovyClass = (GroovyClass) compileClass;
        JarArchiveEntry oneJarEntry = new JarArchiveEntry (String.format ("%s.class", groovyClass.getName ()));
        oneJarEntry.setSize (groovyClass.getBytes ().length);
        byte[] bytecodeOfClass = groovyClass.getBytes ();
        oneJar.putArchiveEntry (oneJarEntry);
        oneJar.write (bytecodeOfClass);
        oneJar.closeArchiveEntry ();
    }

    // End it up
    oneJar.finish ();
    oneJar.close ();
} catch (Exception e) {
    // Do something
}

// Append the JAR to the execution environment
sparkService.getSparkContext ().addJar (pathOfJar);

// GroovyShell.evaluate (scriptText, nameOfScript) below;
{noformat}

Any idea on how this can be improved? (eg. not using the addJar method and the requirement to dehydrate the Groovy closures) For now this works for us, after SPARK-13599 was fixed. However, the extra code required to make it work could be a part of Spark instead, maybe using the broadcast mechanism or having a shared "cache" of classes around the cluster from where all the executor nodes can find/load-up dynamically compiled classes (eg. Groovy byte-code).

was (Author: antauri):
So here's the work-around, wished there was a way not to do it this way. In short, we are:
- taking the script by text and using CompilationUnit (from Groovy) to compile the script to bytecode. This must be done on the driver node before evaluating the script;
- for each class from the above CU, we append it to the JAR on a distributed FS. We give each JAR an unique name, later passed on to addJar. This is just for isolation purposes so that each Script gets its own JAR;
- the resulting JAR is added to the Spark context using addJar;
- all closures must by dehydrated () which comes with its limitations and also nested closures don't seem to work;
- however, if you design your code around the problem of Groovy closures, using interfaces/abstract classes and you extend or implement those classes in Groovy code it seems there is no limitation on what you can do (eg. an AbstractFilter class/interface);
{noformat}
// Compile it
Long dateOfNow = DateTime.now ().getMillis ();
String nameOfScript = String.format ("ScriptOf%d", dateOfNow);
String pathOfJar = String.format ("distributed/fs/path/to/your/groovy-scripts/%s.jar", String.format ("JarOf%d", dateOfNow));
File archiveFile = new File (pathOfJar);
Files.createParentDirs (archiveFile);
// With resources
List compilationList = compileGroovyScript (nameOfScript, sourceCode);
try (JarArchiveOutputStream oneJar = new JarArchiveOutputStream (new FileOutputStream (new File (pathOfJar { // For for (Object compileClass : compilationList) { // Append
[jira] [Created] (SPARK-15690) Fast single-node in-memory shuffle
Reynold Xin created SPARK-15690:
-----------------------------------
             Summary: Fast single-node in-memory shuffle
                 Key: SPARK-15690
                 URL: https://issues.apache.org/jira/browse/SPARK-15690
             Project: Spark
          Issue Type: New Feature
          Components: Shuffle, SQL
            Reporter: Reynold Xin

An increasing number of Spark users are using the system to process data on a single node. When operating on a single node against intermediate data that fits in memory, the existing shuffle code path can become a big bottleneck. Ideally, Spark should be able to use in-memory radix sort to do data shuffling on a single node
[jira] [Commented] (SPARK-14343) Dataframe operations on a partitioned dataset (using partition discovery) return invalid results
[ https://issues.apache.org/jira/browse/SPARK-14343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309317#comment-15309317 ]

Apache Spark commented on SPARK-14343:
--------------------------------------

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/13431
[jira] [Assigned] (SPARK-14343) Dataframe operations on a partitioned dataset (using partition discovery) return invalid results
[ https://issues.apache.org/jira/browse/SPARK-14343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14343:
------------------------------------
    Assignee:     (was: Apache Spark)
[jira] [Assigned] (SPARK-14343) Dataframe operations on a partitioned dataset (using partition discovery) return invalid results
[ https://issues.apache.org/jira/browse/SPARK-14343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14343:
------------------------------------
    Assignee: Apache Spark
[jira] [Comment Edited] (SPARK-15582) Support for Groovy closures
[ https://issues.apache.org/jira/browse/SPARK-15582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308036#comment-15308036 ]

Catalin Alexandru Zamfir edited comment on SPARK-15582 at 6/1/16 5:41 AM:
--------------------------------------------------------------------------

So here's the work-around; I wish there were a way not to do it this way. In short, we are:
- taking the script by text and using CompilationUnit (from Groovy) to compile the script to bytecode. This must be done on the driver node before evaluating the script;
- for each class from the above CU, we append it to the JAR on a distributed FS. We give each JAR a unique name, later passed on to addJar. This is just for isolation purposes so that each Script gets its own JAR;
- the resulting JAR is added to the Spark context using addJar;
- all closures must be dehydrated, which comes with its limitations, and nested closures don't seem to work;
- however, if you design your code around the problem of Groovy closures, using interfaces/abstract classes, and you extend or implement those classes in Groovy code, it seems there is no limitation on what you can do (eg. an AbstractFilter class/interface);

{noformat}
// Compile it
Long dateOfNow = DateTime.now ().getMillis ();
String nameOfScript = String.format ("ScriptOf%d", dateOfNow);
String pathOfJar = String.format ("distributed/fs/path/to/your/groovy-scripts/%s.jar", String.format ("JarOf%d", dateOfNow));
File archiveFile = new File (pathOfJar);
Files.createParentDirs (archiveFile);

// With resources
List compilationList = compileGroovyScript (nameOfScript, sourceCode);
try (JarArchiveOutputStream oneJar = new JarArchiveOutputStream (new FileOutputStream (new File (pathOfJar)))) {
    // For each compiled class, append its bytecode as a JAR entry
    for (Object compileClass : compilationList) {
        GroovyClass groovyClass = (GroovyClass) compileClass;
        JarArchiveEntry oneJarEntry = new JarArchiveEntry (String.format ("%s.class", groovyClass.getName ()));
        oneJarEntry.setSize (groovyClass.getBytes ().length);
        byte[] bytecodeOfClass = groovyClass.getBytes ();
        oneJar.putArchiveEntry (oneJarEntry);
        oneJar.write (bytecodeOfClass);
        oneJar.closeArchiveEntry ();
    }

    // End it up
    oneJar.finish ();
    oneJar.close ();
} catch (Exception e) {
    // Do something
}

// Append the JAR to the execution environment
sparkService.getSparkContext ().addJar (pathOfJar);

// GroovyShell.evaluate (scriptText, nameOfScript) below;
{noformat}

Any idea on how this can be improved? (eg. not using the addJar method and the requirement to dehydrate the Groovy closures) For now this works for us, after SPARK-13599 was fixed. However, the extra code required to make it work could be a part of Spark instead.

was (Author: antauri):
So here's the work-around, wished there was a way not to do it this way. In short, we are:
- taking the script by text and using CompilationUnit (from Groovy) to compile the script to bytecode. This must be done on the driver node before evaluating the script;
- for each class from the above CU, we append it to the JAR on a distributed FS. We give each JAR an unique name, later passed on to addJar. This is just for isolation purposes so that each Script gets its own JAR;
- the resulting JAR is added to the Spark context using addJar;
- all closures must by dehydrated () which comes with its limitations and also nested closures don't seem to work;
- however, if you design your code around the problem of Groovy closures, using interfaces/abstract classes and you extend or implement those classes in Groovy code it seems there is no limitation on what you can do (eg. an AbstractFilter class/interface);
{noformat}
// Compile it
Long dateOfNow = DateTime.now ().getMillis ();
String nameOfScript = String.format ("ScriptOf%d", dateOfNow);
String pathOfJar = String.format ("distributed/fs/path/to/your/groovy-scripts/%s.jar", String.format ("JarOf%d", dateOfNow));
File archiveFile = new File (pathOfJar);
Files.createParentDirs (archiveFile);
// With resources
List compilationList = compileGroovyScript (nameOfScript, sourceCode);
try (JarArchiveOutputStream oneJar = new JarArchiveOutputStream (new FileOutputStream (new File (pathOfJar { // For for (Object compileClass : compilationList) { // Append GroovyClass groovyClass = (GroovyClass) compileClass; JarArchiveEntry oneJarEntry = new JarArchiveEntry (String.format ("%s.class", groovyClass.getName ()));
[jira] [Closed] (SPARK-11448) We should skip caching part-files in ParquetRelation when configured to merge schema and respect summaries
[ https://issues.apache.org/jira/browse/SPARK-11448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-11448. --- Resolution: Auto Closed > We should skip caching part-files in ParquetRelation when configured to merge > schema and respect summaries > -- > > Key: SPARK-11448 > URL: https://issues.apache.org/jira/browse/SPARK-11448 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > We currently cache part-files, metadata, and common metadata in ParquetRelation as > currentLeafStatuses. However, when configured to merge schema and respect > summaries, dataStatuses (`FileStatus` objects of all part-files) are not > necessary anymore. We should skip them when caching on the driver side. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
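The filtering step SPARK-11448 proposes can be sketched as plain list logic. This is a minimal illustration only; the class and method names below are invented for the example and do not correspond to Spark's actual ParquetRelation internals:

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch: when schema merging is on and summary files are respected,
// only the _metadata / _common_metadata summaries need caching on the driver;
// the per-part-file statuses can be skipped.
public class LeafStatusFilter {
    static List<String> statusesToCache(List<String> leafFiles, boolean respectSummaries) {
        if (!respectSummaries) {
            return leafFiles; // cache everything, as before
        }
        // Keep only the summary files; skip the data part-files.
        return leafFiles.stream()
                        .filter(f -> f.endsWith("_metadata"))
                        .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("part-00000.parquet", "_metadata", "_common_metadata");
        System.out.println(statusesToCache(files, true)); // prints [_metadata, _common_metadata]
    }
}
```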
[jira] [Comment Edited] (SPARK-15582) Support for Groovy closures
[ https://issues.apache.org/jira/browse/SPARK-15582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308036#comment-15308036 ] Catalin Alexandru Zamfir edited comment on SPARK-15582 at 6/1/16 5:36 AM: --
So here's the work-around; we wish there were a way to avoid doing it this way. In short, we are:
- taking the script text and using CompilationUnit (from Groovy) to compile the script to bytecode. This must be done on the driver node before evaluating the script;
- for each class from the above CU, appending it to a JAR on a distributed FS. We give each JAR a unique name, later passed on to addJar. This is just for isolation purposes, so that each script gets its own JAR;
- adding the resulting JAR to the Spark context using addJar;
- dehydrating all closures, which comes with its limitations; nested closures also don't seem to work;
- however, if you design your code around the problem of Groovy closures, using interfaces/abstract classes, and you extend or implement those classes in Groovy code, there seems to be no limitation on what you can do (e.g. an AbstractFilter class/interface);
{noformat}
// Compile it
Long dateOfNow = DateTime.now ().getMillis ();
String nameOfScript = String.format ("ScriptOf%d", dateOfNow);
String pathOfJar = String.format ("distributed/fs/path/to/your/groovy-scripts/%s.jar", String.format ("JarOf%d", dateOfNow));
File archiveFile = new File (pathOfJar);
Files.createParentDirs (archiveFile);
// With resources
List compilationList = compileGroovyScript (nameOfScript, sourceCode);
try (JarArchiveOutputStream oneJar = new JarArchiveOutputStream (new FileOutputStream (new File (pathOfJar)))) {
    // For each compiled class
    for (Object compileClass : compilationList) {
        // Append
        GroovyClass groovyClass = (GroovyClass) compileClass;
        JarArchiveEntry oneJarEntry = new JarArchiveEntry (String.format ("%s.class", groovyClass.getName ()));
        oneJarEntry.setSize (groovyClass.getBytes ().length);
        byte[] bytecodeOfClass = groovyClass.getBytes ();
        oneJar.putArchiveEntry (oneJarEntry);
        oneJar.write (bytecodeOfClass);
        oneJar.closeArchiveEntry ();
    }
    // End it up
    oneJar.finish ();
    oneJar.close ();
} catch (Exception e) {
    // Do something
}
// Append the JAR to the execution environment
sparkService.getSparkContext ().addJar (pathOfJar);
// GroovyShell.evaluate (scriptText, nameOfScript) below;
{noformat}
Any idea on how this can be improved? (e.g. not using the addJar method, and lifting the requirement to dehydrate the Groovy closures)
was (Author: antauri): So here's the work-around; we wish there were a way to avoid doing it this way.
In short, we are:
- taking the script text and using CompilationUnit (from Groovy) to compile the script to bytecode;
- appending each class to the JAR;
- adding the resulting JAR to the Spark context using addJar;
- dehydrating all closures, which comes with its limitations;
{noformat}
// Compile it
Long dateOfNow = DateTime.now ().getMillis ();
String nameOfScript = String.format ("ScriptOf%d", dateOfNow);
String pathOfJar = String.format ("distributed/fs/path/to/your/groovy-scripts/%s.jar", String.format ("JarOf%d", dateOfNow));
File archiveFile = new File (pathOfJar);
Files.createParentDirs (archiveFile);
// With resources
List compilationList = compileGroovyScript (nameOfScript, sourceCode);
try (JarArchiveOutputStream oneJar = new JarArchiveOutputStream (new FileOutputStream (new File (pathOfJar)))) {
    // For each compiled class
    for (Object compileClass : compilationList) {
        // Append
        GroovyClass groovyClass = (GroovyClass) compileClass;
        JarArchiveEntry oneJarEntry = new JarArchiveEntry (String.format ("%s.class", groovyClass.getName ()));
        oneJarEntry.setSize (groovyClass.getBytes ().length);
        byte[] bytecodeOfClass = groovyClass.getBytes ();
        oneJar.putArchiveEntry (oneJarEntry);
        oneJar.write (bytecodeOfClass);
        oneJar.closeArchiveEntry ();
    }
    // End it up
    oneJar.finish ();
    oneJar.close ();
} catch (Exception e) {
    // Do something
}
// Append the JAR to the execution environment
sparkService.getSparkContext ().addJar (pathOfJar);
// GroovyShell.evaluate (scriptText, nameOfScript) below;
{noformat}
Any idea on how this can be improved? (e.g. not using the
[jira] [Created] (SPARK-15689) Data source API v2
Reynold Xin created SPARK-15689: --- Summary: Data source API v2 Key: SPARK-15689 URL: https://issues.apache.org/jira/browse/SPARK-15689 Project: Spark Issue Type: New Feature Components: SQL Reporter: Reynold Xin This ticket tracks progress in creating v2 of the data source API. This new API should focus on: 1. Having a small surface, so it is easy to freeze and maintain compatibility for a long time. Ideally, this API should survive architectural rewrites and user-facing API revamps of Spark. 2. Having a well-defined column batch interface for high performance. Convenience methods should exist to convert row-oriented formats into column batches for data source developers. 3. Still supporting filter push down, similar to the existing API. Note that both 1 and 2 are problems the current data source API (v1) suffers from. The current data source API has a wide surface and depends on DataFrame/SQLContext, making data source API compatibility dependent on the upper-level API. The current data source API is also row-oriented only and has to go through an expensive conversion from external to internal data types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
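The three goals above can be made concrete with a toy sketch. Everything below is invented for illustration (interface and method names included) and is not Spark's actual v2 API; it only shows what a small-surface, column-batch-oriented reader with filter push-down could look like:

```java
// Illustrative-only sketch of a "small surface" columnar source API; all names
// here are hypothetical and are not Spark's real data source v2 interfaces.
import java.util.*;

interface ColumnBatch {
    int numRows();
    int[] intColumn(int ordinal); // column-oriented access: one array per column
}

interface BatchReader {
    // Filter push-down: the source returns the predicates it could NOT handle.
    List<String> pushFilters(List<String> predicates);
    Iterator<ColumnBatch> readBatches();
}

public class DataSourceV2Sketch {
    // A toy in-memory source: one batch, one int column, pushes nothing down.
    static BatchReader inMemory(int[] col) {
        return new BatchReader() {
            public List<String> pushFilters(List<String> predicates) {
                return predicates; // all predicates left for the engine to evaluate
            }
            public Iterator<ColumnBatch> readBatches() {
                ColumnBatch batch = new ColumnBatch() {
                    public int numRows() { return col.length; }
                    public int[] intColumn(int ordinal) { return col; }
                };
                return Collections.singletonList(batch).iterator();
            }
        };
    }

    public static void main(String[] args) {
        BatchReader r = inMemory(new int[] {1, 2, 3});
        int sum = 0;
        for (Iterator<ColumnBatch> it = r.readBatches(); it.hasNext(); ) {
            for (int v : it.next().intColumn(0)) sum += v;
        }
        System.out.println(sum); // 6
    }
}
```

Because a source implements only these two tiny interfaces, with no dependency on DataFrame/SQLContext, such a contract could survive upper-level API rewrites, which is the point of goal 1.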
[jira] [Closed] (SPARK-14827) Spark SQL run on Hive table reports The specified datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH.
[ https://issues.apache.org/jira/browse/SPARK-14827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang closed SPARK-14827. -- Resolution: Won't Fix > Spark SQL run on Hive table reports The specified datastore driver > ("com.mysql.jdbc.Driver") was not found in the CLASSPATH. > > > Key: SPARK-14827 > URL: https://issues.apache.org/jira/browse/SPARK-14827 > Project: Spark > Issue Type: Question > Components: SQL, YARN >Affects Versions: 1.5.2 > Environment: Hadoop-YARN 2.6.0 Spark-1.5.2 Hive-2.0.0 >Reporter: Wei Chen > Labels: hive, mysql, sql > > I have set up a YARN cluster and Hive. Hive uses mysql as its metadata store. > Hive works fine with the MR engine. But when I ran Spark SQL on a table that was > already created, using the following command: > ./spark-sql --master yarn --database hivedb -f tpch_query1 > The error message is like: > WARN Connection: BoneCP specified but not present in CLASSPATH (or one of > dependencies) > Exception in thread "main" java.lang.RuntimeException: > java.lang.RuntimeException: Unable to instantiate > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient > at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522) > at > org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:171) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:408) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader.liftedTree1$1(IsolatedClientLoader.scala:183) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader.(IsolatedClientLoader.scala:179) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:226) > at > 
org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185) > at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:392) > at > org.apache.spark.sql.hive.HiveContext.defaultOverrides(HiveContext.scala:174) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:177) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.lang.RuntimeException: Unable to instantiate > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient > at > org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1523) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:86) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104) > at > org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005) > at 
org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024) > at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503) > ... 25 more > Caused by: java.lang.reflect.InvocationTargetException > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at
[jira] [Commented] (SPARK-14827) Spark SQL run on Hive table reports The specified datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH.
[ https://issues.apache.org/jira/browse/SPARK-14827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309304#comment-15309304 ] Jeff Zhang commented on SPARK-14827: It is not a bug; the exception makes clear that the JDBC driver jar is missing from the classpath. > Spark SQL run on Hive table reports The specified datastore driver > ("com.mysql.jdbc.Driver") was not found in the CLASSPATH. > > > Key: SPARK-14827 > URL: https://issues.apache.org/jira/browse/SPARK-14827 > Project: Spark > Issue Type: Question > Components: SQL, YARN >Affects Versions: 1.5.2 > Environment: Hadoop-YARN 2.6.0 Spark-1.5.2 Hive-2.0.0 >Reporter: Wei Chen > Labels: hive, mysql, sql > > I have set up a YARN cluster and Hive. Hive uses mysql as its metadata store. > Hive works fine with the MR engine. But when I ran Spark SQL on a table that was > already created, using the following command: > ./spark-sql --master yarn --database hivedb -f tpch_query1 > The error message is like: > WARN Connection: BoneCP specified but not present in CLASSPATH (or one of > dependencies) > Exception in thread "main" java.lang.RuntimeException: > java.lang.RuntimeException: Unable to instantiate > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient > at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522) > at > org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:171) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:408) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader.liftedTree1$1(IsolatedClientLoader.scala:183) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader.(IsolatedClientLoader.scala:179) > at > 
org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:226) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185) > at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:392) > at > org.apache.spark.sql.hive.HiveContext.defaultOverrides(HiveContext.scala:174) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:177) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.lang.RuntimeException: Unable to instantiate > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient > at > org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1523) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:86) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104) > at > 
org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005) > at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024) > at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503) > ... 25 more > Caused by: java.lang.reflect.InvocationTargetException > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at >
[jira] [Updated] (SPARK-15687) Columnar execution engine
[ https://issues.apache.org/jira/browse/SPARK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15687: Description: This ticket tracks progress in making the entire engine columnar, especially in the context of nested data type support. In Spark 2.0, we have used the internal column batch interface in Parquet reading (via a vectorized Parquet decoder) and low cardinality aggregation. Other parts of the engine are already using whole-stage code generation, which is in many ways more efficient than a columnar execution engine for flat data types. The goal here is to figure out a story to work towards making column batch the common data exchange format between operators outside whole-stage code generation, as well as with external systems (e.g. Pandas). Some of the important questions to answer are: From the architectural perspective: - What is the end state architecture? - Should aggregation be columnar? - Should sorting be columnar? - How do we encode nested data? What are the operations on nested data, and how do we handle these operations in a columnar format? - What is the transition plan towards the end state? From an external API perspective: - Can we expose a more efficient column batch user-defined function API? - How do we leverage this to integrate with 3rd party tools? - Can we have a spec for a fixed version of the column batch format that can be externalized and use that in data source API v2? was: This ticket tracks progress in making the entire engine columnar, especially in the context of nested data type support. In Spark 2.0, we have used the internal column batch interface in Parquet reading (via a vectorized Parquet decoder) and low cardinality aggregation. Other parts of the engine are already using whole-stage code generation, which is in many ways more efficient than a columnar execution engine for flat data types. 
The goal here is to figure out a story to work towards making column batch the common data exchange format between operators outside whole-stage code generation, as well as with external systems (e.g. Pandas). Some of the important questions to answer are: - What is the end state architecture? - Should aggregation be columnar? - Should sorting be columnar? - How do we encode nested data? What are the operations on nested data, and how do we handle these operations in a columnar format? - What is the transition plan towards the end state? > Columnar execution engine > - > > Key: SPARK-15687 > URL: https://issues.apache.org/jira/browse/SPARK-15687 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > This ticket tracks progress in making the entire engine columnar, especially > in the context of nested data type support. > In Spark 2.0, we have used the internal column batch interface in Parquet > reading (via a vectorized Parquet decoder) and low cardinality aggregation. > Other parts of the engine are already using whole-stage code generation, > which is in many ways more efficient than a columnar execution engine for > flat data types. > The goal here is to figure out a story to work towards making column batch > the common data exchange format between operators outside whole-stage code > generation, as well as with external systems (e.g. Pandas). > Some of the important questions to answer are: > From the architectural perspective: > - What is the end state architecture? > - Should aggregation be columnar? > - Should sorting be columnar? > - How do we encode nested data? What are the operations on nested data, and > how do we handle these operations in a columnar format? > - What is the transition plan towards the end state? > From an external API perspective: > - Can we expose a more efficient column batch user-defined function API? > - How do we leverage this to integrate with 3rd party tools? 
> - Can we have a spec for a fixed version of the column batch format that can > be externalized and use that in data source API v2? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
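One of the open questions above is how to encode nested data in a columnar format. A common answer (the one Arrow-style formats use) is an offsets-plus-values encoding; the sketch below is illustrative only and is not Spark's internal representation:

```java
import java.util.*;

// One common columnar encoding for nested data: a list-of-ints column stored as
// a flat values array plus an offsets array. Row i's list lives at
// values[offsets[i] .. offsets[i+1]). This is an illustration of the idea only.
public class NestedColumnSketch {
    final int[] offsets; // length = numRows + 1
    final int[] values;  // all list elements, concatenated

    NestedColumnSketch(int[] offsets, int[] values) {
        this.offsets = offsets;
        this.values = values;
    }

    static NestedColumnSketch fromLists(List<int[]> rows) {
        int[] offsets = new int[rows.size() + 1];
        int total = 0;
        for (int i = 0; i < rows.size(); i++) {
            total += rows.get(i).length;
            offsets[i + 1] = total;
        }
        int[] values = new int[total];
        int pos = 0;
        for (int[] row : rows) for (int v : row) values[pos++] = v;
        return new NestedColumnSketch(offsets, values);
    }

    int[] row(int i) { // materialize one row's list back out of the column
        return Arrays.copyOfRange(values, offsets[i], offsets[i + 1]);
    }

    public static void main(String[] args) {
        NestedColumnSketch col =
            fromLists(Arrays.asList(new int[] {1, 2}, new int[] {}, new int[] {3}));
        System.out.println(Arrays.toString(col.row(0))); // [1, 2]
        System.out.println(Arrays.toString(col.row(2))); // [3]
    }
}
```

Operations such as filters or explode can then run over the flat values array; the offsets array is what makes per-row list boundaries recoverable.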
[jira] [Commented] (SPARK-15688) RelationalGroupedDataset.toDF should not add group by expressions that are already added in the aggregate expressions.
[ https://issues.apache.org/jira/browse/SPARK-15688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309299#comment-15309299 ] Xiao Li commented on SPARK-15688: - Yeah, will ask my teammate to do it ASAP. Thanks! > RelationalGroupedDataset.toDF should not add group by expressions that are > already added in the aggregate expressions. > -- > > Key: SPARK-15688 > URL: https://issues.apache.org/jira/browse/SPARK-15688 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > For {{df.groupBy("col").agg($"col", count("*"))}}, it is kind of weird to > have col appearing twice in the result. Seems we can avoid outputting group-by > expressions twice if they are already part of {{agg}}. Looks like > RelationalGroupedDataset.toDF is the place to change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15687) Columnar execution engine
[ https://issues.apache.org/jira/browse/SPARK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15687: Description: This ticket tracks progress in making the entire engine columnar, especially in the context of nested data type support. In Spark 2.0, we have used the internal column batch interface in Parquet reading (via a vectorized Parquet decoder) and low cardinality aggregation. Other parts of the engine are already using whole-stage code generation, which is in many ways more efficient than a columnar execution engine for flat data types. The goal here is to figure out a story to work towards making column batch the common data exchange format between operators outside whole-stage code generation, as well as with external systems (e.g. Pandas). Some of the important questions to answer are: - What is the end state architecture? - Should aggregation be columnar? - Should sorting be columnar? - How do we encode nested data? What are the operations on nested data, and how do we handle these operations in a columnar format? - What is the transition plan towards the end state? was: This ticket tracks progress in making the entire engine columnar, especially in the context of nested data type support. In Spark 2.0, we have used the internal column batch interface in Parquet reading (via a vectorized Parquet decoder) and low cardinality aggregation. Other parts of the engine are already using whole-stage code generation, which is in many ways more efficient than a columnar execution engine for flat data types. The goal here is to figure out a story to work towards making column batch the common data exchange format between operators outside whole-stage code generation, as well as with external systems (e.g. Pandas). Some of the important questions to answer are: - What is the end state architecture? - Should aggregation be columnar? - Should sorting be columnar? - How do we handle nested data types? 
- What is the transition plan towards the end state? > Columnar execution engine > - > > Key: SPARK-15687 > URL: https://issues.apache.org/jira/browse/SPARK-15687 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > This ticket tracks progress in making the entire engine columnar, especially > in the context of nested data type support. > In Spark 2.0, we have used the internal column batch interface in Parquet > reading (via a vectorized Parquet decoder) and low cardinality aggregation. > Other parts of the engine are already using whole-stage code generation, > which is in many ways more efficient than a columnar execution engine for > flat data types. > The goal here is to figure out a story to work towards making column batch > the common data exchange format between operators outside whole-stage code > generation, as well as with external systems (e.g. Pandas). > Some of the important questions to answer are: > - What is the end state architecture? > - Should aggregation be columnar? > - Should sorting be columnar? > - How do we encode nested data? What are the operations on nested data, and > how do we handle these operations in a columnar format? > - What is the transition plan towards the end state? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15688) RelationalGroupedDataset.toDF should not add group by expressions that are already added in the aggregate expressions.
[ https://issues.apache.org/jira/browse/SPARK-15688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309293#comment-15309293 ] Yin Huai commented on SPARK-15688: -- [~smilegator] Can anyone from your side take this? > RelationalGroupedDataset.toDF should not add group by expressions that are > already added in the aggregate expressions. > -- > > Key: SPARK-15688 > URL: https://issues.apache.org/jira/browse/SPARK-15688 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > For {{df.groupBy("col").agg($"col", count("*"))}}, it is kind of weird to > have col appearing twice in the result. Seems we can avoid outputting group-by > expressions twice if they are already part of {{agg}}. Looks like > RelationalGroupedDataset.toDF is the place to change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15687) Columnar execution engine
[ https://issues.apache.org/jira/browse/SPARK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15687: Summary: Columnar execution engine (was: Fully columnar execution engine) > Columnar execution engine > - > > Key: SPARK-15687 > URL: https://issues.apache.org/jira/browse/SPARK-15687 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > This ticket tracks progress in making the entire engine columnar, especially > in the context of nested data type support. > In Spark 2.0, we have used the internal column batch interface in Parquet > reading (via a vectorized Parquet decoder) and low cardinality aggregation. > Other parts of the engine are already using whole-stage code generation, > which is in many ways more efficient than a columnar execution engine for > flat data types. > The goal here is to figure out a story to work towards making column batch > the common data exchange format between operators outside whole-stage code > generation, as well as with external systems (e.g. Pandas). > Some of the important questions to answer are: > - What is the end state architecture? > - Should aggregation be columnar? > - Should sorting be columnar? > - How do we handle nested data types? > - What is the transition plan towards the end state? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15687) Columnar execution engine
[ https://issues.apache.org/jira/browse/SPARK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15687: Priority: Critical (was: Major) > Columnar execution engine > - > > Key: SPARK-15687 > URL: https://issues.apache.org/jira/browse/SPARK-15687 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > This ticket tracks progress in making the entire engine columnar, especially > in the context of nested data type support. > In Spark 2.0, we have used the internal column batch interface in Parquet > reading (via a vectorized Parquet decoder) and low cardinality aggregation. > Other parts of the engine are already using whole-stage code generation, > which is in many ways more efficient than a columnar execution engine for > flat data types. > The goal here is to figure out a story to work towards making column batch > the common data exchange format between operators outside whole-stage code > generation, as well as with external systems (e.g. Pandas). > Some of the important questions to answer are: > - What is the end state architecture? > - Should aggregation be columnar? > - Should sorting be columnar? > - How do we handle nested data types? > - What is the transition plan towards the end state? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15688) RelationalGroupedDataset.toDF should not add group by expressions that are already added in the aggregate expressions.
Yin Huai created SPARK-15688: Summary: RelationalGroupedDataset.toDF should not add group by expressions that are already added in the aggregate expressions. Key: SPARK-15688 URL: https://issues.apache.org/jira/browse/SPARK-15688 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai For {{df.groupBy("col").agg($"col", count("*"))}}, it is kind of weird to have col appearing twice in the result. Seems we can avoid outputting group-by expressions twice if they are already part of {{agg}}. Looks like RelationalGroupedDataset.toDF is the place to change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
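The de-duplication rule this ticket asks for can be sketched as plain list logic (names below are illustrative; the real change would live in RelationalGroupedDataset.toDF and operate on expressions, not strings):

```java
import java.util.*;

// Sketch of the proposed rule: prepend only those group-by columns that the
// aggregate list does not already contain, so no column appears twice.
public class GroupByDedup {
    static List<String> outputColumns(List<String> groupBy, List<String> aggregates) {
        List<String> out = new ArrayList<>();
        for (String g : groupBy) {
            if (!aggregates.contains(g)) out.add(g); // skip duplicates of agg columns
        }
        out.addAll(aggregates);
        return out;
    }

    public static void main(String[] args) {
        // df.groupBy("col").agg($"col", count("*")): "col" should appear only once.
        System.out.println(outputColumns(Arrays.asList("col"), Arrays.asList("col", "count(1)")));
        // prints [col, count(1)]
    }
}
```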
[jira] [Updated] (SPARK-15687) Fully columnar execution engine
[ https://issues.apache.org/jira/browse/SPARK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15687: Summary: Fully columnar execution engine (was: Columnar execution engine) > Fully columnar execution engine > --- > > Key: SPARK-15687 > URL: https://issues.apache.org/jira/browse/SPARK-15687 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > This ticket tracks progress in making the entire engine columnar, especially > in the context of nested data type support. > In Spark 2.0, we have used the internal column batch interface in Parquet > reading (via a vectorized Parquet decoder) and low cardinality aggregation. > Other parts of the engine are already using whole-stage code generation, > which is in many ways more efficient than a columnar execution engine for > flat data types. > The goal here is to figure out a story to work towards making column batch > the common data exchange format between operators outside whole-stage code > generation, as well as with external systems (e.g. Pandas). > Some of the important questions to answer are: > - What is the end state architecture? > - Should aggregation be columnar? > - Should sorting be columnar? > - How do we handle nested data types? > - What is the transition plan towards the end state? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15687) Columnar execution engine
Reynold Xin created SPARK-15687: --- Summary: Columnar execution engine Key: SPARK-15687 URL: https://issues.apache.org/jira/browse/SPARK-15687 Project: Spark Issue Type: New Feature Components: SQL Reporter: Reynold Xin This ticket tracks progress in making the entire engine columnar, especially in the context of nested data type support. In Spark 2.0, we have used the internal column batch interface in Parquet reading (via a vectorized Parquet decoder) and low cardinality aggregation. Other parts of the engine are already using whole-stage code generation, which is in many ways more efficient than a columnar execution engine for flat data types. The goal here is to figure out a story to work towards making column batch the common data exchange format between operators outside whole-stage code generation, as well as with external systems (e.g. Pandas). Some of the important questions to answer are: - What is the end state architecture? - Should aggregation be columnar? - Should sorting be columnar? - How do we handle nested data types? - What is the transition plan towards the end state? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15685) StackOverflowError (VirtualMachineError) or NoClassDefFoundError (LinkageError) should not System.exit() in local mode
[ https://issues.apache.org/jira/browse/SPARK-15685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brett Randall updated SPARK-15685: -- Description: Spark, when running in local mode, can encounter certain types of {{Error}} exceptions in developer-code or third-party libraries and call {{System.exit()}}, potentially killing a long-running JVM/service. The caller should decide on the exception-handling and whether the error should be deemed fatal. *Consider this scenario:* * Spark is being used in local master mode within a long-running JVM microservice, e.g. a Jetty instance. * A task is run. The task errors with particular types of unchecked throwables: ** a) there is some bad code and/or bad data that exposes a bug where there's unterminated recursion, leading to a {{StackOverflowError}}, or ** b) a particular, rarely used function is called - there's a packaging error with the service or a third-party library is missing some dependencies, and a {{NoClassDefFoundError}} is thrown. *Expected behaviour:* Since we are running in local mode, we might expect some unchecked exception to be thrown, to be optionally-handled by the Spark caller. In the case of Jetty, a request thread or some other background worker thread might handle the exception or not, the thread might exit or note an error. The caller should decide how the error is handled. *Actual behaviour:* {{System.exit()}} is called, the JVM exits and the Jetty microservice is down and must be restarted. *Consequence:* Any local code or third-party library might cause Spark to exit the long-running JVM/microservice, so Spark can be a problem in this architecture. I have seen this now on three separate occasions, leading to service-down bug reports. 
*Analysis:* The line of code that seems to be the problem is: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L405 {code} // Don't forcibly exit unless the exception was inherently fatal, to avoid // stopping other tasks unnecessarily. if (Utils.isFatalError(t)) { SparkUncaughtExceptionHandler.uncaughtException(t) } {code} [Utils.isFatalError()|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L1818] first excludes Scala [NonFatal|https://github.com/scala/scala/blob/2.12.x/src/library/scala/util/control/NonFatal.scala#L31], which excludes everything except {{VirtualMachineError}}, {{ThreadDeath}}, {{InterruptedException}}, {{LinkageError}} and {{ControlThrowable}}. {{Utils.isFatalError()}} further excludes {{InterruptedException}}, {{NotImplementedError}} and {{ControlThrowable}}. Remaining are {{Error}}s such as {{StackOverflowError extends VirtualMachineError}} or {{NoClassDefFoundError extends LinkageError}}, which occur in the aforementioned scenarios. {{SparkUncaughtExceptionHandler.uncaughtException()}} proceeds to call {{System.exit()}}. [Further up|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L77] in {{Executor}} we see exclusions for registering {{SparkUncaughtExceptionHandler}} if in local mode: {code} if (!isLocal) { // Setup an uncaught exception handler for non-local mode. // Make any thread terminations due to uncaught exceptions kill the entire // executor process to avoid surprising stalls. Thread.setDefaultUncaughtExceptionHandler(SparkUncaughtExceptionHandler) } {code} This same exclusion must be applied in local mode for "fatal" errors - we cannot afford to shut down the enclosing JVM (e.g. Jetty); the caller should decide. A minimal test-case is supplied. 
It installs a logging {{SecurityManager}} to confirm that {{System.exit()}} was called from {{SparkUncaughtExceptionHandler.uncaughtException}} via {{Executor}}. It also hints at the workaround - install your own {{SecurityManager}} and inspect the current stack in {{checkExit()}} to prevent Spark from exiting the JVM. Test-case: https://github.com/javabrett/SPARK-15685 . was: Spark, when running in local mode, can encounter certain types of {{Error}} exceptions in developer-code or third-party libraries and call {{System.exit()}}, potentially killing a long-running JVM/service. The caller should decide on the exception-handling and whether the error should be deemed fatal. *Consider this scenario:* * Spark is being used in local master mode within a long-running JVM microservice, e.g. a Jetty instance. * A task is run. The task errors with particular types of unchecked throwables: ** a) there some bad code and/or bad data that exposes a bug where there's unterminated recursion, leading to a {{StackOverflowError}}, or ** b) a particular not-often used function is called - there's a packaging error with the service, a third-party library is missing some dependencies, a {{NoClassDefFoundError}} is found. *Expected
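The workaround hinted at above - install your own {{SecurityManager}} and inspect the current stack in {{checkExit()}} - can be sketched as follows. This is an illustrative guard for a Java 8-era service JVM, not Spark code; the class name is hypothetical:

```scala
// Illustrative guard: veto System.exit() when the call originates from
// Spark's uncaught-exception handler, instead of letting it kill the
// enclosing long-running JVM (e.g. a Jetty microservice).
class SparkExitGuard extends SecurityManager {
  override def checkExit(status: Int): Unit = {
    val fromSparkHandler = Thread.currentThread().getStackTrace.exists(
      _.getClassName.contains("SparkUncaughtExceptionHandler"))
    if (fromSparkHandler)
      throw new SecurityException(s"blocked System.exit($status) from Spark")
  }
  // Permit everything else; the default SecurityManager semantics would deny.
  override def checkPermission(perm: java.security.Permission): Unit = ()
}

// Installed once at service startup, e.g.:
// System.setSecurityManager(new SparkExitGuard)
```

The thrown {{SecurityException}} then surfaces on the offending thread like any other unchecked exception, leaving the caller to decide how to handle it.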
[jira] [Commented] (SPARK-15530) Partitioning discovery logic HadoopFsRelation should use a higher setting of parallelism
[ https://issues.apache.org/jira/browse/SPARK-15530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309261#comment-15309261 ] Takeshi Yamamuro commented on SPARK-15530: -- Oh, I see. Okay, then I'll check the logs and make pr later. > Partitioning discovery logic HadoopFsRelation should use a higher setting of > parallelism > > > Key: SPARK-15530 > URL: https://issues.apache.org/jira/browse/SPARK-15530 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > At > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala#L418, > we launch a spark job to do parallel file listing in order to discover > partitions. However, we do not set the number of partitions at here, which > means that we are using the default parallelism of the cluster. It is better > to set the number of partitions explicitly to generate smaller tasks, which > help load balancing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15032) When we create a new JDBC session, we may need to create a new session of executionHive
[ https://issues.apache.org/jira/browse/SPARK-15032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309253#comment-15309253 ] Yin Huai commented on SPARK-15032: -- I did a quick check today. It seems our current master is fine. When we open a JDBC session, we indeed create the session state correctly. I am closing this JIRA. > When we create a new JDBC session, we may need to create a new session of > executionHive > --- > > Key: SPARK-15032 > URL: https://issues.apache.org/jira/browse/SPARK-15032 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Critical > > Right now, we only use executionHive in thriftserver. When we create a new > jdbc session, we probably need to create a new session of executionHive. I am > not sure what will break if we leave the code as is. But, I feel it will be > safer to create a new session of executionHive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15032) When we create a new JDBC session, we may need to create a new session of executionHive
[ https://issues.apache.org/jira/browse/SPARK-15032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-15032. -- Resolution: Not A Problem > When we create a new JDBC session, we may need to create a new session of > executionHive > --- > > Key: SPARK-15032 > URL: https://issues.apache.org/jira/browse/SPARK-15032 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Critical > > Right now, we only use executionHive in thriftserver. When we create a new > jdbc session, we probably need to create a new session of executionHive. I am > not sure what will break if we leave the code as is. But, I feel it will be > safer to create a new session of executionHive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15530) Partitioning discovery logic HadoopFsRelation should use a higher setting of parallelism
[ https://issues.apache.org/jira/browse/SPARK-15530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309250#comment-15309250 ] Yin Huai commented on SPARK-15530: -- The doc of that conf says "The degree of parallelism for schema merging and partition discovery of Parquet data sources.". So seems it just means parallelism. I am not sure why the variable name has a "threshold" at the end. It will be great if you can check when we added that conf. Right now, I am inclined to use that conf to configure the parallelism (we need to make that conf an internal conf though). > Partitioning discovery logic HadoopFsRelation should use a higher setting of > parallelism > > > Key: SPARK-15530 > URL: https://issues.apache.org/jira/browse/SPARK-15530 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > At > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala#L418, > we launch a spark job to do parallel file listing in order to discover > partitions. However, we do not set the number of partitions at here, which > means that we are using the default parallelism of the cluster. It is better > to set the number of partitions explicitly to generate smaller tasks, which > help load balancing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15672) R programming guide update
[ https://issues.apache.org/jira/browse/SPARK-15672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309249#comment-15309249 ] Shivaram Venkataraman commented on SPARK-15672: --- Thanks [~GayathriMurali] - This PR can handle changes other than ML to the programming guide > R programming guide update > -- > > Key: SPARK-15672 > URL: https://issues.apache.org/jira/browse/SPARK-15672 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman >Priority: Blocker > > Update the programming guide (i.e. the document at > http://spark.apache.org/docs/latest/sparkr.html) to cover the major new > features in Spark 2.0. This will include > (a) UDFs with dapply, dapplyCollect > (b) group UDFs with gapply > (c) spark.lapply for running parallel R functions > (d) others ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15530) Partitioning discovery logic HadoopFsRelation should use a higher setting of parallelism
[ https://issues.apache.org/jira/browse/SPARK-15530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309246#comment-15309246 ] Takeshi Yamamuro commented on SPARK-15530: -- You suggest we should rename the value and change the default value? > Partitioning discovery logic HadoopFsRelation should use a higher setting of > parallelism > > > Key: SPARK-15530 > URL: https://issues.apache.org/jira/browse/SPARK-15530 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > At > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala#L418, > we launch a spark job to do parallel file listing in order to discover > partitions. However, we do not set the number of partitions at here, which > means that we are using the default parallelism of the cluster. It is better > to set the number of partitions explicitly to generate smaller tasks, which > help load balancing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15530) Partitioning discovery logic HadoopFsRelation should use a higher setting of parallelism
[ https://issues.apache.org/jira/browse/SPARK-15530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309240#comment-15309240 ] Yin Huai commented on SPARK-15530: -- Your change looks reasonable. How about we just take the value of sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold? We can change the doc of this conf and increase the default value. > Partitioning discovery logic HadoopFsRelation should use a higher setting of > parallelism > > > Key: SPARK-15530 > URL: https://issues.apache.org/jira/browse/SPARK-15530 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > At > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala#L418, > we launch a spark job to do parallel file listing in order to discover > partitions. However, we do not set the number of partitions at here, which > means that we are using the default parallelism of the cluster. It is better > to set the number of partitions explicitly to generate smaller tasks, which > help load balancing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
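The direction discussed in this thread - sizing the listing job explicitly rather than inheriting the cluster's default parallelism - might look like the following sketch (local session; the path list and the cap of 32 are illustrative stand-ins, not the actual fix):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("par-listing").getOrCreate()

// Stand-in for the leaf directories discovered under a partitioned table.
val paths = (1 to 1000).map(i => s"/data/year=2016/part=$i")

// Choose the slice count explicitly so tasks stay small and balance well,
// instead of relying on sparkContext.defaultParallelism.
val numSlices = math.min(paths.length, 32)
val listed = spark.sparkContext.parallelize(paths, numSlices)
```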
[jira] [Commented] (SPARK-15683) spark sql local FS spark.sql.warehouse.dir throws on YARN
[ https://issues.apache.org/jira/browse/SPARK-15683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309139#comment-15309139 ] Saisai Shao commented on SPARK-15683: - Hi [~tgraves], please see this JIRA SPARK-15659, I submitted a patch about it. > spark sql local FS spark.sql.warehouse.dir throws on YARN > - > > Key: SPARK-15683 > URL: https://issues.apache.org/jira/browse/SPARK-15683 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Thomas Graves >Priority: Critical > > I'm trying to use dataframes with spark 2.0. It was built with hive but when > I try to run a dataframe command I get the error: > 16/05/31 20:24:21 ERROR ApplicationMaster: User class threw exception: > java.lang.IllegalArgumentException: Wrong FS: > file:/grid/2/tmp/yarn-local/usercache/tgraves/appcache/application_1464289177693_1036410/container_e14_1464289177693_1036410_01_01/spark-warehouse, > expected: hdfs://nn1.com:8020 > java.lang.IllegalArgumentException: Wrong FS: > file:/grid/2/tmp/yarn-local/usercache/tgraves/appcache/application_1464289177693_1036410/container_e14_1464289177693_1036410_01_01/spark-warehouse, > expected: hdfs://nn1.com:8020 > at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:648) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:194) > at > org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:106) > at > org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1047) > at > org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1043) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:1061) > at > org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:1036) > at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1880) > 
at > org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.liftedTree1$1(InMemoryCatalog.scala:123) > at > org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.createDatabase(InMemoryCatalog.scala:122) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:142) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.(SessionCatalog.scala:84) > at > org.apache.spark.sql.internal.SessionState.catalog$lzycompute(SessionState.scala:94) > at > org.apache.spark.sql.internal.SessionState.catalog(SessionState.scala:94) > at > org.apache.spark.sql.internal.SessionState$$anon$1.(SessionState.scala:110) > at > org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:110) > at > org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:109) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62) > at > org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:371) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:154) > at > org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:419) > at yahoo.spark.SparkFlickrLargeJoin$.main(SparkFlickrLargeJoin.scala:26) > at yahoo.spark.SparkFlickrLargeJoin.main(SparkFlickrLargeJoin.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:617) > It seems https://issues.apache.org/jira/browse/SPARK-15565 change it to have > default local fs. Even before that it didn't work either, just different > error -> https://issues.apache.org/jira/browse/SPARK-15034. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
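Until the default is sorted out, one possible workaround (an assumption, not a confirmed fix) is to point {{spark.sql.warehouse.dir}} at a fully-qualified HDFS URI when building the session, so the local-FS default path never reaches the HDFS {{FileSystem.checkPath}} call; the URI below is illustrative:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical workaround: give spark.sql.warehouse.dir a qualified HDFS URI
// up front so InMemoryCatalog.createDatabase is not handed a file:/ path.
val spark = SparkSession.builder()
  .appName("warehouse-on-hdfs")
  .config("spark.sql.warehouse.dir", "hdfs://nn1.com:8020/user/tgraves/spark-warehouse")
  .getOrCreate()
```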
[jira] [Updated] (SPARK-12988) Can't drop columns that contain dots
[ https://issues.apache.org/jira/browse/SPARK-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-12988: - Assignee: Sean Zhong > Can't drop columns that contain dots > > > Key: SPARK-12988 > URL: https://issues.apache.org/jira/browse/SPARK-12988 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Michael Armbrust >Assignee: Sean Zhong > Fix For: 2.0.0 > > > Neither of these works: > {code} > val df = Seq((1, 1)).toDF("a_b", "a.c") > df.drop("a.c").collect() > df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int] > {code} > {code} > val df = Seq((1, 1)).toDF("a_b", "a.c") > df.drop("`a.c`").collect() > df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int] > {code} > Given that you can't use drop to drop subfields, it seems to me that we > should treat the column name literally (i.e. as though it is wrapped in back > ticks). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12988) Can't drop columns that contain dots
[ https://issues.apache.org/jira/browse/SPARK-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-12988. -- Resolution: Fixed Fix Version/s: 2.0.0 > Can't drop columns that contain dots > > > Key: SPARK-12988 > URL: https://issues.apache.org/jira/browse/SPARK-12988 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Michael Armbrust >Assignee: Sean Zhong > Fix For: 2.0.0 > > > Neither of these works: > {code} > val df = Seq((1, 1)).toDF("a_b", "a.c") > df.drop("a.c").collect() > df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int] > {code} > {code} > val df = Seq((1, 1)).toDF("a_b", "a.c") > df.drop("`a.c`").collect() > df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int] > {code} > Given that you can't use drop to drop subfields, it seems to me that we > should treat the column name literally (i.e. as though it is wrapped in back > ticks). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12988) Can't drop columns that contain dots
[ https://issues.apache.org/jira/browse/SPARK-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309135#comment-15309135 ] Yin Huai commented on SPARK-12988: -- This issue has been resolved by https://github.com/apache/spark/pull/13306. > Can't drop columns that contain dots > > > Key: SPARK-12988 > URL: https://issues.apache.org/jira/browse/SPARK-12988 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Michael Armbrust >Assignee: Sean Zhong > Fix For: 2.0.0 > > > Neither of these works: > {code} > val df = Seq((1, 1)).toDF("a_b", "a.c") > df.drop("a.c").collect() > df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int] > {code} > {code} > val df = Seq((1, 1)).toDF("a_b", "a.c") > df.drop("`a.c`").collect() > df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int] > {code} > Given that you can't use drop to drop subfields, it seems to me that we > should treat the column name literally (i.e. as though it is wrapped in back > ticks). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
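On versions where {{drop}} still misparses dotted names, one workaround consistent with the backtick suggestion in the report is to re-select every other column, quoting each name so the analyzer does not parse "a.c" as a struct-field access (sketch against a local session):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("drop-dots").getOrCreate()
import spark.implicits._

val df = Seq((1, 1)).toDF("a_b", "a.c")
// Select every column except the dotted one, wrapping each name in backticks
// so `a.c` is treated as a single literal column name.
val dropped = df.select(df.columns.filter(_ != "a.c").map(c => col(s"`$c`")): _*)
```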
[jira] [Commented] (SPARK-14381) Review spark.ml parity for feature transformers
[ https://issues.apache.org/jira/browse/SPARK-14381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309112#comment-15309112 ] Gayathri Murali commented on SPARK-14381: - I will work on this > Review spark.ml parity for feature transformers > --- > > Key: SPARK-14381 > URL: https://issues.apache.org/jira/browse/SPARK-14381 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all > functionality. List all missing items. > This only covers Scala since we can compare Scala vs. Python in spark.ml > itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15672) R programming guide update
[ https://issues.apache.org/jira/browse/SPARK-15672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309100#comment-15309100 ] Gayathri Murali commented on SPARK-15672: - [~shivaram] I am working on changing R documentation to include all changes that happened with ML. Here is the link to the PR : https://github.com/apache/spark/pull/13285 > R programming guide update > -- > > Key: SPARK-15672 > URL: https://issues.apache.org/jira/browse/SPARK-15672 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman >Priority: Blocker > > Update the programming guide (i.e. the document at > http://spark.apache.org/docs/latest/sparkr.html) to cover the major new > features in Spark 2.0. This will include > (a) UDFs with dapply, dapplyCollect > (b) group UDFs with gapply > (c) spark.lapply for running parallel R functions > (d) others ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15686) Move user-facing structured streaming classes into sql.streaming
[ https://issues.apache.org/jira/browse/SPARK-15686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309075#comment-15309075 ] Apache Spark commented on SPARK-15686: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/13429 > Move user-facing structured streaming classes into sql.streaming > > > Key: SPARK-15686 > URL: https://issues.apache.org/jira/browse/SPARK-15686 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15686) Move user-facing structured streaming classes into sql.streaming
[ https://issues.apache.org/jira/browse/SPARK-15686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15686: Assignee: Reynold Xin (was: Apache Spark) > Move user-facing structured streaming classes into sql.streaming > > > Key: SPARK-15686 > URL: https://issues.apache.org/jira/browse/SPARK-15686 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15686) Move user-facing structured streaming classes into sql.streaming
[ https://issues.apache.org/jira/browse/SPARK-15686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15686: Assignee: Apache Spark (was: Reynold Xin) > Move user-facing structured streaming classes into sql.streaming > > > Key: SPARK-15686 > URL: https://issues.apache.org/jira/browse/SPARK-15686 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15686) Move user-facing structured streaming classes into sql.streaming
Reynold Xin created SPARK-15686: --- Summary: Move user-facing structured streaming classes into sql.streaming Key: SPARK-15686 URL: https://issues.apache.org/jira/browse/SPARK-15686 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin
[jira] [Created] (SPARK-15685) StackOverflowError (VirtualMachineError) or NoClassDefFoundError (LinkageError) should not System.exit() in local mode
Brett Randall created SPARK-15685: - Summary: StackOverflowError (VirtualMachineError) or NoClassDefFoundError (LinkageError) should not System.exit() in local mode Key: SPARK-15685 URL: https://issues.apache.org/jira/browse/SPARK-15685 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.6.1 Reporter: Brett Randall Priority: Critical Spark, when running in local mode, can encounter certain types of {{Error}} exceptions in developer code or third-party libraries and call {{System.exit()}}, potentially killing a long-running JVM/service. The caller should decide on the exception handling and whether the error should be deemed fatal. *Consider this scenario:* * Spark is being used in local master mode within a long-running JVM microservice, e.g. a Jetty instance. * A task is run. The task fails with particular types of unchecked throwables: ** a) there is some bad code and/or bad data that exposes a bug with unterminated recursion, leading to a {{StackOverflowError}}, or ** b) a rarely used function is called - there is a packaging error with the service or a third-party library is missing some dependencies, and a {{NoClassDefFoundError}} is thrown. *Expected behaviour:* Since we are running in local mode, we might expect an unchecked exception to be thrown and optionally handled by the Spark caller. In the case of Jetty, a request thread or some other background worker thread might handle the exception or not; the thread might exit or note an error. The caller should decide how the error is handled. *Actual behaviour:* {{System.exit()}} is called, the JVM exits, and the Jetty microservice is down and must be restarted. *Consequence:* Any local code or third-party library might cause Spark to exit the long-running JVM/microservice, so Spark can be a problem in this architecture. I have seen this on three separate occasions now, leading to service-down bug reports.
*Analysis:* The line of code that seems to be the problem is: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L405 {code}
// Don't forcibly exit unless the exception was inherently fatal, to avoid
// stopping other tasks unnecessarily.
if (Utils.isFatalError(t)) {
  SparkUncaughtExceptionHandler.uncaughtException(t)
}
{code} [Utils.isFatalError()|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L1818] first defers to Scala [NonFatal|https://github.com/scala/scala/blob/2.12.x/src/library/scala/util/control/NonFatal.scala#L31], which treats everything as non-fatal except {{VirtualMachineError}}, {{ThreadDeath}}, {{InterruptedException}}, {{LinkageError}} and {{ControlThrowable}}. {{Utils.isFatalError()}} further excludes {{InterruptedException}}, {{NotImplementedError}} and {{ControlThrowable}}. That leaves {{Error}}s such as {{StackOverflowError extends VirtualMachineError}} or {{NoClassDefFoundError extends LinkageError}}, which occur in the aforementioned scenarios. {{SparkUncaughtExceptionHandler.uncaughtException()}} proceeds to call {{System.exit()}}. [Further up|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L77] in {{Executor}} we see that registering {{SparkUncaughtExceptionHandler}} is already skipped in local mode: {code}
if (!isLocal) {
  // Setup an uncaught exception handler for non-local mode.
  // Make any thread terminations due to uncaught exceptions kill the entire
  // executor process to avoid surprising stalls.
  Thread.setDefaultUncaughtExceptionHandler(SparkUncaughtExceptionHandler)
}
{code} The same exclusion must be applied in local mode for "fatal" errors - we cannot afford to shut down the enclosing JVM (e.g. Jetty); the caller should decide. A minimal test case is supplied.
It installs a logging {{SecurityManager}} to confirm that {{System.exit()}} was called from {{SparkUncaughtExceptionHandler.uncaughtException}} via {{Executor}}. It also hints at the workaround - install your own {{SecurityManager}} and inspect the current stack in {{checkExit()}} to prevent Spark from exiting the JVM.
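The fatal/non-fatal classification described in the analysis above can be sketched as follows. This is an illustrative Python model using stand-in exception classes that mirror the JVM throwable hierarchy; the real logic lives in Scala's {{NonFatal}} and Spark's {{Utils.isFatalError}}, and every class name below is a stand-in, not a real Python type.

```python
# Stand-in classes mirroring the JVM throwable hierarchy discussed above.
class VirtualMachineError(Exception): pass
class StackOverflowError(VirtualMachineError): pass
class LinkageError(Exception): pass
class NoClassDefFoundError(LinkageError): pass
class InterruptedException(Exception): pass
class ControlThrowable(Exception): pass
class NotImplementedError_(Exception): pass  # underscore avoids shadowing the builtin

def is_fatal_error(t: Exception) -> bool:
    """Sketch of Utils.isFatalError: a throwable is fatal if NonFatal does
    not match it, minus a few extra exclusions. NonFatal matches everything
    except VirtualMachineError, ThreadDeath, InterruptedException,
    LinkageError and ControlThrowable."""
    non_fatal = not isinstance(t, (VirtualMachineError, LinkageError,
                                   InterruptedException, ControlThrowable))
    if non_fatal:
        return False
    # Utils.isFatalError additionally excludes these:
    return not isinstance(t, (InterruptedException, NotImplementedError_,
                              ControlThrowable))

# Both scenarios from the bug report are classified fatal, which is what
# routes them to SparkUncaughtExceptionHandler and its System.exit() call:
print(is_fatal_error(StackOverflowError()))    # True
print(is_fatal_error(NoClassDefFoundError()))  # True
print(is_fatal_error(InterruptedException()))  # False
```

Under this model, the proposed fix is simply to skip the exit path when running in local mode, regardless of how the throwable is classified.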
[jira] [Updated] (SPARK-15681) Allow case-insensitiveness in sc.setLogLevel
[ https://issues.apache.org/jira/browse/SPARK-15681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Wu updated SPARK-15681: --- Description: Currently the SparkContext API setLogLevel(level: String) cannot handle lower-case or mixed-case input strings, but org.apache.log4j.Level.toLevel can take lowercase or mixed case. was: Currently the SparkContext API setLogLevel(level: String) cannot handle lower-case or mixed-case input strings, but org.apache.log4j.Level.toLevel can take lowercase or mixed case. Also, a resetLogLevel to restore the original configuration could help users switch log levels for different diagnostic purposes. > Allow case-insensitiveness in sc.setLogLevel > > > Key: SPARK-15681 > URL: https://issues.apache.org/jira/browse/SPARK-15681 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Xin Wu >Priority: Minor > > Currently the SparkContext API setLogLevel(level: String) cannot handle > lower-case or mixed-case input strings, but org.apache.log4j.Level.toLevel > can take lowercase or mixed case.
[jira] [Updated] (SPARK-15681) Allow case-insensitiveness in sc.setLogLevel
[ https://issues.apache.org/jira/browse/SPARK-15681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Wu updated SPARK-15681: --- Summary: Allow case-insensitiveness in sc.setLogLevel (was: Allow case-insensitiveness in sc.setLogLevel and support sc.resetLogLevel) > Allow case-insensitiveness in sc.setLogLevel > > > Key: SPARK-15681 > URL: https://issues.apache.org/jira/browse/SPARK-15681 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Xin Wu >Priority: Minor > > Currently the SparkContext API setLogLevel(level: String) cannot handle > lower-case or mixed-case input strings, but org.apache.log4j.Level.toLevel > can take lowercase or mixed case. > Also, a resetLogLevel to restore the original configuration could help > users switch log levels for different diagnostic purposes.
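The improvement requested above amounts to normalizing the level string before validating it. A minimal sketch, assuming the standard log4j level names; the helper name `normalize_log_level` is hypothetical, not Spark's actual code:

```python
# Case-insensitive log-level validation, analogous to what sc.setLogLevel
# needs before delegating to log4j's Level.toLevel.
VALID_LEVELS = {"ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE", "WARN"}

def normalize_log_level(level: str) -> str:
    upper = level.upper()   # accept "warn", "Warn" and "WARN" alike
    if upper not in VALID_LEVELS:
        raise ValueError(
            f"invalid log level: {level} (expected one of {sorted(VALID_LEVELS)})")
    return upper

print(normalize_log_level("warn"))  # WARN
```

Normalizing first keeps the error message for genuinely invalid input while making valid levels case-insensitive.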
[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309033#comment-15309033 ] Joseph K. Bradley commented on SPARK-15581: --- The roadmap links to [SPARK-4591], which contains it. That task is also listed as Critical and targeted at 2.1, so it should be prioritized. > MLlib 2.1 Roadmap > - > > Key: SPARK-15581 > URL: https://issues.apache.org/jira/browse/SPARK-15581 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > Labels: roadmap > > This is a master list for MLlib improvements we are working on for the next > release. Please view this as a wish list rather than a definite plan, for we > don't have an accurate estimate of available resources. Due to limited review > bandwidth, features appearing on this list will get higher priority during > code review. But feel free to suggest new items to the list in comments. We > are experimenting with this process. Your feedback would be greatly > appreciated. > h1. Instructions > h2. For contributors: > * Please read > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > carefully. Code style, documentation, and unit tests are important. > * If you are a first-time Spark contributor, please always start with a > [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather > than a medium/big feature. Based on our experience, mixing the development > process with a big feature usually causes long delay in code review. > * Never work silently. Let everyone know on the corresponding JIRA page when > you start working on some features. This is to avoid duplicate work. For > small features, you don't need to wait to get JIRA assigned. > * For medium/big features or features with dependencies, please get assigned > first before coding and keep the ETA updated on the JIRA. 
If there exist no > activity on the JIRA page for a certain amount of time, the JIRA should be > released for other contributors. > * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one > after another. > * Remember to add the `@Since("VERSION")` annotation to new public APIs. > * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code > review greatly helps to improve others' code as well as yours. > h2. For committers: > * Try to break down big features into small and specific JIRA tasks and link > them properly. > * Add a "starter" label to starter tasks. > * Put a rough estimate for medium/big features and track the progress. > * If you start reviewing a PR, please add yourself to the Shepherd field on > JIRA. > * If the code looks good to you, please comment "LGTM". For non-trivial PRs, > please ping a maintainer to make a final pass. > * After merging a PR, create and link JIRAs for Python, example code, and > documentation if applicable. > h1. Roadmap (*WIP*) > This is NOT [a complete list of MLlib JIRAs for 2.1| > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority]. > We only include umbrella JIRAs and high-level tasks. > Major efforts in this release: > * Feature parity for the DataFrames-based API (`spark.ml`), relative to the > RDD-based API > * ML persistence > * Python API feature parity and test coverage > * R API expansion and improvements > * Note about new features: As usual, we expect to expand the feature set of > MLlib. However, we will prioritize API parity, bug fixes, and improvements > over new features. > Note `spark.mllib` is in maintenance mode now. 
We will accept bug fixes for > it, but new features, APIs, and improvements will only be added to `spark.ml`. > h2. Critical feature parity in DataFrame-based API > * Umbrella JIRA: [SPARK-4591] > h2. Persistence > * Complete persistence within MLlib > ** Python tuning (SPARK-13786) > * MLlib in R format: compatibility with other languages (SPARK-15572) > * Impose backwards compatibility for persistence (SPARK-15573) > h2. Python API > * Standardize unit tests for Scala and Python to improve and consolidate test > coverage for Params, persistence, and other common functionality (SPARK-15571) > * Improve Python API handling of Params, persistence (SPARK-14771) > (SPARK-14706) > ** Note: The linked JIRAs for this are incomplete. More to be created... > ** Related: Implement Python meta-algorithms in Scala (to simplify > persistence) (SPARK-15574) > * Feature parity: The main goal of the Python API is to have feature parity
[jira] [Updated] (SPARK-15581) MLlib 2.1 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-15581: -- Description: This is a master list for MLlib improvements we are working on for the next release. Please view this as a wish list rather than a definite plan, for we don't have an accurate estimate of available resources. Due to limited review bandwidth, features appearing on this list will get higher priority during code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important. * If you are a first-time Spark contributor, please always start with a [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delay in code review. * Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. For small features, you don't need to wait to get JIRA assigned. * For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there exist no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors. * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another. * Remember to add the `@Since("VERSION")` annotation to new public APIs. * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps to improve others' code as well as yours. h2. 
For committers: * Try to break down big features into small and specific JIRA tasks and link them properly. * Add a "starter" label to starter tasks. * Put a rough estimate for medium/big features and track the progress. * If you start reviewing a PR, please add yourself to the Shepherd field on JIRA. * If the code looks good to you, please comment "LGTM". For non-trivial PRs, please ping a maintainer to make a final pass. * After merging a PR, create and link JIRAs for Python, example code, and documentation if applicable. h1. Roadmap (*WIP*) This is NOT [a complete list of MLlib JIRAs for 2.1| https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority]. We only include umbrella JIRAs and high-level tasks. Major efforts in this release: * Feature parity for the DataFrames-based API (`spark.ml`), relative to the RDD-based API * ML persistence * Python API feature parity and test coverage * R API expansion and improvements * Note about new features: As usual, we expect to expand the feature set of MLlib. However, we will prioritize API parity, bug fixes, and improvements over new features. Note `spark.mllib` is in maintenance mode now. We will accept bug fixes for it, but new features, APIs, and improvements will only be added to `spark.ml`. h2. Critical feature parity in DataFrame-based API * Umbrella JIRA: [SPARK-4591] h2. Persistence * Complete persistence within MLlib ** Python tuning (SPARK-13786) * MLlib in R format: compatibility with other languages (SPARK-15572) * Impose backwards compatibility for persistence (SPARK-15573) h2. 
Python API * Standardize unit tests for Scala and Python to improve and consolidate test coverage for Params, persistence, and other common functionality (SPARK-15571) * Improve Python API handling of Params, persistence (SPARK-14771) (SPARK-14706) ** Note: The linked JIRAs for this are incomplete. More to be created... ** Related: Implement Python meta-algorithms in Scala (to simplify persistence) (SPARK-15574) * Feature parity: The main goal of the Python API is to have feature parity with the Scala/Java API. You can find a [complete list here| https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20"In%20Progress"%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20component%20in%20(PySpark)%20AND%20"Target%20Version%2Fs"%20%3D%202.1.0%20ORDER%20BY%20priority%20DESC]. The tasks fall into two major categories: ** Python API for missing methods (SPARK-14813) ** Python API for new algorithms. Committers should create a JIRA for the Python API after merging a public feature in Scala/Java. h2. SparkR * Improve R formula support and implementation (SPARK-15540) *
[jira] [Commented] (SPARK-12347) Write script to run all MLlib examples for testing
[ https://issues.apache.org/jira/browse/SPARK-12347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309031#comment-15309031 ] Joseph K. Bradley commented on SPARK-12347: --- [~junzheng] Feel free to work on it; sorry for the slow response. I'd recommend pinging [~holdenk] on [SPARK-12428] to sync and split up the work. I just targeted this for 2.1 since [SPARK-15605] uncovered a broken example. Having tests would make it much easier to find the culprit PR! > Write script to run all MLlib examples for testing > -- > > Key: SPARK-12347 > URL: https://issues.apache.org/jira/browse/SPARK-12347 > Project: Spark > Issue Type: Test > Components: ML, MLlib, PySpark, SparkR, Tests >Reporter: Joseph K. Bradley >Priority: Critical > > It would facilitate testing to have a script which runs all MLlib examples > for all languages. > Design sketch to ensure all examples are run: > * Generate a list of examples to run programmatically (not from a fixed list). > * Use a list of special examples to handle examples which require command > line arguments. > * Make sure data, etc. used are small to keep the tests quick. > This could be broken into subtasks for each language, though it would be nice > to provide a single script. > Not sure where the script should live; perhaps in {{bin/}}?
[jira] [Updated] (SPARK-12347) Write script to run all MLlib examples for testing
[ https://issues.apache.org/jira/browse/SPARK-12347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-12347: -- Target Version/s: 2.1.0 Priority: Critical (was: Major) > Write script to run all MLlib examples for testing > -- > > Key: SPARK-12347 > URL: https://issues.apache.org/jira/browse/SPARK-12347 > Project: Spark > Issue Type: Test > Components: ML, MLlib, PySpark, SparkR, Tests >Reporter: Joseph K. Bradley >Priority: Critical > > It would facilitate testing to have a script which runs all MLlib examples > for all languages. > Design sketch to ensure all examples are run: > * Generate a list of examples to run programmatically (not from a fixed list). > * Use a list of special examples to handle examples which require command > line arguments. > * Make sure data, etc. used are small to keep the tests quick. > This could be broken into subtasks for each language, though it would be nice > to provide a single script. > Not sure where the script should live; perhaps in {{bin/}}?
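The design sketch in the issue above (programmatic discovery plus a table of special cases) could look roughly like this. File names and arguments here are hypothetical placeholders, not the real Spark examples tree:

```python
# Discover example files programmatically rather than from a fixed list,
# with a small table of special cases that need command-line arguments.
import os

SPECIAL_ARGS = {
    # example file -> extra command-line arguments it requires (hypothetical)
    "als_example.py": ["--rank", "5"],
}

def discover_examples(root):
    """Yield (path, extra_args) for every Python example under `root`."""
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(".py"):
                yield os.path.join(dirpath, name), SPECIAL_ARGS.get(name, [])
```

Each discovered pair could then be launched through spark-submit against small datasets, keeping the whole run quick as the issue suggests.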
[jira] [Updated] (SPARK-15605) ML JavaDeveloperApiExample is broken
[ https://issues.apache.org/jira/browse/SPARK-15605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-15605: -- Priority: Major (was: Minor) > ML JavaDeveloperApiExample is broken > > > Key: SPARK-15605 > URL: https://issues.apache.org/jira/browse/SPARK-15605 > Project: Spark > Issue Type: Bug > Components: Examples, ML >Affects Versions: 2.0.0 >Reporter: Yanbo Liang > > This bug is reported by > http://apache-spark-developers-list.1001551.n3.nabble.com/Creation-of-SparkML-Estimators-in-Java-broken-td17710.html > . > To fix the issue mentioned in the mailing list, we can implement maxIter as > follows: > {code}
> private IntParam maxIter_;
> public IntParam maxIter() {
>   return maxIter_;
> }
> public int getMaxIter() {
>   return (Integer) getOrDefault(maxIter_);
> }
> public MyJavaLogisticRegression setMaxIter(int value) {
>   return (MyJavaLogisticRegression) set(maxIter_, value);
> }
> private void init() {
>   maxIter_ = new IntParam(this, "maxIter", "max number of iterations");
>   setDefault(maxIter(), 100);
> }
> {code}
> But this does not solve all problems; it will throw new exceptions: > {code} > Exception in thread "main" java.lang.IllegalArgumentException: requirement > failed: Param null__featuresCol does not belong to myJavaLogReg_fbcf0d015036. > {code} > The shared params such as "featuresCol" should also have Java-friendly > wrappers. > Looking through the code base, I found we only implement JavaParams, which is > the wrapper for Scala Params. > We still need Java-friendly wrappers for the other traits that extend Scala > Params. > For example, in Scala we have: > {code} > trait HasLabelCol extends Params > {code} > We should have a Java-friendly wrapper as follows: > {code} > class JavaHasLabelCol extends Params > {code} > Otherwise, "Param.parent" will be null for each param and it will throw > exceptions when calling "Param.hasParam()".
> I think the Java compatibility for Params/Param should be further defined in > the next release cycle, and it is better to remove the > "JavaDeveloperApiExample" for now. Since we currently cannot properly support > users implementing their own algorithms with Estimator/Transformer in Java, > the example may mislead users.
[jira] [Updated] (SPARK-15605) ML JavaDeveloperApiExample is broken
[ https://issues.apache.org/jira/browse/SPARK-15605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-15605: -- Affects Version/s: 2.0.0 > ML JavaDeveloperApiExample is broken > > > Key: SPARK-15605 > URL: https://issues.apache.org/jira/browse/SPARK-15605 > Project: Spark > Issue Type: Bug > Components: Examples, ML >Affects Versions: 2.0.0 >Reporter: Yanbo Liang > > This bug is reported by > http://apache-spark-developers-list.1001551.n3.nabble.com/Creation-of-SparkML-Estimators-in-Java-broken-td17710.html > . > To fix the issue mentioned in the mailing list, we can implement maxIter as > follows: > {code}
> private IntParam maxIter_;
> public IntParam maxIter() {
>   return maxIter_;
> }
> public int getMaxIter() {
>   return (Integer) getOrDefault(maxIter_);
> }
> public MyJavaLogisticRegression setMaxIter(int value) {
>   return (MyJavaLogisticRegression) set(maxIter_, value);
> }
> private void init() {
>   maxIter_ = new IntParam(this, "maxIter", "max number of iterations");
>   setDefault(maxIter(), 100);
> }
> {code}
> But this does not solve all problems; it will throw new exceptions: > {code} > Exception in thread "main" java.lang.IllegalArgumentException: requirement > failed: Param null__featuresCol does not belong to myJavaLogReg_fbcf0d015036. > {code} > The shared params such as "featuresCol" should also have Java-friendly > wrappers. > Looking through the code base, I found we only implement JavaParams, which is > the wrapper for Scala Params. > We still need Java-friendly wrappers for the other traits that extend Scala > Params. > For example, in Scala we have: > {code} > trait HasLabelCol extends Params > {code} > We should have a Java-friendly wrapper as follows: > {code} > class JavaHasLabelCol extends Params > {code} > Otherwise, "Param.parent" will be null for each param and it will throw > exceptions when calling "Param.hasParam()".
> I think the Java compatibility for Params/Param should be further defined in > the next release cycle, and it is better to remove the > "JavaDeveloperApiExample" for now. Since we currently cannot properly support > users implementing their own algorithms with Estimator/Transformer in Java, > the example may mislead users.
[jira] [Updated] (SPARK-15678) Not use cache on appends and overwrites
[ https://issues.apache.org/jira/browse/SPARK-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sameer Agarwal updated SPARK-15678: --- Summary: Not use cache on appends and overwrites (was: Drop cache on appends and overwrites) > Not use cache on appends and overwrites > --- > > Key: SPARK-15678 > URL: https://issues.apache.org/jira/browse/SPARK-15678 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Sameer Agarwal > > SparkSQL currently doesn't drop caches if the underlying data is overwritten. > {code}
> val dir = "/tmp/test"
> sqlContext.range(1000).write.mode("overwrite").parquet(dir)
> val df = sqlContext.read.parquet(dir).cache()
> df.count() // outputs 1000
> sqlContext.range(10).write.mode("overwrite").parquet(dir)
> sqlContext.read.parquet(dir).count() // outputs 1000 instead of 10 <-- We are still using the cached dataset
> {code}
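The staleness in the report above can be illustrated with a toy path-keyed cache. This is a deliberately simplified model, not Spark's actual cache manager; `ToyTableCache` and its methods are hypothetical:

```python
# A toy read cache keyed by path that is never invalidated on write,
# mirroring the behaviour described in SPARK-15678.
class ToyTableCache:
    def __init__(self, storage):
        self.storage = storage      # path -> list of rows (the "files")
        self.cache = {}             # path -> cached rows

    def read(self, path):
        if path not in self.cache:              # first read populates the cache
            self.cache[path] = list(self.storage[path])
        return self.cache[path]

    def write_overwrite(self, path, rows):
        self.storage[path] = list(rows)
        # BUG (mirrors the issue): the cache entry for `path` is kept.
        # The fix is to invalidate here, e.g.:
        # self.cache.pop(path, None)

storage = {}
t = ToyTableCache(storage)
t.write_overwrite("/tmp/test", range(1000))
print(len(t.read("/tmp/test")))   # 1000
t.write_overwrite("/tmp/test", range(10))
print(len(t.read("/tmp/test")))   # still 1000: the stale cached result
```

Dropping (or refreshing) the cache entry inside `write_overwrite` makes the second read return 10, which is the behaviour the issue asks for.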
[jira] [Commented] (SPARK-15675) Implicit resolution doesn't work in multiple Statements in Spark Repl
[ https://issues.apache.org/jira/browse/SPARK-15675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308998#comment-15308998 ] Prashant Sharma commented on SPARK-15675: - It looks like https://github.com/scala/scala/pull/5084/files addresses this issue. Just to verify, I am going to run Spark with a special build of Scala to see if this upstream fix nails it. > Implicit resolution doesn't work in multiple Statements in Spark Repl > -- > > Key: SPARK-15675 > URL: https://issues.apache.org/jira/browse/SPARK-15675 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Reporter: Shixiong Zhu > > Here is the reproducer: > {code}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> case class Y(i: Int)
> Seq(Y(1)).toDS()
> // Exiting paste mode, now interpreting.
> :14: error: value toDS is not a member of Seq[Y]
>    Seq(Y(1)).toDS()
>    ^
> scala> case class Y(i: Int); Seq(Y(1)).toDS()
> :13: error: value toDS is not a member of Seq[Y]
>    Seq(Y(1)).toDS()
>    ^
> {code}
[jira] [Updated] (SPARK-14343) Dataframe operations on a partitioned dataset (using partition discovery) return invalid results
[ https://issues.apache.org/jira/browse/SPARK-14343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-14343: - Issue Type: Bug (was: Sub-task) Parent: (was: SPARK-15631) > Dataframe operations on a partitioned dataset (using partition discovery) > return invalid results > > > Key: SPARK-14343 > URL: https://issues.apache.org/jira/browse/SPARK-14343 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 > Environment: Mac OS X 10.11.4 / Ubuntu 16.04 LTS >Reporter: Jurriaan Pruis >Priority: Critical > > When reading a dataset using {{sqlContext.read.text()}} queries on the > partitioned column return invalid results. > h2. How to reproduce: > h3. Generate datasets > {code:title=repro.sh}
> #!/bin/sh
> mkdir -p dataset/year=2014
> mkdir -p dataset/year=2015
> echo "data from 2014" > dataset/year=2014/part01.txt
> echo "data from 2015" > dataset/year=2015/part01.txt
> {code}
> {code:title=repro2.sh}
> #!/bin/sh
> mkdir -p dataset2/month=june
> mkdir -p dataset2/month=july
> echo "data from june" > dataset2/month=june/part01.txt
> echo "data from july" > dataset2/month=july/part01.txt
> {code}
> h3. using first dataset
> {code:none}
> >>> df = sqlContext.read.text('dataset')
> ...
> >>> df
> DataFrame[value: string, year: int]
> >>> df.show()
> +--------------+----+
> |         value|year|
> +--------------+----+
> |data from 2014|2014|
> |data from 2015|2015|
> +--------------+----+
> >>> df.select('year').show()
> +----+
> |year|
> +----+
> |  14|
> |  14|
> +----+
> {code}
> This is clearly wrong. Seems like it returns the length of the value column?
> h3. using second dataset
> With another dataset it looks like this:
> {code:none}
> >>> df = sqlContext.read.text('dataset2')
> >>> df
> DataFrame[value: string, month: string]
> >>> df.show()
> +--------------+-----+
> |         value|month|
> +--------------+-----+
> |data from june| june|
> |data from july| july|
> +--------------+-----+
> >>> df.select('month').show()
> +--------------+
> |         month|
> +--------------+
> |data from june|
> |data from july|
> +--------------+
> {code}
> Here it returns the value of the value column instead of the month partition.
> h3. Workaround
> When I convert the dataframe to an RDD and back to a DataFrame I get the
> following result (which is the expected behaviour):
> {code:none}
> >>> df.rdd.toDF().select('month').show()
> +-----+
> |month|
> +-----+
> | june|
> | july|
> +-----+
> {code}
[jira] [Updated] (SPARK-14343) Dataframe operations on a partitioned dataset (using partition discovery) return invalid results
[ https://issues.apache.org/jira/browse/SPARK-14343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-14343: - Target Version/s: 2.0.0 Priority: Critical (was: Blocker) > Dataframe operations on a partitioned dataset (using partition discovery) > return invalid results > > > Key: SPARK-14343 > URL: https://issues.apache.org/jira/browse/SPARK-14343 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.6.1, 2.0.0 > Environment: Mac OS X 10.11.4 / Ubuntu 16.04 LTS >Reporter: Jurriaan Pruis >Priority: Critical > > When reading a dataset using {{sqlContext.read.text()}} queries on the > partitioned column return invalid results. > h2. How to reproduce: > h3. Generate datasets > {code:title=repro.sh}
> #!/bin/sh
> mkdir -p dataset/year=2014
> mkdir -p dataset/year=2015
> echo "data from 2014" > dataset/year=2014/part01.txt
> echo "data from 2015" > dataset/year=2015/part01.txt
> {code}
> {code:title=repro2.sh}
> #!/bin/sh
> mkdir -p dataset2/month=june
> mkdir -p dataset2/month=july
> echo "data from june" > dataset2/month=june/part01.txt
> echo "data from july" > dataset2/month=july/part01.txt
> {code}
> h3. using first dataset
> {code:none}
> >>> df = sqlContext.read.text('dataset')
> ...
> >>> df
> DataFrame[value: string, year: int]
> >>> df.show()
> +--------------+----+
> |         value|year|
> +--------------+----+
> |data from 2014|2014|
> |data from 2015|2015|
> +--------------+----+
> >>> df.select('year').show()
> +----+
> |year|
> +----+
> |  14|
> |  14|
> +----+
> {code}
> This is clearly wrong. Seems like it returns the length of the value column?
> h3. using second dataset
> With another dataset it looks like this:
> {code:none}
> >>> df = sqlContext.read.text('dataset2')
> >>> df
> DataFrame[value: string, month: string]
> >>> df.show()
> +--------------+-----+
> |         value|month|
> +--------------+-----+
> |data from june| june|
> |data from july| july|
> +--------------+-----+
> >>> df.select('month').show()
> +--------------+
> |         month|
> +--------------+
> |data from june|
> |data from july|
> +--------------+
> {code}
> Here it returns the value of the value column instead of the month partition.
> h3. Workaround
> When I convert the dataframe to an RDD and back to a DataFrame I get the
> following result (which is the expected behaviour):
> {code:none}
> >>> df.rdd.toDF().select('month').show()
> +-----+
> |month|
> +-----+
> | june|
> | july|
> +-----+
> {code}
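The repro above relies on the `key=value` directory-naming convention that partition discovery parses. The expected behaviour is easy to state as a sketch: each `key=value` path component contributes one partition column whose value comes from the directory name, never from the file contents. This is a simplified illustration (`parse_partition_columns` is a hypothetical helper; Spark's real implementation also infers column types):

```python
# Each directory component of the form key=value contributes a partition
# column, e.g. 'year=2014/part01.txt' -> {'year': '2014'}.
def parse_partition_columns(relative_path):
    parts = {}
    for component in relative_path.split("/")[:-1]:   # skip the file name
        if "=" in component:
            key, _, value = component.partition("=")
            parts[key] = value
    return parts

print(parse_partition_columns("year=2014/part01.txt"))   # {'year': '2014'}
print(parse_partition_columns("month=june/part01.txt"))  # {'month': 'june'}
```

Selecting the partition column should therefore yield `2014`/`2015` and `june`/`july`; the reported `14` (the length of the value column) and the value-column strings are both wrong answers.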
[jira] [Resolved] (SPARK-15601) CircularBuffer's toString() to print only the contents written if buffer isn't full
[ https://issues.apache.org/jira/browse/SPARK-15601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15601. --- Resolution: Fixed Fix Version/s: 1.6.2 2.0.0 Issue resolved by pull request 13351 [https://github.com/apache/spark/pull/13351] > CircularBuffer's toString() to print only the contents written if buffer > isn't full > --- > > Key: SPARK-15601 > URL: https://issues.apache.org/jira/browse/SPARK-15601 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Tejas Patil >Priority: Trivial > Fix For: 2.0.0, 1.6.2 > > > [~sameerag] suggested this over a PR discussion: > https://github.com/apache/spark/pull/12194#discussion_r64495331 > If CircularBuffer isn't full, currently toString() will print some garbage > chars along with the content written, as it tries to print the entire array > allocated for the buffer.
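For readers unfamiliar with the bug: a toy circular buffer (a Python sketch, not Spark's actual org.apache.spark.util.CircularBuffer) shows why printing the whole backing array is wrong before the buffer wraps, and what the fixed toString() behaviour looks like:

```python
class CircularBuffer:
    """Toy fixed-size circular character buffer (illustration only)."""
    def __init__(self, size=8):
        self.buf = [None] * size  # None marks slots never written
        self.pos = 0              # next write position
        self.filled = False       # True once we have wrapped around

    def write(self, text):
        for ch in text:
            self.buf[self.pos] = ch
            self.pos = (self.pos + 1) % len(self.buf)
            if self.pos == 0:
                self.filled = True

    def __str__(self):
        if not self.filled:
            # Only the slots actually written; dumping the whole array here
            # is exactly the "garbage chars" bug described in the ticket.
            return "".join(self.buf[:self.pos])
        # Once wrapped, the oldest data starts at self.pos.
        return "".join(self.buf[self.pos:] + self.buf[:self.pos])

b = CircularBuffer(8)
b.write("abc")
print(str(b))  # "abc", not "abc" plus five uninitialized slots
```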
[jira] [Commented] (SPARK-15675) Implicit resolution doesn't work in multiple Statements in Spark Repl
[ https://issues.apache.org/jira/browse/SPARK-15675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308989#comment-15308989 ] Prashant Sharma commented on SPARK-15675: - Interesting, I am looking into it. I will have to add tests in scala/scala, to guard against changes that break these things. I am assuming something has changed around how repl handles multiline statements. Because, I think this did not happen earlier. > Implicit resolution doesn't work in multiple Statements in Spark Repl > -- > > Key: SPARK-15675 > URL: https://issues.apache.org/jira/browse/SPARK-15675 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Reporter: Shixiong Zhu > > Here is the reproducer: > {code} > scala> :paste > // Entering paste mode (ctrl-D to finish) > case class Y(i: Int) > Seq(Y(1)).toDS() > // Exiting paste mode, now interpreting. > :14: error: value toDS is not a member of Seq[Y] >Seq(Y(1)).toDS() > ^ > scala> case class Y(i: Int); Seq(Y(1)).toDS() > :13: error: value toDS is not a member of Seq[Y] > Seq(Y(1)).toDS() > ^ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15236) No way to disable Hive support in REPL
[ https://issues.apache.org/jira/browse/SPARK-15236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15236. --- Resolution: Fixed Fix Version/s: 2.0.0 > No way to disable Hive support in REPL > -- > > Key: SPARK-15236 > URL: https://issues.apache.org/jira/browse/SPARK-15236 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Xin Wu > Fix For: 2.0.0 > > > If you built Spark with Hive classes, there's no switch to flip to start a > new `spark-shell` using the InMemoryCatalog. The only thing you can do now is > to rebuild Spark again. That is quite inconvenient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
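The fix merged for this ticket exposes a configuration switch. The flag name below comes from the Spark 2.0 codebase ({{spark.sql.catalogImplementation}}, an internal option, so treat it as an assumption to verify against your build); the toy function merely illustrates the selection logic the switch enables:

```python
# Toy sketch of catalog selection. Hedged: option name and default value
# are assumptions taken from the Spark 2.0 sources, not a public API.
def choose_catalog(conf):
    impl = conf.get("spark.sql.catalogImplementation", "in-memory")
    if impl == "hive":
        return "HiveExternalCatalog"
    return "InMemoryCatalog"

print(choose_catalog({}))                                           # InMemoryCatalog
print(choose_catalog({"spark.sql.catalogImplementation": "hive"}))  # HiveExternalCatalog
```

With a switch like this, a Hive-enabled build can still start `spark-shell` against the InMemoryCatalog without rebuilding Spark.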
[jira] [Resolved] (SPARK-15618) Use SparkSession.builder.sparkContext(...) in tests where possible
[ https://issues.apache.org/jira/browse/SPARK-15618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15618. --- Resolution: Fixed Fix Version/s: 2.0.0 > Use SparkSession.builder.sparkContext(...) in tests where possible > -- > > Key: SPARK-15618 > URL: https://issues.apache.org/jira/browse/SPARK-15618 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.0.0 > > > There are many places where we could be more explicit about the particular > underlying SparkContext we want, but we just do > `SparkSession.builder.getOrCreate()` anyway. It's better to be clearer in the > code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15236) No way to disable Hive support in REPL
[ https://issues.apache.org/jira/browse/SPARK-15236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15236: -- Assignee: Xin Wu > No way to disable Hive support in REPL > -- > > Key: SPARK-15236 > URL: https://issues.apache.org/jira/browse/SPARK-15236 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Xin Wu > Fix For: 2.0.0 > > > If you built Spark with Hive classes, there's no switch to flip to start a > new `spark-shell` using the InMemoryCatalog. The only thing you can do now is > to rebuild Spark again. That is quite inconvenient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12666) spark-shell --packages cannot load artifacts which are publishLocal'd by SBT
[ https://issues.apache.org/jira/browse/SPARK-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12666: Assignee: Apache Spark > spark-shell --packages cannot load artifacts which are publishLocal'd by SBT > > > Key: SPARK-12666 > URL: https://issues.apache.org/jira/browse/SPARK-12666 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.5.1, 1.6.0 >Reporter: Josh Rosen >Assignee: Apache Spark > > Symptom: > I cloned the latest master of {{spark-redshift}}, then used {{sbt > publishLocal}} to publish it to my Ivy cache. When I tried running > {{./bin/spark-shell --packages > com.databricks:spark-redshift_2.10:0.5.3-SNAPSHOT}} to load this dependency > into {{spark-shell}}, I received the following cryptic error: > {code} > Exception in thread "main" java.lang.RuntimeException: [unresolved > dependency: com.databricks#spark-redshift_2.10;0.5.3-SNAPSHOT: configuration > not found in com.databricks#spark-redshift_2.10;0.5.3-SNAPSHOT: 'default'. It > was required from org.apache.spark#spark-submit-parent;1.0 default] > at > org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1009) > at > org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} > I think the problem here is that Spark is declaring a dependency on the > spark-redshift artifact using the {{default}} Ivy configuration. Based on my > admittedly limited understanding of Ivy, the default configuration will be > the only configuration defined in an Ivy artifact if that artifact defines no > other configurations. Thus, for Maven artifacts I think the default > configuration will end up mapping to Maven's regular JAR dependency (i.e. 
> Maven artifacts don't declare Ivy configurations so they implicitly have the > {{default}} configuration) but for Ivy artifacts I think we can run into > trouble when loading artifacts which explicitly define their own > configurations, since those artifacts might not have a configuration named > {{default}}. > I spent a bit of time playing around with the SparkSubmit code to see if I > could fix this but wasn't able to completely resolve the issue. > /cc [~brkyvz] (ping me offline and I can walk you through the repo in person, > if you'd like) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12666) spark-shell --packages cannot load artifacts which are publishLocal'd by SBT
[ https://issues.apache.org/jira/browse/SPARK-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308965#comment-15308965 ] Apache Spark commented on SPARK-12666: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/13428 > spark-shell --packages cannot load artifacts which are publishLocal'd by SBT > > > Key: SPARK-12666 > URL: https://issues.apache.org/jira/browse/SPARK-12666 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.5.1, 1.6.0 >Reporter: Josh Rosen > > Symptom: > I cloned the latest master of {{spark-redshift}}, then used {{sbt > publishLocal}} to publish it to my Ivy cache. When I tried running > {{./bin/spark-shell --packages > com.databricks:spark-redshift_2.10:0.5.3-SNAPSHOT}} to load this dependency > into {{spark-shell}}, I received the following cryptic error: > {code} > Exception in thread "main" java.lang.RuntimeException: [unresolved > dependency: com.databricks#spark-redshift_2.10;0.5.3-SNAPSHOT: configuration > not found in com.databricks#spark-redshift_2.10;0.5.3-SNAPSHOT: 'default'. It > was required from org.apache.spark#spark-submit-parent;1.0 default] > at > org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1009) > at > org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} > I think the problem here is that Spark is declaring a dependency on the > spark-redshift artifact using the {{default}} Ivy configuration. Based on my > admittedly limited understanding of Ivy, the default configuration will be > the only configuration defined in an Ivy artifact if that artifact defines no > other configurations. 
Thus, for Maven artifacts I think the default > configuration will end up mapping to Maven's regular JAR dependency (i.e. > Maven artifacts don't declare Ivy configurations so they implicitly have the > {{default}} configuration) but for Ivy artifacts I think we can run into > trouble when loading artifacts which explicitly define their own > configurations, since those artifacts might not have a configuration named > {{default}}. > I spent a bit of time playing around with the SparkSubmit code to see if I > could fix this but wasn't able to completely resolve the issue. > /cc [~brkyvz] (ping me offline and I can walk you through the repo in person, > if you'd like) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12666) spark-shell --packages cannot load artifacts which are publishLocal'd by SBT
[ https://issues.apache.org/jira/browse/SPARK-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12666: Assignee: (was: Apache Spark) > spark-shell --packages cannot load artifacts which are publishLocal'd by SBT > > > Key: SPARK-12666 > URL: https://issues.apache.org/jira/browse/SPARK-12666 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.5.1, 1.6.0 >Reporter: Josh Rosen > > Symptom: > I cloned the latest master of {{spark-redshift}}, then used {{sbt > publishLocal}} to publish it to my Ivy cache. When I tried running > {{./bin/spark-shell --packages > com.databricks:spark-redshift_2.10:0.5.3-SNAPSHOT}} to load this dependency > into {{spark-shell}}, I received the following cryptic error: > {code} > Exception in thread "main" java.lang.RuntimeException: [unresolved > dependency: com.databricks#spark-redshift_2.10;0.5.3-SNAPSHOT: configuration > not found in com.databricks#spark-redshift_2.10;0.5.3-SNAPSHOT: 'default'. It > was required from org.apache.spark#spark-submit-parent;1.0 default] > at > org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1009) > at > org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} > I think the problem here is that Spark is declaring a dependency on the > spark-redshift artifact using the {{default}} Ivy configuration. Based on my > admittedly limited understanding of Ivy, the default configuration will be > the only configuration defined in an Ivy artifact if that artifact defines no > other configurations. Thus, for Maven artifacts I think the default > configuration will end up mapping to Maven's regular JAR dependency (i.e. 
> Maven artifacts don't declare Ivy configurations so they implicitly have the > {{default}} configuration) but for Ivy artifacts I think we can run into > trouble when loading artifacts which explicitly define their own > configurations, since those artifacts might not have a configuration named > {{default}}. > I spent a bit of time playing around with the SparkSubmit code to see if I > could fix this but wasn't able to completely resolve the issue. > /cc [~brkyvz] (ping me offline and I can walk you through the repo in person, > if you'd like) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15670) Add deprecated annotation for accumulator V1 interface in JavaSparkContext class
[ https://issues.apache.org/jira/browse/SPARK-15670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15670: -- Assignee: Weichen Xu > Add deprecated annotation for accumulator V1 interface in JavaSparkContext class > -- > > Key: SPARK-15670 > URL: https://issues.apache.org/jira/browse/SPARK-15670 > Project: Spark > Issue Type: Improvement > Components: Java API, Spark Core >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Minor > Fix For: 2.0.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > Add deprecated annotation for accumulator V1 interface in JavaSparkContext class
[jira] [Resolved] (SPARK-15670) Add deprecated annotation for accumulator V1 interface in JavaSparkContext class
[ https://issues.apache.org/jira/browse/SPARK-15670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15670. --- Resolution: Fixed Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Add deprecated annotation for accumulator V1 interface in JavaSparkContext class > -- > > Key: SPARK-15670 > URL: https://issues.apache.org/jira/browse/SPARK-15670 > Project: Spark > Issue Type: Improvement > Components: Java API, Spark Core >Reporter: Weichen Xu >Priority: Minor > Fix For: 2.0.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > Add deprecated annotation for accumulator V1 interface in JavaSparkContext class
[jira] [Resolved] (SPARK-15680) Disable comments in generated code in order to avoid performance issues
[ https://issues.apache.org/jira/browse/SPARK-15680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15680. - Resolution: Fixed Fix Version/s: 2.0.0 > Disable comments in generated code in order to avoid performance issues > --- > > Key: SPARK-15680 > URL: https://issues.apache.org/jira/browse/SPARK-15680 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.0.0 > > Attachments: pasted_image_at_2016_05_29_03_02_pm.png, > pasted_image_at_2016_05_29_03_03_pm.png > > > In benchmarks involving tables with very wide and complex schemas (thousands > of columns, deep nesting), I noticed that significant amounts of time (order > of tens of seconds per task) were being spent generating comments during the > code generation phase. > The root cause of the performance problem stems from the fact that calling > {{toString()}} on a complex expression can involve thousands of string > concatenations, resulting in huge amounts (tens of gigabytes) of character > array allocation and copying (see attached profiler screenshots) > In the long term, we can avoid this problem by passing StringBuilders down > the tree and using them to accumulate output. In the short term, however, I > think that we should just disable comments in the generated code by default > since very long comments are typically not useful debugging aids (since > they're truncated for display anyways). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
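The "pass StringBuilders down the tree" idea from the description can be illustrated with a toy expression tree (a Python sketch, not Spark's codegen): the first function rebuilds strings at every level, causing the quadratic allocation and copying described in the ticket, while the second threads a single accumulator through the recursion and joins once:

```python
# Two ways to stringify a binary expression tree, e.g. ("+", left, right).
def to_string_concat(node):
    # Naive: every level allocates brand-new strings (the slow path).
    if isinstance(node, str):
        return node
    op, left, right = node
    return "(" + to_string_concat(left) + " " + op + " " + to_string_concat(right) + ")"

def to_string_builder(node, out=None):
    # Builder-style: pass one buffer down the tree; join exactly once.
    top = out is None
    if top:
        out = []
    if isinstance(node, str):
        out.append(node)
    else:
        op, left, right = node
        out.append("(")
        to_string_builder(left, out)
        out.append(" " + op + " ")
        to_string_builder(right, out)
        out.append(")")
    return "".join(out) if top else None

expr = ("+", ("*", "a", "b"), "c")
assert to_string_concat(expr) == to_string_builder(expr) == "((a * b) + c)"
```

Both produce the same text; only the allocation pattern differs, which is why the short-term fix of simply omitting comments was acceptable while the builder refactoring waited.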
[jira] [Resolved] (SPARK-15662) Add since annotation for classes in sql.catalog
[ https://issues.apache.org/jira/browse/SPARK-15662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15662. --- Resolution: Fixed Fix Version/s: 2.0.0 > Add since annotation for classes in sql.catalog > --- > > Key: SPARK-15662 > URL: https://issues.apache.org/jira/browse/SPARK-15662 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15684) Not mask startsWith and endsWith in R
Miao Wang created SPARK-15684: - Summary: Not mask startsWith and endsWith in R Key: SPARK-15684 URL: https://issues.apache.org/jira/browse/SPARK-15684 Project: Spark Issue Type: Improvement Reporter: Miao Wang Priority: Minor In R 3.3.0, it has startsWith and endsWith. We should not mask these two methods in Spark. Actually, Spark R has startsWith and endsWith working for column. But making them work for both column and string is not easy. I created this JIRA for discussion.
[jira] [Commented] (SPARK-12666) spark-shell --packages cannot load artifacts which are publishLocal'd by SBT
[ https://issues.apache.org/jira/browse/SPARK-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308950#comment-15308950 ] Bryan Cutler commented on SPARK-12666: -- This seems like it is more of an issue with SBT publishLocal, as it publishes an ivy.xml file without a "default" conf. Like this, for example {noformat} http://ant.apache.org/ivy/extra;> ... ... {noformat} If this file had a conf name="default", or if there were no confs defined an implicit "default" would be used. But since it is missing, there is an error when resolving. I think it's reasonable for Spark to expect a "default" conf to exist, but we can make it a bit more robust to handle cases like this where it doesn't exist. I'll post a PR for this. > spark-shell --packages cannot load artifacts which are publishLocal'd by SBT > > > Key: SPARK-12666 > URL: https://issues.apache.org/jira/browse/SPARK-12666 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.5.1, 1.6.0 >Reporter: Josh Rosen > > Symptom: > I cloned the latest master of {{spark-redshift}}, then used {{sbt > publishLocal}} to publish it to my Ivy cache. When I tried running > {{./bin/spark-shell --packages > com.databricks:spark-redshift_2.10:0.5.3-SNAPSHOT}} to load this dependency > into {{spark-shell}}, I received the following cryptic error: > {code} > Exception in thread "main" java.lang.RuntimeException: [unresolved > dependency: com.databricks#spark-redshift_2.10;0.5.3-SNAPSHOT: configuration > not found in com.databricks#spark-redshift_2.10;0.5.3-SNAPSHOT: 'default'. 
It > was required from org.apache.spark#spark-submit-parent;1.0 default] > at > org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1009) > at > org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} > I think the problem here is that Spark is declaring a dependency on the > spark-redshift artifact using the {{default}} Ivy configuration. Based on my > admittedly limited understanding of Ivy, the default configuration will be > the only configuration defined in an Ivy artifact if that artifact defines no > other configurations. Thus, for Maven artifacts I think the default > configuration will end up mapping to Maven's regular JAR dependency (i.e. > Maven artifacts don't declare Ivy configurations so they implicitly have the > {{default}} configuration) but for Ivy artifacts I think we can run into > trouble when loading artifacts which explicitly define their own > configurations, since those artifacts might not have a configuration named > {{default}}. > I spent a bit of time playing around with the SparkSubmit code to see if I > could fix this but wasn't able to completely resolve the issue. > /cc [~brkyvz] (ping me offline and I can walk you through the repo in person, > if you'd like) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
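Bryan's proposed robustness can be sketched as a small fallback rule (hypothetical helper, not the actual SparkSubmitUtils code in the linked PR): treat a missing conf list as the implicit {{default}}, prefer an explicit {{default}} when present, and otherwise fall back to one of the declared confs instead of failing resolution outright:

```python
# Sketch of a more forgiving Ivy configuration pick. Hedged: the function
# name and fallback choice are illustrative assumptions, not Spark's code.
def pick_configuration(declared_confs):
    if not declared_confs:          # no confs at all -> implicit "default"
        return "default"
    if "default" in declared_confs:
        return "default"
    # "default" missing (e.g. an SBT publishLocal ivy.xml): fall back to the
    # first declared conf instead of raising "configuration not found".
    return declared_confs[0]

print(pick_configuration([]))                      # default
print(pick_configuration(["compile", "runtime"]))  # compile
```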
[jira] [Commented] (SPARK-15684) Not mask startsWith and endsWith in R
[ https://issues.apache.org/jira/browse/SPARK-15684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308940#comment-15308940 ] Miao Wang commented on SPARK-15684: --- [~yanboliang] Per our discussion, this one is similar to implementing read_csv in Spark R, and there is no nice solution for this type of issue. Can you comment on this JIRA? Thanks! > Not mask startsWith and endsWith in R > - > > Key: SPARK-15684 > URL: https://issues.apache.org/jira/browse/SPARK-15684 > Project: Spark > Issue Type: Improvement >Reporter: Miao Wang >Priority: Minor > > In R 3.3.0, it has startsWith and endsWith. We should not mask these two > methods in Spark. Actually, Spark R has startsWith and endsWith working for > column. But making them work for both column and string is not easy. I created > this JIRA for discussion.
[jira] [Resolved] (SPARK-15451) Spark PR builder should fail if code doesn't compile against JDK 7
[ https://issues.apache.org/jira/browse/SPARK-15451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-15451. Resolution: Fixed Assignee: Marcelo Vanzin Fix Version/s: 2.0.0 > Spark PR builder should fail if code doesn't compile against JDK 7 > -- > > Key: SPARK-15451 > URL: https://issues.apache.org/jira/browse/SPARK-15451 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Fix For: 2.0.0 > > > We need to compile certain parts of the build using jdk8, so that we test > things like lambdas. But when possible, we should either compile using jdk7, > or provide jdk7's rt.jar to javac. Otherwise it's way too easy to slip in > jdk8-specific library calls. > I'll take a look at fixing the maven / sbt files, but I'm not sure how to > update the PR builders since this will most probably require at least a new > env variable (to say where jdk7 is). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11474) Options to jdbc load are lower cased
[ https://issues.apache.org/jira/browse/SPARK-11474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-11474: Component/s: (was: Input/Output) SQL > Options to jdbc load are lower cased > > > Key: SPARK-11474 > URL: https://issues.apache.org/jira/browse/SPARK-11474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 > Environment: Linux & Mac >Reporter: Stephen Samuel >Assignee: Huaxin Gao >Priority: Minor > Fix For: 1.5.3, 1.6.0 > > > We recently upgraded from spark 1.3.0 to 1.5.1 and one of the features we > wanted to take advantage of was the fetchSize added to the jdbc data frame > reader. > In 1.5.1 there appears to be a bug or regression, whereby an options map has > its keys lowercased. This means the existing properties from prior to 1.4 are > ok, such as dbtable, url and driver, but the newer fetchSize gets converted > to fetchsize. > To re-produce: > val conf = new SparkConf(true).setMaster("local").setAppName("fetchtest") > val sc = new SparkContext(conf) > val sql = new SQLContext(sc) > val options = Map("url" -> , "driver" -> , "fetchSize" -> ) > val df = sql.load("jdbc", options) > Breakpoint at line 371 in JDBCRDD and you'll see the options are all > lowercased, so: > val fetchSize = properties.getProperty("fetchSize", "0").toInt > results in 0 > Now I know sql.load is deprecated, but this might be occuring on other > methods too. The workaround is to use the java.util.Properties overload, > which keeps the case sensitive keys. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
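The behaviour users expect here can be sketched with a small case-insensitive options wrapper (hypothetical class, illustrative only, not Spark's internal option-map implementation): lookups ignore case, but the caller's original keys such as {{fetchSize}} are preserved rather than destructively lowercased:

```python
# Sketch: resolve JDBC-style options case-insensitively while keeping the
# original keys intact, so "fetchSize" no longer silently becomes "fetchsize".
class CaseInsensitiveOptions:
    def __init__(self, options):
        self._orig = dict(options)                       # keys kept as given
        self._lower = {k.lower(): v for k, v in options.items()}

    def get(self, key, default=None):
        return self._lower.get(key.lower(), default)

opts = CaseInsensitiveOptions({"url": "jdbc:...", "fetchSize": "100"})
print(opts.get("fetchSize"))  # "100", regardless of the casing used to look it up
```

This mirrors the workaround noted in the report: the java.util.Properties overload kept case-sensitive keys, so it avoided the lowercasing regression.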
[jira] [Commented] (SPARK-15545) R remove non-exported unused methods, like jsonRDD
[ https://issues.apache.org/jira/browse/SPARK-15545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308892#comment-15308892 ] Miao Wang commented on SPARK-15545: --- I will give a try. Thanks > R remove non-exported unused methods, like jsonRDD > -- > > Key: SPARK-15545 > URL: https://issues.apache.org/jira/browse/SPARK-15545 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Felix Cheung >Priority: Minor > > Need to review what should be removed. > one reason to not remove this right away is because we have been talking > about calling internal methods via `SparkR:::jsonRDD` for this and other RDD > methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15654) Reading gzipped files results in duplicate rows
[ https://issues.apache.org/jira/browse/SPARK-15654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15307043#comment-15307043 ] Takeshi Yamamuro edited comment on SPARK-15654 at 5/31/16 11:10 PM: Seems a root cause is that LineRecordReader cannot split files compressed by some codecs. hadoop-v2.8+ throws an exception if these kinds of files are passed into LineRecordReader (https://github.com/apache/hadoop/blame/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java#L103) This is fixed in MAPREDUCE-2094. One solution is to set Long.MaxValue at defaultMaxSplitBytes in FileSourceStrategy if unsplittable files detected there. https://github.com/apache/spark/compare/master...maropu:SPARK-15654 was (Author: maropu): Seems a root cause is that LineRecordReader cannot split files compressed by some codecs. hadoop-v2.8+ throws an exception if these kinds of files are passed into LineRecordReader (https://github.com/apache/hadoop/blame/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java#L103) This is fixed in MAPREDUCE-2094. One solution is to set Long.MaxValue at defaultMaxSplitBytes in FileSourceStrategy if splittable files detected there. https://github.com/apache/spark/compare/master...maropu:SPARK-15654 > Reading gzipped files results in duplicate rows > --- > > Key: SPARK-15654 > URL: https://issues.apache.org/jira/browse/SPARK-15654 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Jurriaan Pruis >Priority: Blocker > > When gzipped files are larger then {{spark.sql.files.maxPartitionBytes}} > reading the file will result in duplicate rows in the dataframe. 
> Given an example gzipped wordlist (of 740K bytes): > {code} > $ gzcat words.gz |wc -l > 235886 > {code} > Reading it using spark results in the following output: > {code} > >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '1000') > >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").count() > 81244093 > >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '1') > >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").count() > 8348566 > >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '10') > >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").count() > 1051469 > >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '100') > >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").count() > 235886 > {code} > You can clearly see how the number of rows scales with the number of > partitions. > Somehow the data is duplicated when the number of partitions exceeds one > (which as seen above approximately scales with the partition size). > When using distinct you'll get the correct answer: > {code} > >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '1') > >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").distinct().count() > 235886 > {code} > This looks like a pretty serious bug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
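The proposed fix in the comment thread is to never split a file whose codec is unsplittable, regardless of {{spark.sql.files.maxPartitionBytes}}. A toy partition planner (Python sketch; the file-suffix check stands in for Hadoop's real codec-splittability test, which is an assumption here) illustrates the idea:

```python
# Sketch: plan read partitions, giving unsplittable files one whole-file
# slice. Hedged: suffix matching is a stand-in for a proper codec check.
UNSPLITTABLE_SUFFIXES = (".gz",)

def plan_partitions(files, max_partition_bytes):
    partitions = []  # (file name, start offset, length)
    for name, size in files:
        if name.endswith(UNSPLITTABLE_SUFFIXES):
            partitions.append((name, 0, size))  # one slice = whole file
        else:
            offset = 0
            while offset < size:
                length = min(max_partition_bytes, size - offset)
                partitions.append((name, offset, length))
                offset += length
    return partitions

print(plan_partitions([("words.gz", 740000)], 1000))     # a single partition
print(len(plan_partitions([("words.txt", 3000)], 1000))) # 3
```

Splitting a gzip stream at arbitrary byte offsets cannot work, so each split re-reading from the start is what produced the duplicated rows above.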
[jira] [Commented] (SPARK-6320) Adding new query plan strategy to SQLContext
[ https://issues.apache.org/jira/browse/SPARK-6320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308824#comment-15308824 ] Apache Spark commented on SPARK-6320: - User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/13426 > Adding new query plan strategy to SQLContext > > > Key: SPARK-6320 > URL: https://issues.apache.org/jira/browse/SPARK-6320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 >Reporter: Youssef Hatem >Priority: Minor > > Hi, > I would like to add a new strategy to {{SQLContext}}. To do this I created a > new class which extends {{Strategy}}. In my new class I need to call > {{planLater}} function. However this method is defined in {{SparkPlanner}} > (which itself inherits the method from {{QueryPlanner}}). > To my knowledge the only way to make {{planLater}} function visible to my new > strategy is to define my strategy inside another class that extends > {{SparkPlanner}} and inherits {{planLater}} as a result, by doing so I will > have to extend the {{SQLContext}} such that I can override the {{planner}} > field with the new {{Planner}} class I created. > It seems that this is a design problem because adding a new strategy seems to > require extending {{SQLContext}} (unless I am doing it wrong and there is a > better way to do it). > Thanks a lot, > Youssef -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15441) dataset outer join seems to return incorrect result
[ https://issues.apache.org/jira/browse/SPARK-15441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308815#comment-15308815 ] Apache Spark commented on SPARK-15441: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/13425 > dataset outer join seems to return incorrect result > --- > > Key: SPARK-15441 > URL: https://issues.apache.org/jira/browse/SPARK-15441 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Wenchen Fan >Priority: Critical > > See notebook > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/2836020637783173/5382278320999420/latest.html > {code} > import org.apache.spark.sql.functions > val left = List(("a", 1), ("a", 2), ("b", 3), ("c", 4)).toDS() > val right = List(("a", "x"), ("b", "y"), ("d", "z")).toDS() > // The last row _1 should be null, rather than (null, -1) > left.toDF("k", "v").as[(String, Int)].alias("left") > .joinWith(right.toDF("k", "u").as[(String, String)].alias("right"), > functions.col("left.k") === functions.col("right.k"), "right_outer") > .show() > {code} > The returned result currently is > {code} > +---------+-----+ > |       _1|   _2| > +---------+-----+ > |    (a,2)|(a,x)| > |    (a,1)|(a,x)| > |    (b,3)|(b,y)| > |(null,-1)|(d,z)| > +---------+-----+ > {code}
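For reference, the expected right-outer-join semantics: an unmatched right row should pair with a true null on the left, not with a default-valued tuple like (null, -1). A minimal pure-Python model of the intended behavior (illustrative only, not Spark's join implementation):

```python
def right_outer_join(left, right, key_left, key_right):
    # Pair every right row with its matching left rows; when no left row
    # matches, emit (None, right_row) -- the left side is a real null,
    # not a tuple of default values.
    out = []
    for r in right:
        matches = [l for l in left if key_left(l) == key_right(r)]
        if matches:
            out.extend((l, r) for l in matches)
        else:
            out.append((None, r))  # unmatched right row
    return out

left = [("a", 1), ("a", 2), ("b", 3), ("c", 4)]
right = [("a", "x"), ("b", "y"), ("d", "z")]
result = right_outer_join(left, right, lambda t: t[0], lambda t: t[0])
# ("d", "z") has no left-side match, so its left element is None
```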
[jira] [Commented] (SPARK-15447) Performance test for ALS in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-15447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308797#comment-15308797 ] Nick Pentreath commented on SPARK-15447: Created a Google sheet with initial results: https://docs.google.com/spreadsheets/d/1iX5LisfXcZSTCHp8VPoo5z-eCO85A5VsZDtZ5e475ks/edit?usp=sharing So far for SPARK-6717 I've just used {{spark-perf}} to compare the RDD-based APIs (as the checkpointing only impacts the RDD-based {{train}} method). From these results no red flags, and 2.0 is actually faster in general relative to 1.6. Checkpointing does add a minor overhead (but this overhead is consistent across the versions and again better in 2.0). There is something a little weird about the 1.6 results for 10m ratings case, but not sure what's going on there - I've rerun a few times with the same result. Also, haven't managed to get to 1b ratings yet due to cluster size, will keep working on it. > Performance test for ALS in Spark 2.0 > - > > Key: SPARK-15447 > URL: https://issues.apache.org/jira/browse/SPARK-15447 > Project: Spark > Issue Type: Task > Components: ML >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Nick Pentreath >Priority: Critical > Labels: QA > > We made several changes to ALS in 2.0. It is necessary to run some tests to > avoid performance regression. We should test (synthetic) datasets from 1 > million ratings to 1 billion ratings. > cc [~mlnick] [~holdenk] Do you have time to run some large-scale performance > tests? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15557) expression ((cast(99 as decimal) + '3') * '2.3' ) return null
[ https://issues.apache.org/jira/browse/SPARK-15557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-15557: --- Assignee: Dilip Biswal > expression ((cast(99 as decimal) + '3') * '2.3' ) return null > - > > Key: SPARK-15557 > URL: https://issues.apache.org/jira/browse/SPARK-15557 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 > Environment: spark1.6.1 >Reporter: cen yuhai >Assignee: Dilip Biswal > Fix For: 2.0.0 > > > expression "select (cast(99 as decimal(19,6))+ '3')*'2.3' " will return null > expression "select (cast(40 as decimal(19,6))+ '3')*'2.3' " is OK > I find that maybe it will be null if the result is more than 100 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15557) expression ((cast(99 as decimal) + '3') * '2.3' ) return null
[ https://issues.apache.org/jira/browse/SPARK-15557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-15557. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13368 [https://github.com/apache/spark/pull/13368] > expression ((cast(99 as decimal) + '3') * '2.3' ) return null > - > > Key: SPARK-15557 > URL: https://issues.apache.org/jira/browse/SPARK-15557 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 > Environment: spark1.6.1 >Reporter: cen yuhai > Fix For: 2.0.0 > > > expression "select (cast(99 as decimal(19,6))+ '3')*'2.3' " will return null > expression "select (cast(40 as decimal(19,6))+ '3')*'2.3' " is OK > I find that maybe it will be null if the result is more than 100 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
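The symptom matches a precision overflow in the inferred result type: {{(99 + 3) * 2.3 = 234.6}} needs four significant digits while {{(40 + 3) * 2.3 = 98.9}} needs only three, so if the analyzer allots too few digits to the result, only the larger value overflows, and Spark represents decimal overflow as null. Python's {{decimal}} module can mimic the effect by trapping results that need more digits than the context allows (the precision values below are illustrative, not Spark's actual inference rules):

```python
from decimal import Context, Decimal, Inexact

# With 4 significant digits, both products fit exactly.
wide = Context(prec=4, traps=[Inexact])
assert wide.multiply(Decimal("102"), Decimal("2.3")) == Decimal("234.6")

# With only 3 significant digits, 98.9 still fits but 234.6 does not:
# the Inexact trap fires, which is the analogue of Spark returning null
# for the cast(99 ...) expression while the cast(40 ...) one works.
narrow = Context(prec=3, traps=[Inexact])
assert narrow.multiply(Decimal("43"), Decimal("2.3")) == Decimal("98.9")
try:
    narrow.multiply(Decimal("102"), Decimal("2.3"))
    overflowed = False
except Inexact:
    overflowed = True
```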
[jira] [Assigned] (SPARK-15489) Dataset kryo encoder won't load custom user settings
[ https://issues.apache.org/jira/browse/SPARK-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15489: Assignee: Apache Spark > Dataset kryo encoder won't load custom user settings > - > > Key: SPARK-15489 > URL: https://issues.apache.org/jira/browse/SPARK-15489 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Amit Sela >Assignee: Apache Spark > > When setting a custom "spark.kryo.registrator" (or any other configuration > for that matter) through the API, this configuration will not propagate to > the encoder that uses a KryoSerializer since it instantiates with "new > SparkConf()". > See: > https://github.com/apache/spark/blob/07c36a2f07fcf5da6fb395f830ebbfc10eb27dcc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala#L554 > This could be hacked by providing those configurations as System properties, > but this probably should be passed to the encoder and set in the > SerializerInstance after creation. > Example: > When using Encoders with kryo to encode generically typed Objects in the > following manner: > public static Encoder encoder() { > return Encoders.kryo((Class) Object.class); > } > I get a decoding exception when trying to decode > `java.util.Collections$UnmodifiableCollection`, which probably comes from > Guava's `ImmutableList`. > This happens when running with master = local[1]. Same code had no problems > with RDD api. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15489) Dataset kryo encoder won't load custom user settings
[ https://issues.apache.org/jira/browse/SPARK-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15489: Assignee: (was: Apache Spark) > Dataset kryo encoder won't load custom user settings > - > > Key: SPARK-15489 > URL: https://issues.apache.org/jira/browse/SPARK-15489 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Amit Sela > > When setting a custom "spark.kryo.registrator" (or any other configuration > for that matter) through the API, this configuration will not propagate to > the encoder that uses a KryoSerializer since it instantiates with "new > SparkConf()". > See: > https://github.com/apache/spark/blob/07c36a2f07fcf5da6fb395f830ebbfc10eb27dcc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala#L554 > This could be hacked by providing those configurations as System properties, > but this probably should be passed to the encoder and set in the > SerializerInstance after creation. > Example: > When using Encoders with kryo to encode generically typed Objects in the > following manner: > public static Encoder encoder() { > return Encoders.kryo((Class) Object.class); > } > I get a decoding exception when trying to decode > `java.util.Collections$UnmodifiableCollection`, which probably comes from > Guava's `ImmutableList`. > This happens when running with master = local[1]. Same code had no problems > with RDD api. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15489) Dataset kryo encoder won't load custom user settings
[ https://issues.apache.org/jira/browse/SPARK-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308794#comment-15308794 ] Apache Spark commented on SPARK-15489: -- User 'amitsela' has created a pull request for this issue: https://github.com/apache/spark/pull/13424 > Dataset kryo encoder won't load custom user settings > - > > Key: SPARK-15489 > URL: https://issues.apache.org/jira/browse/SPARK-15489 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Amit Sela > > When setting a custom "spark.kryo.registrator" (or any other configuration > for that matter) through the API, this configuration will not propagate to > the encoder that uses a KryoSerializer since it instantiates with "new > SparkConf()". > See: > https://github.com/apache/spark/blob/07c36a2f07fcf5da6fb395f830ebbfc10eb27dcc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala#L554 > This could be hacked by providing those configurations as System properties, > but this probably should be passed to the encoder and set in the > SerializerInstance after creation. > Example: > When using Encoders with kryo to encode generically typed Objects in the > following manner: > public static Encoder encoder() { > return Encoders.kryo((Class) Object.class); > } > I get a decoding exception when trying to decode > `java.util.Collections$UnmodifiableCollection`, which probably comes from > Guava's `ImmutableList`. > This happens when running with master = local[1]. Same code had no problems > with RDD api. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
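The proposed fix direction — pass the active configuration to the encoder's serializer instead of instantiating defaults — can be sketched generically. Plain dicts stand in for SparkConf here; the helper name is hypothetical, not the real API:

```python
# The encoder currently does the equivalent of `new SparkConf()`, so user
# settings such as spark.kryo.registrator never reach the serializer.
DEFAULTS = {"spark.kryo.registrator": None}

def make_serializer_conf(active_conf=None):
    # Proposed behavior: overlay the session's active conf on the defaults
    # instead of ignoring it.
    return {**DEFAULTS, **(active_conf or {})}

session_conf = {"spark.kryo.registrator": "com.example.MyRegistrator"}
broken = make_serializer_conf()             # what `new SparkConf()` yields
fixed = make_serializer_conf(session_conf)  # what the encoder should use
```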
[jira] [Updated] (SPARK-15076) Add ReorderAssociativeOperator optimizer
[ https://issues.apache.org/jira/browse/SPARK-15076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-15076: -- Description: This issue adds a new optimizer `ReorderAssociativeOperator` by taking advantage of integral associative property. Currently, Spark works like the following. 1) Can optimize `1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + a` into `45 + a`. 2) Cannot optimize `a + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9`. This issue can handle Case 2 for *Add/Multiply* expression whose data types are `ByteType`, `ShortType`, `IntegerType`, and `LongType`. The following is the plan comparison between `before` and `after` this issue. **Before** {code} scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) a)").explain == Physical Plan == WholeStageCodegen : +- Project [(a#7 + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9) AS (a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8] : +- INPUT +- Generate explode([1]), false, false, [a#7] +- Scan OneRowRelation[] {code} **After** {code} scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) a)").explain == Physical Plan == WholeStageCodegen : +- Project [(a#7 + 45) AS (a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8] : +- INPUT +- Generate explode([1]), false, false, [a#7] +- Scan OneRowRelation[] {code} was: This issue adds a new optimizer `ConstantFolding` by taking advantage of integral associative property. Currently, `ConstantFolding` works like the following. 1) Can optimize `1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + a` into `45 + a`. 2) Cannot optimize `a + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9`. This issue enables `ConstantFolding` to handle Case 2 for *Add/Multiply* expression whose data types are `ByteType`, `ShortType`, `IntegerType`, and `LongType`. The following is the plan comparison between `before` and `after` this issue. 
**Before** {code} scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) a)").explain == Physical Plan == WholeStageCodegen : +- Project [(a#7 + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9) AS (a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8] : +- INPUT +- Generate explode([1]), false, false, [a#7] +- Scan OneRowRelation[] {code} **After** {code} scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) a)").explain == Physical Plan == WholeStageCodegen : +- Project [(a#7 + 45) AS (a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8] : +- INPUT +- Generate explode([1]), false, false, [a#7] +- Scan OneRowRelation[] {code} > Add ReorderAssociativeOperator optimizer > > > Key: SPARK-15076 > URL: https://issues.apache.org/jira/browse/SPARK-15076 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Reporter: Dongjoon Hyun > > This issue adds a new optimizer `ReorderAssociativeOperator` by taking > advantage of integral associative property. Currently, Spark works like the > following. > 1) Can optimize `1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + a` into `45 + a`. > 2) Cannot optimize `a + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9`. > This issue can handle Case 2 for *Add/Multiply* expression whose data types > are `ByteType`, `ShortType`, `IntegerType`, and `LongType`. The following > is the plan comparison between `before` and `after` this issue. 
> **Before** > {code} > scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) > a)").explain > == Physical Plan == > WholeStageCodegen > : +- Project [(a#7 + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9) AS > (a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8] > : +- INPUT > +- Generate explode([1]), false, false, [a#7] >+- Scan OneRowRelation[] > {code} > **After** > {code} > scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) > a)").explain > == Physical Plan == > WholeStageCodegen > : +- Project [(a#7 + 45) AS (a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + > 8) + 9)#8] > : +- INPUT > +- Generate explode([1]), false, false, [a#7] >+- Scan OneRowRelation[] > {code}
[jira] [Updated] (SPARK-15076) Add ReorderAssociativeOperator optimizer
[ https://issues.apache.org/jira/browse/SPARK-15076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-15076: -- Summary: Add ReorderAssociativeOperator optimizer (was: Improve ConstantFolding optimizer to use integral associative property) > Add ReorderAssociativeOperator optimizer > > > Key: SPARK-15076 > URL: https://issues.apache.org/jira/browse/SPARK-15076 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Reporter: Dongjoon Hyun > > This issue improves the `ConstantFolding` optimizer by taking advantage of > integral associative property. Currently, `ConstantFolding` works like the > following. > 1) Can optimize `1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + a` into `45 + a`. > 2) Cannot optimize `a + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9`. > This issue enables `ConstantFolding` to handle Case 2 for *Add/Multiply* > expression whose data types are `ByteType`, `ShortType`, `IntegerType`, and > `LongType`. The following is the plan comparison between `before` and > `after` this issue. > **Before** > {code} > scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) > a)").explain > == Physical Plan == > WholeStageCodegen > : +- Project [(a#7 + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9) AS > (a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8] > : +- INPUT > +- Generate explode([1]), false, false, [a#7] >+- Scan OneRowRelation[] > {code} > **After** > {code} > scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) > a)").explain > == Physical Plan == > WholeStageCodegen > : +- Project [(a#7 + 45) AS (a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + > 8) + 9)#8] > : +- INPUT > +- Generate explode([1]), false, false, [a#7] >+- Scan OneRowRelation[] > {code}
[jira] [Updated] (SPARK-15076) Add ReorderAssociativeOperator optimizer
[ https://issues.apache.org/jira/browse/SPARK-15076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-15076: -- Description: This issue adds a new optimizer `ConstantFolding` by taking advantage of integral associative property. Currently, `ConstantFolding` works like the following. 1) Can optimize `1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + a` into `45 + a`. 2) Cannot optimize `a + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9`. This issue enables `ConstantFolding` to handle Case 2 for *Add/Multiply* expression whose data types are `ByteType`, `ShortType`, `IntegerType`, and `LongType`. The following is the plan comparison between `before` and `after` this issue. **Before** {code} scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) a)").explain == Physical Plan == WholeStageCodegen : +- Project [(a#7 + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9) AS (a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8] : +- INPUT +- Generate explode([1]), false, false, [a#7] +- Scan OneRowRelation[] {code} **After** {code} scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) a)").explain == Physical Plan == WholeStageCodegen : +- Project [(a#7 + 45) AS (a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8] : +- INPUT +- Generate explode([1]), false, false, [a#7] +- Scan OneRowRelation[] {code} was: This issue improves the `ConstantFolding` optimizer by taking advantage of integral associative property. Currently, `ConstantFolding` works like the following. 1) Can optimize `1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + a` into `45 + a`. 2) Cannot optimize `a + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9`. This issue enables `ConstantFolding` to handle Case 2 for *Add/Multiply* expression whose data types are `ByteType`, `ShortType`, `IntegerType`, and `LongType`. The following is the plan comparison between `before` and `after` this issue. 
**Before** {code} scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) a)").explain == Physical Plan == WholeStageCodegen : +- Project [(a#7 + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9) AS (a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8] : +- INPUT +- Generate explode([1]), false, false, [a#7] +- Scan OneRowRelation[] {code} **After** {code} scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) a)").explain == Physical Plan == WholeStageCodegen : +- Project [(a#7 + 45) AS (a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8] : +- INPUT +- Generate explode([1]), false, false, [a#7] +- Scan OneRowRelation[] {code} > Add ReorderAssociativeOperator optimizer > > > Key: SPARK-15076 > URL: https://issues.apache.org/jira/browse/SPARK-15076 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Reporter: Dongjoon Hyun > > This issue adds a new optimizer `ConstantFolding` by taking advantage of > integral associative property. Currently, `ConstantFolding` works like the > following. > 1) Can optimize `1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + a` into `45 + a`. > 2) Cannot optimize `a + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9`. > This issue enables `ConstantFolding` to handle Case 2 for *Add/Multiply* > expression whose data types are `ByteType`, `ShortType`, `IntegerType`, and > `LongType`. The following > is the plan comparison between `before` and > `after` this issue. 
> **Before** > {code} > scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) > a)").explain > == Physical Plan == > WholeStageCodegen > : +- Project [(a#7 + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9) AS > (a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8] > : +- INPUT > +- Generate explode([1]), false, false, [a#7] >+- Scan OneRowRelation[] > {code} > **After** > {code} > scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) > a)").explain > == Physical Plan == > WholeStageCodegen > : +- Project [(a#7 + 45) AS (a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + > 8) + 9)#8] > : +- INPUT > +- Generate explode([1]), false, false, [a#7] >+- Scan OneRowRelation[] > {code}
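The transformation the optimizer performs — flatten an associative chain of additions, fold the literal terms into one constant, and rebuild the expression — can be sketched over plain tuples (this models the idea only, not Catalyst's actual expression trees):

```python
def flatten_add(expr):
    # expr is an int literal, a str column name, or ("+", left, right)
    if isinstance(expr, tuple) and expr[0] == "+":
        return flatten_add(expr[1]) + flatten_add(expr[2])
    return [expr]

def reorder_and_fold(expr):
    # Associativity lets us gather all literal terms regardless of where
    # they sit in the chain, then fold them into a single constant.
    terms = flatten_add(expr)
    const = sum(t for t in terms if isinstance(t, int))
    cols = [t for t in terms if isinstance(t, str)]
    if not cols:
        return const
    result = cols[0]
    for c in cols[1:]:
        result = ("+", result, c)
    if const != 0:
        result = ("+", result, const)
    return result

# Build a + 1 + 2 + ... + 9, the case Spark previously could not fold
expr = "a"
for i in range(1, 10):
    expr = ("+", expr, i)
# reorder_and_fold(expr) collapses the nine literals into a single 45
```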
[jira] [Resolved] (SPARK-15327) Catalyst code generation fails with complex data structure
[ https://issues.apache.org/jira/browse/SPARK-15327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-15327. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13235 [https://github.com/apache/spark/pull/13235] > Catalyst code generation fails with complex data structure > -- > > Key: SPARK-15327 > URL: https://issues.apache.org/jira/browse/SPARK-15327 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Jurriaan Pruis >Assignee: Davies Liu > Fix For: 2.0.0 > > Attachments: full_exception.txt > > > Spark code generation fails with the following error when loading parquet > files with a complex structure: > {code} > : java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 158, Column 16: Expression "scan_isNull" is not an > rvalue > {code} > The generated code on line 158 looks like: > {code} > /* 153 */ this.scan_arrayWriter23 = new > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter(); > /* 154 */ this.scan_rowWriter40 = new > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(scan_holder, > 1); > /* 155 */ } > /* 156 */ > /* 157 */ private void scan_apply_0(InternalRow scan_row) { > /* 158 */ if (scan_isNull) { > /* 159 */ scan_rowWriter.setNullAt(0); > /* 160 */ } else { > /* 161 */ // Remember the current cursor so that we can calculate how > many bytes are > /* 162 */ // written later. 
> /* 163 */ final int scan_tmpCursor = scan_holder.cursor; > /* 164 */ > {code} > How to reproduce (Pyspark): > {code} > # Some complex structure > json = '{"h": {"b": {"c": [{"e": "adfgd"}], "a": [{"e": "testing", "count": > 3}], "b": [{"e": "test", "count": 1}]}}, "d": {"b": {"c": [{"e": "adfgd"}], > "a": [{"e": "testing", "count": 3}], "b": [{"e": "test", "count": 1}]}}, "c": > {"b": {"c": [{"e": "adfgd"}], "a": [{"count": 3}], "b": [{"e": "test", > "count": 1}]}}, "a": {"b": {"c": [{"e": "adfgd"}], "a": [{"count": 3}], "b": > [{"e": "test", "count": 1}]}}, "e": {"b": {"c": [{"e": "adfgd"}], "a": [{"e": > "testing", "count": 3}], "b": [{"e": "test", "count": 1}]}}, "g": {"b": {"c": > [{"e": "adfgd"}], "a": [{"e": "testing", "count": 3}], "b": [{"e": "test", > "count": 1}]}}, "f": {"b": {"c": [{"e": "adfgd"}], "a": [{"e": "testing", > "count": 3}], "b": [{"e": "test", "count": 1}]}}, "b": {"b": {"c": [{"e": > "adfgd"}], "a": [{"count": 3}], "b": [{"e": "test", "count": 1}]}}}' > # Write to parquet file > sqlContext.read.json(sparkContext.parallelize([json])).write.mode('overwrite').parquet('test') > # Try to read from parquet file (this generates an exception) > sqlContext.read.parquet('test').collect() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows
[ https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-6859. --- Resolution: Fixed Assignee: Ryan Blue Fix Version/s: 2.0.0 Fixed by upgrading parquet-mr to 1.8.1. > Parquet File Binary column statistics error when reuse byte[] among rows > > > Key: SPARK-6859 > URL: https://issues.apache.org/jira/browse/SPARK-6859 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0, 1.3.0, 1.4.0 >Reporter: Yijie Shen >Assignee: Ryan Blue >Priority: Minor > Fix For: 2.0.0 > > > Suppose I create a dataRDD which extends RDD\[Row\], and each row is > GenericMutableRow(Array(Int, Array\[Byte\])). A same Array\[Byte\] object is > reused among rows but has different content each time. When I convert it to a > dataFrame and save it as Parquet File, the file's row group statistic(max & > min) of Binary column would be wrong. > \\ > \\ > Here is the reason: In Parquet, BinaryStatistic just keep max & min as > parquet.io.api.Binary references, Spark sql would generate a new Binary > backed by the same Array\[Byte\] passed from row. > > | |reference| |backed| | > |max: Binary|-->|ByteArrayBackedBinary|-->|Array\[Byte\]| > Therefore, each time parquet updating row group's statistic, max & min would > always refer to the same Array\[Byte\], which has new content each time. When > parquet decides to save it into file, the last row's content would be saved > as both max & min. > \\ > \\ > It seems it is a parquet bug because it's parquet's responsibility to update > statistics correctly. > But not quite sure. Should I report it as a bug in parquet JIRA? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows
[ https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308754#comment-15308754 ] Cheng Lian commented on SPARK-6859: --- Yea, thanks. I'm closing it. > Parquet File Binary column statistics error when reuse byte[] among rows > > > Key: SPARK-6859 > URL: https://issues.apache.org/jira/browse/SPARK-6859 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0, 1.3.0, 1.4.0 >Reporter: Yijie Shen >Priority: Minor > Fix For: 2.0.0 > > > Suppose I create a dataRDD which extends RDD\[Row\], and each row is > GenericMutableRow(Array(Int, Array\[Byte\])). A same Array\[Byte\] object is > reused among rows but has different content each time. When I convert it to a > dataFrame and save it as Parquet File, the file's row group statistic(max & > min) of Binary column would be wrong. > \\ > \\ > Here is the reason: In Parquet, BinaryStatistic just keep max & min as > parquet.io.api.Binary references, Spark sql would generate a new Binary > backed by the same Array\[Byte\] passed from row. > > | |reference| |backed| | > |max: Binary|-->|ByteArrayBackedBinary|-->|Array\[Byte\]| > Therefore, each time parquet updating row group's statistic, max & min would > always refer to the same Array\[Byte\], which has new content each time. When > parquet decides to save it into file, the last row's content would be saved > as both max & min. > \\ > \\ > It seems it is a parquet bug because it's parquet's responsibility to update > statistics correctly. > But not quite sure. Should I report it as a bug in parquet JIRA? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
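The aliasing failure the reporter describes — statistics that hold a reference to the row's reused buffer see every later mutation, so min and max both collapse to the last row written — can be reproduced in a few lines of Python. This is a conceptual model, not Parquet's actual statistics code:

```python
class Stats:
    """Tracks min/max, either by copying values or by aliasing the buffer."""
    def __init__(self, copy_on_update):
        self.copy = copy_on_update
        self.min = self.max = None

    def update(self, buf: bytearray):
        value = bytes(buf) if self.copy else buf  # copy vs. keep a reference
        if self.min is None or bytes(value) < bytes(self.min):
            self.min = value
        if self.max is None or bytes(value) > bytes(self.max):
            self.max = value

def write_rows(stats):
    buf = bytearray(3)              # one buffer reused for every row
    for word in (b"foo", b"bar", b"zap", b"mid"):
        buf[:] = word               # overwrite contents in place
        stats.update(buf)
    return bytes(stats.min), bytes(stats.max)

# Aliasing: min and max both end up as the last row written (b"mid").
# Copying: min is b"bar" and max is b"zap", as expected.
```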
[jira] [Commented] (SPARK-15451) Spark PR builder should fail if code doesn't compile against JDK 7
[ https://issues.apache.org/jira/browse/SPARK-15451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308703#comment-15308703 ] Sean Owen commented on SPARK-15451: --- Yeah this has come up a few times in Spark -- there's a duplicate that I won't even bother to find and link here. Classic, weird cross compilation issue. > Spark PR builder should fail if code doesn't compile against JDK 7 > -- > > Key: SPARK-15451 > URL: https://issues.apache.org/jira/browse/SPARK-15451 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > > We need to compile certain parts of the build using jdk8, so that we test > things like lambdas. But when possible, we should either compile using jdk7, > or provide jdk7's rt.jar to javac. Otherwise it's way too easy to slip in > jdk8-specific library calls. > I'll take a look at fixing the maven / sbt files, but I'm not sure how to > update the PR builders since this will most probably require at least a new > env variable (to say where jdk7 is). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11293) ExternalSorter and ExternalAppendOnlyMap should free shuffle memory in their stop() methods
[ https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-11293: --- Summary: ExternalSorter and ExternalAppendOnlyMap should free shuffle memory in their stop() methods (was: Spillable collections leak shuffle memory) > ExternalSorter and ExternalAppendOnlyMap should free shuffle memory in their > stop() methods > --- > > Key: SPARK-11293 > URL: https://issues.apache.org/jira/browse/SPARK-11293 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.1, 1.4.1, 1.5.1, 1.6.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Critical > Fix For: 1.6.0 > > > I discovered multiple leaks of shuffle memory while working on my memory > manager consolidation patch, which added the ability to do strict memory leak > detection for the bookkeeping that used to be performed by the > ShuffleMemoryManager. This uncovered a handful of places where tasks can > acquire execution/shuffle memory but never release it, starving themselves of > memory. > Problems that I found: > * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution > memory. > * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a > {{CompletionIterator}}. > * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing > its resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13484) Filter outer joined result using a non-nullable column from the right table
[ https://issues.apache.org/jira/browse/SPARK-13484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308734#comment-15308734 ] Apache Spark commented on SPARK-13484: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/13290 > Filter outer joined result using a non-nullable column from the right table > --- > > Key: SPARK-13484 > URL: https://issues.apache.org/jira/browse/SPARK-13484 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.0, 2.0.0 >Reporter: Xiangrui Meng > > Technically speaking, this is not a bug. But > {code} > val a = sqlContext.range(10).select(col("id"), lit(0).as("count")) > val b = sqlContext.range(10).select((col("id") % > 3).as("id")).groupBy("id").count() > a.join(b, a("id") === b("id"), "left_outer").filter(b("count").isNull).show() > {code} > returns nothing. This is because `b("count")` is not nullable and the filter > condition is always false by static analysis. However, it is common for users > to use `a(...)` and `b(...)` to filter the joined result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11293) Spillable collections leak shuffle memory
[ https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-11293:
-------------------------------
    Affects Version/s:     (was: 1.6.1)
[jira] [Resolved] (SPARK-11293) Spillable collections leak shuffle memory
[ https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen resolved SPARK-11293.
--------------------------------
       Resolution: Fixed
    Fix Version/s: 1.6.0

This issue was fixed in 1.6.0 as part of the patch for SPARK-10984 (my original PR for that issue was subsumed by the PR for SPARK-10984). We will not backport this patch to 1.5.x or 1.4.x because we do not plan to have further non-security-fix releases for those branches.
[jira] [Assigned] (SPARK-11293) Spillable collections leak shuffle memory
[ https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen reassigned SPARK-11293:
----------------------------------
    Assignee: Josh Rosen  (was: Davies Liu)