[jira] [Updated] (SPARK-15691) Refactor and improve Hive support
[ https://issues.apache.org/jira/browse/SPARK-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-15691:
--------------------------------
    Summary: Refactor and improve Hive support  (was: Refactor Hive support)

> Refactor and improve Hive support
> ---------------------------------
>
>                 Key: SPARK-15691
>                 URL: https://issues.apache.org/jira/browse/SPARK-15691
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>
> Hive support is important to Spark SQL, as many Spark users use it to read
> from Hive. The current architecture is very difficult to maintain, and this
> ticket tracks progress towards getting us to a sane state.
> A number of things we want to

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14343) Dataframe operations on a partitioned dataset (using partition discovery) return invalid results
[ https://issues.apache.org/jira/browse/SPARK-14343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian reassigned SPARK-14343:
----------------------------------
    Assignee: Cheng Lian

> Dataframe operations on a partitioned dataset (using partition discovery)
> return invalid results
> --------------------------------------------------------------------------
>
>                 Key: SPARK-14343
>                 URL: https://issues.apache.org/jira/browse/SPARK-14343
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1, 2.0.0
>         Environment: Mac OS X 10.11.4 / Ubuntu 16.04 LTS
>            Reporter: Jurriaan Pruis
>            Assignee: Cheng Lian
>            Priority: Critical
>
> When reading a dataset using {{sqlContext.read.text()}} queries on the
> partitioned column return invalid results.
> h2. How to reproduce:
> h3. Generate datasets
> {code:title=repro.sh}
> #!/bin/sh
> mkdir -p dataset/year=2014
> mkdir -p dataset/year=2015
> echo "data from 2014" > dataset/year=2014/part01.txt
> echo "data from 2015" > dataset/year=2015/part01.txt
> {code}
> {code:title=repro2.sh}
> #!/bin/sh
> mkdir -p dataset2/month=june
> mkdir -p dataset2/month=july
> echo "data from june" > dataset2/month=june/part01.txt
> echo "data from july" > dataset2/month=july/part01.txt
> {code}
> h3. using first dataset
> {code:none}
> >>> df = sqlContext.read.text('dataset')
> ...
> >>> df
> DataFrame[value: string, year: int]
> >>> df.show()
> +--------------+----+
> |         value|year|
> +--------------+----+
> |data from 2014|2014|
> |data from 2015|2015|
> +--------------+----+
> >>> df.select('year').show()
> +----+
> |year|
> +----+
> |  14|
> |  14|
> +----+
> {code}
> This is clearly wrong. Seems like it returns the length of the value column?
> h3. using second dataset
> With another dataset it looks like this:
> {code:none}
> >>> df = sqlContext.read.text('dataset2')
> >>> df
> DataFrame[value: string, month: string]
> >>> df.show()
> +--------------+-----+
> |         value|month|
> +--------------+-----+
> |data from june| june|
> |data from july| july|
> +--------------+-----+
> >>> df.select('month').show()
> +--------------+
> |         month|
> +--------------+
> |data from june|
> |data from july|
> +--------------+
> {code}
> Here it returns the value of the value column instead of the month partition.
> h3. Workaround
> When I convert the dataframe to an RDD and back to a DataFrame I get the
> following result (which is the expected behaviour):
> {code:none}
> >>> df.rdd.toDF().select('month').show()
> +-----+
> |month|
> +-----+
> | june|
> | july|
> +-----+
> {code}
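For background on what the buggy queries above should return: partition discovery derives extra columns from `name=value` directory components of each file's path. A minimal sketch of that inference, illustrative only and not Spark's actual implementation:

```python
import os

def discover_partitions(path):
    """Infer partition columns from 'name=value' directory components of a path."""
    partitions = {}
    for component in path.split(os.sep):
        if "=" in component:
            name, _, value = component.partition("=")
            # Spark would additionally try to cast the value (e.g. '2014' -> int);
            # this sketch keeps every value as a string for simplicity.
            partitions[name] = value
    return partitions

# A file under dataset/year=2014/ carries the partition value year=2014,
# which is what df.select('year') should return for rows read from it.
print(discover_partitions("dataset/year=2014/part01.txt"))   # {'year': '2014'}
print(discover_partitions("dataset2/month=june/part01.txt")) # {'month': 'june'}
```

The bug reported here is not in this inference (the schema `DataFrame[value: string, year: int]` is correct); it is in how the projected partition column gets bound to the wrong underlying data.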
[jira] [Updated] (SPARK-15691) Refactor Hive support
[ https://issues.apache.org/jira/browse/SPARK-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-15691:
--------------------------------
    Description:
Hive support is important to Spark SQL, as many Spark users use it to read from Hive. The current architecture is very difficult to maintain, and this ticket tracks progress towards getting us to a sane state.

A number of things we want to

  was:
Hive support is important to Spark SQL, as many Spark users use it to read from Hive. The current architecture is very difficult to maintain, and this ticket tracks progress towards getting us to a sane state.

> Refactor Hive support
> ---------------------
>
>                 Key: SPARK-15691
>                 URL: https://issues.apache.org/jira/browse/SPARK-15691
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>
> Hive support is important to Spark SQL, as many Spark users use it to read
> from Hive. The current architecture is very difficult to maintain, and this
> ticket tracks progress towards getting us to a sane state.
> A number of things we want to
[jira] [Created] (SPARK-15691) Refactor Hive support
Reynold Xin created SPARK-15691:
-----------------------------------
             Summary: Refactor Hive support
                 Key: SPARK-15691
                 URL: https://issues.apache.org/jira/browse/SPARK-15691
             Project: Spark
          Issue Type: New Feature
          Components: SQL
            Reporter: Reynold Xin

Hive support is important to Spark SQL, as many Spark users use it to read from Hive. The current architecture is very difficult to maintain, and this ticket tracks progress towards getting us to a sane state.
[jira] [Resolved] (SPARK-14441) Consolidate DDL tests
[ https://issues.apache.org/jira/browse/SPARK-14441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai resolved SPARK-14441.
------------------------------
    Resolution: Later

Seems we will not do it for 2.0.0. Let's resolve it as "Later".

> Consolidate DDL tests
> ---------------------
>
>                 Key: SPARK-14441
>                 URL: https://issues.apache.org/jira/browse/SPARK-14441
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL, Tests
>    Affects Versions: 2.0.0
>            Reporter: Andrew Or
>            Assignee: Andrew Or
>
> Today we have DDLSuite, DDLCommandSuite, HiveDDLCommandSuite. It's confusing
> whether a test should exist in one or the other. It also makes it less clear
> whether our test coverage is comprehensive. Ideally we should consolidate
> these files as much as possible.
[jira] [Resolved] (SPARK-14118) Implement DDL/DML commands for Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai resolved SPARK-14118.
------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0

> Implement DDL/DML commands for Spark 2.0
> ----------------------------------------
>
>                 Key: SPARK-14118
>                 URL: https://issues.apache.org/jira/browse/SPARK-14118
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Yin Huai
>            Priority: Blocker
>             Fix For: 2.0.0
>
> Right now, we have many DDL/DML commands that are passed to Hive, which may
> cause missing functionality, failures with bad error messages, or
> inconsistent behaviors (e.g. a command that works with some cases but fails
> for other cases). For Spark 2.0, it will be great to not ask Hive to process
> those DDL/DML commands.
> You can find the doc at
> https://issues.apache.org/jira/secure/attachment/12793435/Implementing%20native%20DDL%20and%20DML%20statements%20for%20Spark%202.pdf
> (under SPARK-13879).
> There are mainly two kinds of commands: (1) Native, i.e. we want to have a
> native implementation in Spark; (2) Exception, i.e. we should throw an
> exception. That doc has a few commands that are marked as TBD. We should
> first throw exceptions for them.
> Sub-tasks are created based on the doc. A command is represented by its
> corresponding Token.
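The Native-vs-Exception split described in the ticket above can be pictured as a simple dispatch table. This is purely illustrative: the command names below are hypothetical examples, not taken from the linked design doc or from Spark's parser.

```python
# Hypothetical classification of DDL commands (illustrative only).
NATIVE_COMMANDS = {"CREATE TABLE", "DROP TABLE", "ALTER TABLE RENAME"}
EXCEPTION_COMMANDS = {"CREATE INDEX", "ALTER TABLE ARCHIVE PARTITION"}

def dispatch(command):
    """Route a DDL command: run it natively in Spark, or fail fast with a clear error
    instead of silently handing it to Hive."""
    if command in NATIVE_COMMANDS:
        return f"running {command} natively"
    if command in EXCEPTION_COMMANDS:
        raise NotImplementedError(f"{command} is not supported")
    # TBD commands also throw for now, per the ticket.
    raise NotImplementedError(f"unrecognized DDL command: {command}")

print(dispatch("CREATE TABLE"))  # running CREATE TABLE natively
```

The point of the design is the explicit error path: a command that Spark cannot execute natively produces an immediate, descriptive exception rather than an inconsistent pass-through behavior.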
[jira] [Updated] (SPARK-15690) Fast single-node (single-process) in-memory shuffle
[ https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-15690:
--------------------------------
    Description:
Spark's current shuffle implementation sorts all intermediate data by their partition id, and then writes the data to disk. This is not a big bottleneck because the network throughput on commodity clusters tends to be low.

However, an increasing number of Spark users are using the system to process data on a single node. When operating on a single node against intermediate data that fits in memory, the existing shuffle code path can become a big bottleneck.

The goal of this ticket is to change Spark so it can use in-memory radix sort to do data shuffling on a single node, and still gracefully fall back to disk if the data size does not fit in memory. Given that the number of partitions is usually small (say less than 256), it'd require only a single pass to do the radix sort with pretty decent CPU efficiency.

Note that there have been many in-memory shuffle attempts in the past. This ticket has a smaller scope (single-process), and aims to actually productionize this code.

  was:
An increasing number of Spark users are using the system to process data on a single node. When operating on a single node against intermediate data that fits in memory, the existing shuffle code path can become a big bottleneck. Ideally, Spark should be able to use in-memory radix sort to do data shuffling on a single node

> Fast single-node (single-process) in-memory shuffle
> ---------------------------------------------------
>
>                 Key: SPARK-15690
>                 URL: https://issues.apache.org/jira/browse/SPARK-15690
>             Project: Spark
>          Issue Type: New Feature
>          Components: Shuffle, SQL
>            Reporter: Reynold Xin
>
> Spark's current shuffle implementation sorts all intermediate data by their
> partition id, and then writes the data to disk. This is not a big bottleneck
> because the network throughput on commodity clusters tends to be low.
> However, an increasing number of Spark users are using the system to process
> data on a single node. When operating on a single node against intermediate
> data that fits in memory, the existing shuffle code path can become a big
> bottleneck.
> The goal of this ticket is to change Spark so it can use in-memory radix sort
> to do data shuffling on a single node, and still gracefully fall back to disk
> if the data size does not fit in memory. Given that the number of partitions
> is usually small (say less than 256), it'd require only a single pass to do
> the radix sort with pretty decent CPU efficiency.
> Note that there have been many in-memory shuffle attempts in the past. This
> ticket has a smaller scope (single-process), and aims to actually
> productionize this code.
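To illustrate why a small partition count makes this cheap: with fewer than 256 partitions, one histogram pass plus one scatter pass orders every record by partition id, with no comparison sort. A sketch in Python (the real implementation would operate on binary records in managed memory, not Python tuples):

```python
def sort_by_partition(records, num_partitions):
    """Order (partition_id, payload) records by partition id in two linear passes,
    i.e. a one-digit radix (counting) sort keyed on the partition id."""
    # Pass 1: histogram of records per partition.
    counts = [0] * num_partitions
    for pid, _ in records:
        counts[pid] += 1

    # Exclusive prefix sum gives each partition's start offset in the output.
    offsets = [0] * num_partitions
    for p in range(1, num_partitions):
        offsets[p] = offsets[p - 1] + counts[p - 1]

    # Pass 2: scatter each record into its partition's slot (stable).
    out = [None] * len(records)
    cursors = list(offsets)
    for rec in records:
        pid = rec[0]
        out[cursors[pid]] = rec
        cursors[pid] += 1
    return out

records = [(2, "c"), (0, "a"), (1, "b"), (0, "d")]
print(sort_by_partition(records, 3))
# [(0, 'a'), (0, 'd'), (1, 'b'), (2, 'c')]
```

Because the key space (partition ids) is tiny, the histogram fits in cache and the whole operation is memory-bandwidth bound, which is the CPU-efficiency argument made in the ticket.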
[jira] [Comment Edited] (SPARK-15582) Support for Groovy closures
[ https://issues.apache.org/jira/browse/SPARK-15582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308036#comment-15308036 ]

Catalin Alexandru Zamfir edited comment on SPARK-15582 at 6/1/16 5:45 AM:
--------------------------------------------------------------------------

So here's the work-around; I wish there were a way not to do it this way. In short, we are:
- taking the script by text and using CompilationUnit (from Groovy) to compile the script to bytecode. This must be done on the driver node before evaluating the script;
- for each class from the above CU, we append it to the JAR on a distributed FS. We give each JAR a unique name, later passed on to addJar. This is just for isolation purposes so that each Script gets its own JAR;
- the resulting JAR is added to the Spark context using addJar;
- all closures must be dehydrated, which comes with its limitations, and nested closures don't seem to work;
- however, if you design your code around the problem of Groovy closures, using interfaces/abstract classes, and you extend or implement those classes in Groovy code, it seems there is no limitation on what you can do (eg. an AbstractFilter class/interface);

{noformat}
// Compile it
Long dateOfNow = DateTime.now ().getMillis ();
String nameOfScript = String.format ("ScriptOf%d", dateOfNow);
String pathOfJar = String.format ("distributed/fs/path/to/your/groovy-scripts/%s.jar", String.format ("JarOf%d", dateOfNow));
File archiveFile = new File (pathOfJar);
Files.createParentDirs (archiveFile);

// With resources
List compilationList = compileGroovyScript (nameOfScript, sourceCode);
try (JarArchiveOutputStream oneJar = new JarArchiveOutputStream (new FileOutputStream (new File (pathOfJar)))) {
    // For each compiled class, append its bytecode as a JAR entry
    for (Object compileClass : compilationList) {
        GroovyClass groovyClass = (GroovyClass) compileClass;
        JarArchiveEntry oneJarEntry = new JarArchiveEntry (String.format ("%s.class", groovyClass.getName ()));
        oneJarEntry.setSize (groovyClass.getBytes ().length);
        byte[] bytecodeOfClass = groovyClass.getBytes ();
        oneJar.putArchiveEntry (oneJarEntry);
        oneJar.write (bytecodeOfClass);
        oneJar.closeArchiveEntry ();
    }

    // End it up
    oneJar.finish ();
    oneJar.close ();
} catch (Exception e) {
    // Do something
}

// Append the JAR to the execution environment
sparkService.getSparkContext ().addJar (pathOfJar);

// Now we evaluate the previously compiled script below
// The executor tasks should by now have all *.class files in the JAR available
GroovyShell.evaluate (scriptText, nameOfScript);
{noformat}

Any idea on how this can be improved? (eg. not using the addJar method and the requirement to dehydrate the Groovy closures) For now this works for us, after SPARK-13599 was fixed. However, the extra code required to make it work could be a part of Spark instead, maybe using the broadcast mechanism or having a shared "cache" of classes around the cluster from where all the executor nodes can find/load-up dynamically compiled classes (eg. Groovy byte-code).

was (Author: antauri):
So here's the work-around, wished there was a way not to do it this way. In short, we are:
- taking the script by text and using CompilationUnit (from Groovy) to compile the script to bytecode. This must be done on the driver node before evaluating the script;
- for each class from the above CU, we append it to the JAR on a distributed FS. We give each JAR an unique name, later passed on to addJar. This is just for isolation purposes so that each Script gets its own JAR;
- the resulting JAR is added to the Spark context using addJar;
- all closures must by dehydrated () which comes with its limitations and also nested closures don't seem to work;
- however, if you design your code around the problem of Groovy closures, using interfaces/abstract classes and you extend or implement those classes in Groovy code it seems there is no limitation on what you can do (eg. an AbstractFilter class/interface);
{noformat}
// Compile it
Long dateOfNow = DateTime.now ().getMillis ();
String nameOfScript = String.format ("ScriptOf%d", dateOfNow);
String pathOfJar = String.format ("distributed/fs/path/to/your/groovy-scripts/%s.jar", String.format ("JarOf%d", dateOfNow));
File archiveFile = new File (pathOfJar);
Files.createParentDirs (archiveFile);
// With resources
List compilationList = compileGroovyScript (nameOfScript, sourceCode);
try (JarArchiveOutputStream oneJar = new JarArchiveOutputStream (new FileOutputStream (new File (pathOfJar
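For readers outside the JVM ecosystem, the packaging step in the comment above amounts to writing each class's bytecode into a zip archive under a `path/ClassName.class` entry name (a JAR is a plain zip). A rough sketch of that step in Python; the function name and inputs are hypothetical:

```python
import zipfile

def build_jar(jar_path, compiled_classes):
    """Write compiled class bytecode into a JAR (a plain zip archive).

    compiled_classes: mapping of fully-qualified class name -> bytecode bytes,
    standing in for the GroovyClass objects produced by the CompilationUnit.
    """
    with zipfile.ZipFile(jar_path, "w") as jar:
        for class_name, bytecode in compiled_classes.items():
            # JAR entries use '/' separators and a '.class' suffix.
            entry = class_name.replace(".", "/") + ".class"
            jar.writestr(entry, bytecode)

# Hypothetical usage: one freshly compiled class, bundled for addJar().
build_jar("ScriptOf123.jar", {"scripts.ScriptOf123": b"\xca\xfe\xba\xbe"})
with zipfile.ZipFile("ScriptOf123.jar") as jar:
    print(jar.namelist())  # ['scripts/ScriptOf123.class']
```

The per-script JAR name (here `ScriptOf123.jar`) mirrors the isolation choice in the comment: each script's classes travel in their own archive, so executors never see stale classes from earlier scripts.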
[jira] [Updated] (SPARK-15690) Fast single-node (single-process) in-memory shuffle
[ https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-15690:
--------------------------------
    Summary: Fast single-node (single-process) in-memory shuffle  (was: Fast single-node in-memory shuffle)

> Fast single-node (single-process) in-memory shuffle
> ---------------------------------------------------
>
>                 Key: SPARK-15690
>                 URL: https://issues.apache.org/jira/browse/SPARK-15690
>             Project: Spark
>          Issue Type: New Feature
>          Components: Shuffle, SQL
>            Reporter: Reynold Xin
>
> An increasing number of Spark users are using the system to process data on a
> single node. When operating on a single node against intermediate data that
> fits in memory, the existing shuffle code path can become a big bottleneck.
> Ideally, Spark should be able to use in-memory radix sort to do data
> shuffling on a single node
[jira] [Comment Edited] (SPARK-15582) Support for Groovy closures
[ https://issues.apache.org/jira/browse/SPARK-15582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308036#comment-15308036 ]

Catalin Alexandru Zamfir edited comment on SPARK-15582 at 6/1/16 5:43 AM:
--------------------------------------------------------------------------

So here's the work-around; I wish there were a way not to do it this way. In short, we are:
- taking the script by text and using CompilationUnit (from Groovy) to compile the script to bytecode. This must be done on the driver node before evaluating the script;
- for each class from the above CU, we append it to the JAR on a distributed FS. We give each JAR a unique name, later passed on to addJar. This is just for isolation purposes so that each Script gets its own JAR;
- the resulting JAR is added to the Spark context using addJar;
- all closures must be dehydrated, which comes with its limitations, and nested closures don't seem to work;
- however, if you design your code around the problem of Groovy closures, using interfaces/abstract classes, and you extend or implement those classes in Groovy code, it seems there is no limitation on what you can do (eg. an AbstractFilter class/interface);

{noformat}
// Compile it
Long dateOfNow = DateTime.now ().getMillis ();
String nameOfScript = String.format ("ScriptOf%d", dateOfNow);
String pathOfJar = String.format ("distributed/fs/path/to/your/groovy-scripts/%s.jar", String.format ("JarOf%d", dateOfNow));
File archiveFile = new File (pathOfJar);
Files.createParentDirs (archiveFile);

// With resources
List compilationList = compileGroovyScript (nameOfScript, sourceCode);
try (JarArchiveOutputStream oneJar = new JarArchiveOutputStream (new FileOutputStream (new File (pathOfJar)))) {
    // For each compiled class, append its bytecode as a JAR entry
    for (Object compileClass : compilationList) {
        GroovyClass groovyClass = (GroovyClass) compileClass;
        JarArchiveEntry oneJarEntry = new JarArchiveEntry (String.format ("%s.class", groovyClass.getName ()));
        oneJarEntry.setSize (groovyClass.getBytes ().length);
        byte[] bytecodeOfClass = groovyClass.getBytes ();
        oneJar.putArchiveEntry (oneJarEntry);
        oneJar.write (bytecodeOfClass);
        oneJar.closeArchiveEntry ();
    }

    // End it up
    oneJar.finish ();
    oneJar.close ();
} catch (Exception e) {
    // Do something
}

// Append the JAR to the execution environment
sparkService.getSparkContext ().addJar (pathOfJar);

// GroovyShell.evaluate (scriptText, nameOfScript) below;
{noformat}

Any idea on how this can be improved? (eg. not using the addJar method and the requirement to dehydrate the Groovy closures) For now this works for us, after SPARK-13599 was fixed. However, the extra code required to make it work could be a part of Spark instead, maybe using the broadcast mechanism or having a shared "cache" of classes around the cluster from where all the executor nodes can find/load-up dynamically compiled classes (eg. Groovy byte-code).

was (Author: antauri):
So here's the work-around, wished there was a way not to do it this way. In short, we are:
- taking the script by text and using CompilationUnit (from Groovy) to compile the script to bytecode. This must be done on the driver node before evaluating the script;
- for each class from the above CU, we append it to the JAR on a distributed FS. We give each JAR an unique name, later passed on to addJar. This is just for isolation purposes so that each Script gets its own JAR;
- the resulting JAR is added to the Spark context using addJar;
- all closures must by dehydrated () which comes with its limitations and also nested closures don't seem to work;
- however, if you design your code around the problem of Groovy closures, using interfaces/abstract classes and you extend or implement those classes in Groovy code it seems there is no limitation on what you can do (eg. an AbstractFilter class/interface);
{noformat}
// Compile it
Long dateOfNow = DateTime.now ().getMillis ();
String nameOfScript = String.format ("ScriptOf%d", dateOfNow);
String pathOfJar = String.format ("distributed/fs/path/to/your/groovy-scripts/%s.jar", String.format ("JarOf%d", dateOfNow));
File archiveFile = new File (pathOfJar);
Files.createParentDirs (archiveFile);
// With resources
List compilationList = compileGroovyScript (nameOfScript, sourceCode);
try (JarArchiveOutputStream oneJar = new JarArchiveOutputStream (new FileOutputStream (new File (pathOfJar { // For for (Object compileClass : compilationList) { // Append
[jira] [Created] (SPARK-15690) Fast single-node in-memory shuffle
Reynold Xin created SPARK-15690:
-----------------------------------
             Summary: Fast single-node in-memory shuffle
                 Key: SPARK-15690
                 URL: https://issues.apache.org/jira/browse/SPARK-15690
             Project: Spark
          Issue Type: New Feature
          Components: Shuffle, SQL
            Reporter: Reynold Xin

An increasing number of Spark users are using the system to process data on a single node. When operating on a single node against intermediate data that fits in memory, the existing shuffle code path can become a big bottleneck. Ideally, Spark should be able to use in-memory radix sort to do data shuffling on a single node
[jira] [Commented] (SPARK-14343) Dataframe operations on a partitioned dataset (using partition discovery) return invalid results
[ https://issues.apache.org/jira/browse/SPARK-14343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309317#comment-15309317 ]

Apache Spark commented on SPARK-14343:
--------------------------------------

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/13431
[jira] [Assigned] (SPARK-14343) Dataframe operations on a partitioned dataset (using partition discovery) return invalid results
[ https://issues.apache.org/jira/browse/SPARK-14343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14343:
------------------------------------
    Assignee:     (was: Apache Spark)
[jira] [Assigned] (SPARK-14343) Dataframe operations on a partitioned dataset (using partition discovery) return invalid results
[ https://issues.apache.org/jira/browse/SPARK-14343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14343:
------------------------------------
    Assignee: Apache Spark
[jira] [Comment Edited] (SPARK-15582) Support for Groovy closures
[ https://issues.apache.org/jira/browse/SPARK-15582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308036#comment-15308036 ]

Catalin Alexandru Zamfir edited comment on SPARK-15582 at 6/1/16 5:41 AM:
--------------------------------------------------------------------------

So here's the work-around; I wish there were a way not to do it this way. In short, we are:
- taking the script by text and using CompilationUnit (from Groovy) to compile the script to bytecode. This must be done on the driver node before evaluating the script;
- for each class from the above CU, we append it to the JAR on a distributed FS. We give each JAR a unique name, later passed on to addJar. This is just for isolation purposes so that each Script gets its own JAR;
- the resulting JAR is added to the Spark context using addJar;
- all closures must be dehydrated, which comes with its limitations, and nested closures don't seem to work;
- however, if you design your code around the problem of Groovy closures, using interfaces/abstract classes, and you extend or implement those classes in Groovy code, it seems there is no limitation on what you can do (eg. an AbstractFilter class/interface);

{noformat}
// Compile it
Long dateOfNow = DateTime.now ().getMillis ();
String nameOfScript = String.format ("ScriptOf%d", dateOfNow);
String pathOfJar = String.format ("distributed/fs/path/to/your/groovy-scripts/%s.jar", String.format ("JarOf%d", dateOfNow));
File archiveFile = new File (pathOfJar);
Files.createParentDirs (archiveFile);

// With resources
List compilationList = compileGroovyScript (nameOfScript, sourceCode);
try (JarArchiveOutputStream oneJar = new JarArchiveOutputStream (new FileOutputStream (new File (pathOfJar)))) {
    // For each compiled class, append its bytecode as a JAR entry
    for (Object compileClass : compilationList) {
        GroovyClass groovyClass = (GroovyClass) compileClass;
        JarArchiveEntry oneJarEntry = new JarArchiveEntry (String.format ("%s.class", groovyClass.getName ()));
        oneJarEntry.setSize (groovyClass.getBytes ().length);
        byte[] bytecodeOfClass = groovyClass.getBytes ();
        oneJar.putArchiveEntry (oneJarEntry);
        oneJar.write (bytecodeOfClass);
        oneJar.closeArchiveEntry ();
    }

    // End it up
    oneJar.finish ();
    oneJar.close ();
} catch (Exception e) {
    // Do something
}

// Append the JAR to the execution environment
sparkService.getSparkContext ().addJar (pathOfJar);

// GroovyShell.evaluate (scriptText, nameOfScript) below;
{noformat}

Any idea on how this can be improved? (eg. not using the addJar method and the requirement to dehydrate the Groovy closures) For now this works for us, after SPARK-13599 was fixed. However, the extra code required to make it work could be a part of Spark instead.

was (Author: antauri):
So here's the work-around, wished there was a way not to do it this way. In short, we are:
- taking the script by text and using CompilationUnit (from Groovy) to compile the script to bytecode. This must be done on the driver node before evaluating the script;
- for each class from the above CU, we append it to the JAR on a distributed FS. We give each JAR an unique name, later passed on to addJar. This is just for isolation purposes so that each Script gets its own JAR;
- the resulting JAR is added to the Spark context using addJar;
- all closures must by dehydrated () which comes with its limitations and also nested closures don't seem to work;
- however, if you design your code around the problem of Groovy closures, using interfaces/abstract classes and you extend or implement those classes in Groovy code it seems there is no limitation on what you can do (eg. an AbstractFilter class/interface);
{noformat}
// Compile it
Long dateOfNow = DateTime.now ().getMillis ();
String nameOfScript = String.format ("ScriptOf%d", dateOfNow);
String pathOfJar = String.format ("distributed/fs/path/to/your/groovy-scripts/%s.jar", String.format ("JarOf%d", dateOfNow));
File archiveFile = new File (pathOfJar);
Files.createParentDirs (archiveFile);
// With resources
List compilationList = compileGroovyScript (nameOfScript, sourceCode);
try (JarArchiveOutputStream oneJar = new JarArchiveOutputStream (new FileOutputStream (new File (pathOfJar { // For for (Object compileClass : compilationList) { // Append GroovyClass groovyClass = (GroovyClass) compileClass; JarArchiveEntry oneJarEntry = new JarArchiveEntry (String.format ("%s.class", groovyClass.getName ()));
[jira] [Closed] (SPARK-11448) We should skip caching part-files in ParquetRelation when configured to merge schema and respect summaries
[ https://issues.apache.org/jira/browse/SPARK-11448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-11448. --- Resolution: Auto Closed > We should skip caching part-files in ParquetRelation when configured to merge > schema and respect summaries > -- > > Key: SPARK-11448 > URL: https://issues.apache.org/jira/browse/SPARK-11448 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > We currently cache part-files, metadata, and common metadata in ParquetRelation as > currentLeafStatuses. However, when configured to merge schema and respect > summaries, dataStatuses (`FileStatus` objects of all part-files) are not > necessary anymore. We should skip them when caching on the driver side. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
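The filtering step SPARK-11448 proposes can be sketched as plain list logic. This is a minimal illustration only; the class and method names below are invented for the example and do not correspond to Spark's actual ParquetRelation internals:

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch: when schema merging is on and summary files are respected,
// only the _metadata / _common_metadata summaries need caching on the driver;
// the per-part-file statuses can be skipped.
public class LeafStatusFilter {
    static List<String> statusesToCache(List<String> leafFiles, boolean respectSummaries) {
        if (!respectSummaries) {
            return leafFiles; // cache everything, as before
        }
        // Keep only the summary files; skip the data part-files.
        return leafFiles.stream()
                        .filter(f -> f.endsWith("_metadata"))
                        .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("part-00000.parquet", "_metadata", "_common_metadata");
        System.out.println(statusesToCache(files, true)); // prints [_metadata, _common_metadata]
    }
}
```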
[jira] [Comment Edited] (SPARK-15582) Support for Groovy closures
[ https://issues.apache.org/jira/browse/SPARK-15582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308036#comment-15308036 ] Catalin Alexandru Zamfir edited comment on SPARK-15582 at 6/1/16 5:36 AM: --
So here's the work-around; we wish there were a way to avoid doing it this way. In short, we are:
- taking the script text and using CompilationUnit (from Groovy) to compile the script to bytecode. This must be done on the driver node before evaluating the script;
- for each class from the above CU, appending it to a JAR on a distributed FS. We give each JAR a unique name, later passed on to addJar. This is just for isolation purposes, so that each script gets its own JAR;
- adding the resulting JAR to the Spark context using addJar;
- dehydrating all closures, which comes with its limitations; nested closures also don't seem to work;
- however, if you design your code around the problem of Groovy closures, using interfaces/abstract classes, and you extend or implement those classes in Groovy code, there seems to be no limitation on what you can do (e.g. an AbstractFilter class/interface);
{noformat}
// Compile it
Long dateOfNow = DateTime.now ().getMillis ();
String nameOfScript = String.format ("ScriptOf%d", dateOfNow);
String pathOfJar = String.format ("distributed/fs/path/to/your/groovy-scripts/%s.jar", String.format ("JarOf%d", dateOfNow));
File archiveFile = new File (pathOfJar);
Files.createParentDirs (archiveFile);
// With resources
List compilationList = compileGroovyScript (nameOfScript, sourceCode);
try (JarArchiveOutputStream oneJar = new JarArchiveOutputStream (new FileOutputStream (new File (pathOfJar)))) {
    // For each compiled class
    for (Object compileClass : compilationList) {
        // Append
        GroovyClass groovyClass = (GroovyClass) compileClass;
        JarArchiveEntry oneJarEntry = new JarArchiveEntry (String.format ("%s.class", groovyClass.getName ()));
        oneJarEntry.setSize (groovyClass.getBytes ().length);
        byte[] bytecodeOfClass = groovyClass.getBytes ();
        oneJar.putArchiveEntry (oneJarEntry);
        oneJar.write (bytecodeOfClass);
        oneJar.closeArchiveEntry ();
    }
    // End it up
    oneJar.finish ();
    oneJar.close ();
} catch (Exception e) {
    // Do something
}
// Append the JAR to the execution environment
sparkService.getSparkContext ().addJar (pathOfJar);
// GroovyShell.evaluate (scriptText, nameOfScript) below;
{noformat}
Any idea on how this can be improved? (e.g. not using the addJar method, and lifting the requirement to dehydrate the Groovy closures)
was (Author: antauri): So here's the work-around; we wish there were a way to avoid doing it this way.
In short, we are:
- taking the script text and using CompilationUnit (from Groovy) to compile the script to bytecode;
- appending each class to the JAR;
- adding the resulting JAR to the Spark context using addJar;
- dehydrating all closures, which comes with its limitations;
{noformat}
// Compile it
Long dateOfNow = DateTime.now ().getMillis ();
String nameOfScript = String.format ("ScriptOf%d", dateOfNow);
String pathOfJar = String.format ("distributed/fs/path/to/your/groovy-scripts/%s.jar", String.format ("JarOf%d", dateOfNow));
File archiveFile = new File (pathOfJar);
Files.createParentDirs (archiveFile);
// With resources
List compilationList = compileGroovyScript (nameOfScript, sourceCode);
try (JarArchiveOutputStream oneJar = new JarArchiveOutputStream (new FileOutputStream (new File (pathOfJar)))) {
    // For each compiled class
    for (Object compileClass : compilationList) {
        // Append
        GroovyClass groovyClass = (GroovyClass) compileClass;
        JarArchiveEntry oneJarEntry = new JarArchiveEntry (String.format ("%s.class", groovyClass.getName ()));
        oneJarEntry.setSize (groovyClass.getBytes ().length);
        byte[] bytecodeOfClass = groovyClass.getBytes ();
        oneJar.putArchiveEntry (oneJarEntry);
        oneJar.write (bytecodeOfClass);
        oneJar.closeArchiveEntry ();
    }
    // End it up
    oneJar.finish ();
    oneJar.close ();
} catch (Exception e) {
    // Do something
}
// Append the JAR to the execution environment
sparkService.getSparkContext ().addJar (pathOfJar);
// GroovyShell.evaluate (scriptText, nameOfScript) below;
{noformat}
Any idea on how this can be improved? (e.g. not using the
[jira] [Created] (SPARK-15689) Data source API v2
Reynold Xin created SPARK-15689: --- Summary: Data source API v2 Key: SPARK-15689 URL: https://issues.apache.org/jira/browse/SPARK-15689 Project: Spark Issue Type: New Feature Components: SQL Reporter: Reynold Xin This ticket tracks progress in creating v2 of the data source API. This new API should focus on: 1. Having a small surface, so it is easy to freeze and maintain compatibility for a long time. Ideally, this API should survive architectural rewrites and user-facing API revamps of Spark. 2. Having a well-defined column batch interface for high performance. Convenience methods should exist to convert row-oriented formats into column batches for data source developers. 3. Still supporting filter push down, similar to the existing API. Note that both 1 and 2 are problems the current data source API (v1) suffers from. The current data source API has a wide surface and depends on DataFrame/SQLContext, making data source API compatibility dependent on the upper-level API. The current data source API is also row-oriented only and has to go through an expensive conversion from external to internal data types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
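The three goals above can be made concrete with a toy sketch. Everything below is invented for illustration (interface and method names included) and is not Spark's actual v2 API; it only shows what a small-surface, column-batch-oriented reader with filter push-down could look like:

```java
// Illustrative-only sketch of a "small surface" columnar source API; all names
// here are hypothetical and are not Spark's real data source v2 interfaces.
import java.util.*;

interface ColumnBatch {
    int numRows();
    int[] intColumn(int ordinal); // column-oriented access: one array per column
}

interface BatchReader {
    // Filter push-down: the source returns the predicates it could NOT handle.
    List<String> pushFilters(List<String> predicates);
    Iterator<ColumnBatch> readBatches();
}

public class DataSourceV2Sketch {
    // A toy in-memory source: one batch, one int column, pushes nothing down.
    static BatchReader inMemory(int[] col) {
        return new BatchReader() {
            public List<String> pushFilters(List<String> predicates) {
                return predicates; // all predicates left for the engine to evaluate
            }
            public Iterator<ColumnBatch> readBatches() {
                ColumnBatch batch = new ColumnBatch() {
                    public int numRows() { return col.length; }
                    public int[] intColumn(int ordinal) { return col; }
                };
                return Collections.singletonList(batch).iterator();
            }
        };
    }

    public static void main(String[] args) {
        BatchReader r = inMemory(new int[] {1, 2, 3});
        int sum = 0;
        for (Iterator<ColumnBatch> it = r.readBatches(); it.hasNext(); ) {
            for (int v : it.next().intColumn(0)) sum += v;
        }
        System.out.println(sum); // 6
    }
}
```

Because a source implements only these two tiny interfaces, with no dependency on DataFrame/SQLContext, such a contract could survive upper-level API rewrites, which is the point of goal 1.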
[jira] [Closed] (SPARK-14827) Spark SQL run on Hive table reports The specified datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH.
[ https://issues.apache.org/jira/browse/SPARK-14827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang closed SPARK-14827. -- Resolution: Won't Fix > Spark SQL run on Hive table reports The specified datastore driver > ("com.mysql.jdbc.Driver") was not found in the CLASSPATH. > > > Key: SPARK-14827 > URL: https://issues.apache.org/jira/browse/SPARK-14827 > Project: Spark > Issue Type: Question > Components: SQL, YARN >Affects Versions: 1.5.2 > Environment: Hadoop-YARN 2.6.0 Spark-1.5.2 Hive-2.0.0 >Reporter: Wei Chen > Labels: hive, mysql, sql > > I have set up a YARN cluster and Hive. Hive uses mysql as its metadata store. > Hive works fine with the MR engine. But when I ran Spark SQL on a table that was > already created, using the following command: > ./spark-sql --master yarn --database hivedb -f tpch_query1 > The error message is like: > WARN Connection: BoneCP specified but not present in CLASSPATH (or one of > dependencies) > Exception in thread "main" java.lang.RuntimeException: > java.lang.RuntimeException: Unable to instantiate > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient > at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522) > at > org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:171) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:408) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader.liftedTree1$1(IsolatedClientLoader.scala:183) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader.(IsolatedClientLoader.scala:179) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:226) > at > 
org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185) > at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:392) > at > org.apache.spark.sql.hive.HiveContext.defaultOverrides(HiveContext.scala:174) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:177) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.lang.RuntimeException: Unable to instantiate > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient > at > org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1523) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:86) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104) > at > org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005) > at 
org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024) > at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503) > ... 25 more > Caused by: java.lang.reflect.InvocationTargetException > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at
[jira] [Commented] (SPARK-14827) Spark SQL run on Hive table reports The specified datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH.
[ https://issues.apache.org/jira/browse/SPARK-14827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309304#comment-15309304 ] Jeff Zhang commented on SPARK-14827: It is not a bug; the exception makes clear that the JDBC driver jar is missing from the classpath. > Spark SQL run on Hive table reports The specified datastore driver > ("com.mysql.jdbc.Driver") was not found in the CLASSPATH. > > > Key: SPARK-14827 > URL: https://issues.apache.org/jira/browse/SPARK-14827 > Project: Spark > Issue Type: Question > Components: SQL, YARN >Affects Versions: 1.5.2 > Environment: Hadoop-YARN 2.6.0 Spark-1.5.2 Hive-2.0.0 >Reporter: Wei Chen > Labels: hive, mysql, sql > > I have set up a YARN cluster and Hive. Hive uses mysql as its metadata store. > Hive works fine with the MR engine. But when I ran Spark SQL on a table that was > already created, using the following command: > ./spark-sql --master yarn --database hivedb -f tpch_query1 > The error message is like: > WARN Connection: BoneCP specified but not present in CLASSPATH (or one of > dependencies) > Exception in thread "main" java.lang.RuntimeException: > java.lang.RuntimeException: Unable to instantiate > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient > at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522) > at > org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:171) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:408) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader.liftedTree1$1(IsolatedClientLoader.scala:183) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader.(IsolatedClientLoader.scala:179) > at > 
org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:226) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185) > at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:392) > at > org.apache.spark.sql.hive.HiveContext.defaultOverrides(HiveContext.scala:174) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:177) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.lang.RuntimeException: Unable to instantiate > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient > at > org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1523) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:86) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104) > at > 
org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005) > at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024) > at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503) > ... 25 more > Caused by: java.lang.reflect.InvocationTargetException > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at >
[jira] [Updated] (SPARK-15687) Columnar execution engine
[ https://issues.apache.org/jira/browse/SPARK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15687: Description: This ticket tracks progress in making the entire engine columnar, especially in the context of nested data type support. In Spark 2.0, we have used the internal column batch interface in Parquet reading (via a vectorized Parquet decoder) and low cardinality aggregation. Other parts of the engine are already using whole-stage code generation, which is in many ways more efficient than a columnar execution engine for flat data types. The goal here is to figure out a story to work towards making column batch the common data exchange format between operators outside whole-stage code generation, as well as with external systems (e.g. Pandas). Some of the important questions to answer are: From the architectural perspective: - What is the end state architecture? - Should aggregation be columnar? - Should sorting be columnar? - How do we encode nested data? What are the operations on nested data, and how do we handle these operations in a columnar format? - What is the transition plan towards the end state? From an external API perspective: - Can we expose a more efficient column batch user-defined function API? - How do we leverage this to integrate with 3rd party tools? - Can we have a spec for a fixed version of the column batch format that can be externalized and use that in data source API v2? was: This ticket tracks progress in making the entire engine columnar, especially in the context of nested data type support. In Spark 2.0, we have used the internal column batch interface in Parquet reading (via a vectorized Parquet decoder) and low cardinality aggregation. Other parts of the engine are already using whole-stage code generation, which is in many ways more efficient than a columnar execution engine for flat data types. 
The goal here is to figure out a story to work towards making column batch the common data exchange format between operators outside whole-stage code generation, as well as with external systems (e.g. Pandas). Some of the important questions to answer are: - What is the end state architecture? - Should aggregation be columnar? - Should sorting be columnar? - How do we encode nested data? What are the operations on nested data, and how do we handle these operations in a columnar format? - What is the transition plan towards the end state? > Columnar execution engine > - > > Key: SPARK-15687 > URL: https://issues.apache.org/jira/browse/SPARK-15687 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > This ticket tracks progress in making the entire engine columnar, especially > in the context of nested data type support. > In Spark 2.0, we have used the internal column batch interface in Parquet > reading (via a vectorized Parquet decoder) and low cardinality aggregation. > Other parts of the engine are already using whole-stage code generation, > which is in many ways more efficient than a columnar execution engine for > flat data types. > The goal here is to figure out a story to work towards making column batch > the common data exchange format between operators outside whole-stage code > generation, as well as with external systems (e.g. Pandas). > Some of the important questions to answer are: > From the architectural perspective: > - What is the end state architecture? > - Should aggregation be columnar? > - Should sorting be columnar? > - How do we encode nested data? What are the operations on nested data, and > how do we handle these operations in a columnar format? > - What is the transition plan towards the end state? > From an external API perspective: > - Can we expose a more efficient column batch user-defined function API? > - How do we leverage this to integrate with 3rd party tools? 
> - Can we have a spec for a fixed version of the column batch format that can > be externalized and use that in data source API v2? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
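One of the open questions above is how to encode nested data in a columnar format. A common answer (the one Arrow-style formats use) is an offsets-plus-values encoding; the sketch below is illustrative only and is not Spark's internal representation:

```java
import java.util.*;

// One common columnar encoding for nested data: a list-of-ints column stored as
// a flat values array plus an offsets array. Row i's list lives at
// values[offsets[i] .. offsets[i+1]). This is an illustration of the idea only.
public class NestedColumnSketch {
    final int[] offsets; // length = numRows + 1
    final int[] values;  // all list elements, concatenated

    NestedColumnSketch(int[] offsets, int[] values) {
        this.offsets = offsets;
        this.values = values;
    }

    static NestedColumnSketch fromLists(List<int[]> rows) {
        int[] offsets = new int[rows.size() + 1];
        int total = 0;
        for (int i = 0; i < rows.size(); i++) {
            total += rows.get(i).length;
            offsets[i + 1] = total;
        }
        int[] values = new int[total];
        int pos = 0;
        for (int[] row : rows) for (int v : row) values[pos++] = v;
        return new NestedColumnSketch(offsets, values);
    }

    int[] row(int i) { // materialize one row's list back out of the column
        return Arrays.copyOfRange(values, offsets[i], offsets[i + 1]);
    }

    public static void main(String[] args) {
        NestedColumnSketch col =
            fromLists(Arrays.asList(new int[] {1, 2}, new int[] {}, new int[] {3}));
        System.out.println(Arrays.toString(col.row(0))); // [1, 2]
        System.out.println(Arrays.toString(col.row(2))); // [3]
    }
}
```

Operations such as filters or explode can then run over the flat values array; the offsets array is what makes per-row list boundaries recoverable.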
[jira] [Commented] (SPARK-15688) RelationalGroupedDataset.toDF should not add group by expressions that are already added in the aggregate expressions.
[ https://issues.apache.org/jira/browse/SPARK-15688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309299#comment-15309299 ] Xiao Li commented on SPARK-15688: - Yeah, will ask my teammate to do it ASAP. Thanks! > RelationalGroupedDataset.toDF should not add group by expressions that are > already added in the aggregate expressions. > -- > > Key: SPARK-15688 > URL: https://issues.apache.org/jira/browse/SPARK-15688 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > For {{df.groupBy("col").agg($"col", count("*"))}}, it is kind of weird to > have col appearing twice in the result. Seems we can avoid outputting group-by > expressions twice if they are already part of {{agg}}. Looks like > RelationalGroupedDataset.toDF is the place to change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15687) Columnar execution engine
[ https://issues.apache.org/jira/browse/SPARK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15687: Description: This ticket tracks progress in making the entire engine columnar, especially in the context of nested data type support. In Spark 2.0, we have used the internal column batch interface in Parquet reading (via a vectorized Parquet decoder) and low cardinality aggregation. Other parts of the engine are already using whole-stage code generation, which is in many ways more efficient than a columnar execution engine for flat data types. The goal here is to figure out a story to work towards making column batch the common data exchange format between operators outside whole-stage code generation, as well as with external systems (e.g. Pandas). Some of the important questions to answer are: - What is the end state architecture? - Should aggregation be columnar? - Should sorting be columnar? - How do we encode nested data? What are the operations on nested data, and how do we handle these operations in a columnar format? - What is the transition plan towards the end state? was: This ticket tracks progress in making the entire engine columnar, especially in the context of nested data type support. In Spark 2.0, we have used the internal column batch interface in Parquet reading (via a vectorized Parquet decoder) and low cardinality aggregation. Other parts of the engine are already using whole-stage code generation, which is in many ways more efficient than a columnar execution engine for flat data types. The goal here is to figure out a story to work towards making column batch the common data exchange format between operators outside whole-stage code generation, as well as with external systems (e.g. Pandas). Some of the important questions to answer are: - What is the end state architecture? - Should aggregation be columnar? - Should sorting be columnar? - How do we handle nested data types? 
- What is the transition plan towards the end state? > Columnar execution engine > - > > Key: SPARK-15687 > URL: https://issues.apache.org/jira/browse/SPARK-15687 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > This ticket tracks progress in making the entire engine columnar, especially > in the context of nested data type support. > In Spark 2.0, we have used the internal column batch interface in Parquet > reading (via a vectorized Parquet decoder) and low cardinality aggregation. > Other parts of the engine are already using whole-stage code generation, > which is in many ways more efficient than a columnar execution engine for > flat data types. > The goal here is to figure out a story to work towards making column batch > the common data exchange format between operators outside whole-stage code > generation, as well as with external systems (e.g. Pandas). > Some of the important questions to answer are: > - What is the end state architecture? > - Should aggregation be columnar? > - Should sorting be columnar? > - How do we encode nested data? What are the operations on nested data, and > how do we handle these operations in a columnar format? > - What is the transition plan towards the end state? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15688) RelationalGroupedDataset.toDF should not add group by expressions that are already added in the aggregate expressions.
[ https://issues.apache.org/jira/browse/SPARK-15688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309293#comment-15309293 ] Yin Huai commented on SPARK-15688: -- [~smilegator] Can anyone from your side take this? > RelationalGroupedDataset.toDF should not add group by expressions that are > already added in the aggregate expressions. > -- > > Key: SPARK-15688 > URL: https://issues.apache.org/jira/browse/SPARK-15688 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > For {{df.groupBy("col").agg($"col", count("*"))}}, it is kind of weird to > have col appearing twice in the result. Seems we can avoid outputting group-by > expressions twice if they are already part of {{agg}}. Looks like > RelationalGroupedDataset.toDF is the place to change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15687) Columnar execution engine
[ https://issues.apache.org/jira/browse/SPARK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15687: Summary: Columnar execution engine (was: Fully columnar execution engine) > Columnar execution engine > - > > Key: SPARK-15687 > URL: https://issues.apache.org/jira/browse/SPARK-15687 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > This ticket tracks progress in making the entire engine columnar, especially > in the context of nested data type support. > In Spark 2.0, we have used the internal column batch interface in Parquet > reading (via a vectorized Parquet decoder) and low cardinality aggregation. > Other parts of the engine are already using whole-stage code generation, > which is in many ways more efficient than a columnar execution engine for > flat data types. > The goal here is to figure out a story to work towards making column batch > the common data exchange format between operators outside whole-stage code > generation, as well as with external systems (e.g. Pandas). > Some of the important questions to answer are: > - What is the end state architecture? > - Should aggregation be columnar? > - Should sorting be columnar? > - How do we handle nested data types? > - What is the transition plan towards the end state? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15687) Columnar execution engine
[ https://issues.apache.org/jira/browse/SPARK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15687: Priority: Critical (was: Major) > Columnar execution engine > - > > Key: SPARK-15687 > URL: https://issues.apache.org/jira/browse/SPARK-15687 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > This ticket tracks progress in making the entire engine columnar, especially > in the context of nested data type support. > In Spark 2.0, we have used the internal column batch interface in Parquet > reading (via a vectorized Parquet decoder) and low cardinality aggregation. > Other parts of the engine are already using whole-stage code generation, > which is in many ways more efficient than a columnar execution engine for > flat data types. > The goal here is to figure out a story to work towards making column batch > the common data exchange format between operators outside whole-stage code > generation, as well as with external systems (e.g. Pandas). > Some of the important questions to answer are: > - What is the end state architecture? > - Should aggregation be columnar? > - Should sorting be columnar? > - How do we handle nested data types? > - What is the transition plan towards the end state? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15688) RelationalGroupedDataset.toDF should not add group by expressions that are already added in the aggregate expressions.
Yin Huai created SPARK-15688: Summary: RelationalGroupedDataset.toDF should not add group by expressions that are already added in the aggregate expressions. Key: SPARK-15688 URL: https://issues.apache.org/jira/browse/SPARK-15688 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai For {{df.groupBy("col").agg($"col", count("*"))}}, it is kind of weird to have col appearing twice in the result. Seems we can avoid outputting group-by expressions twice if they are already part of {{agg}}. Looks like RelationalGroupedDataset.toDF is the place to change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
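The de-duplication rule this ticket asks for can be sketched as plain list logic (names below are illustrative; the real change would live in RelationalGroupedDataset.toDF and operate on expressions, not strings):

```java
import java.util.*;

// Sketch of the proposed rule: prepend only those group-by columns that the
// aggregate list does not already contain, so no column appears twice.
public class GroupByDedup {
    static List<String> outputColumns(List<String> groupBy, List<String> aggregates) {
        List<String> out = new ArrayList<>();
        for (String g : groupBy) {
            if (!aggregates.contains(g)) out.add(g); // skip duplicates of agg columns
        }
        out.addAll(aggregates);
        return out;
    }

    public static void main(String[] args) {
        // df.groupBy("col").agg($"col", count("*")): "col" should appear only once.
        System.out.println(outputColumns(Arrays.asList("col"), Arrays.asList("col", "count(1)")));
        // prints [col, count(1)]
    }
}
```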
[jira] [Updated] (SPARK-15687) Fully columnar execution engine
[ https://issues.apache.org/jira/browse/SPARK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15687: Summary: Fully columnar execution engine (was: Columnar execution engine) > Fully columnar execution engine > --- > > Key: SPARK-15687 > URL: https://issues.apache.org/jira/browse/SPARK-15687 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > This ticket tracks progress in making the entire engine columnar, especially > in the context of nested data type support. > In Spark 2.0, we have used the internal column batch interface in Parquet > reading (via a vectorized Parquet decoder) and low cardinality aggregation. > Other parts of the engine are already using whole-stage code generation, > which is in many ways more efficient than a columnar execution engine for > flat data types. > The goal here is to figure out a story to work towards making column batch > the common data exchange format between operators outside whole-stage code > generation, as well as with external systems (e.g. Pandas). > Some of the important questions to answer are: > - What is the end state architecture? > - Should aggregation be columnar? > - Should sorting be columnar? > - How do we handle nested data types? > - What is the transition plan towards the end state? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15687) Columnar execution engine
Reynold Xin created SPARK-15687: --- Summary: Columnar execution engine Key: SPARK-15687 URL: https://issues.apache.org/jira/browse/SPARK-15687 Project: Spark Issue Type: New Feature Components: SQL Reporter: Reynold Xin This ticket tracks progress in making the entire engine columnar, especially in the context of nested data type support. In Spark 2.0, we have used the internal column batch interface in Parquet reading (via a vectorized Parquet decoder) and low cardinality aggregation. Other parts of the engine are already using whole-stage code generation, which is in many ways more efficient than a columnar execution engine for flat data types. The goal here is to figure out a story to work towards making column batch the common data exchange format between operators outside whole-stage code generation, as well as with external systems (e.g. Pandas). Some of the important questions to answer are: - What is the end state architecture? - Should aggregation be columnar? - Should sorting be columnar? - How do we handle nested data types? - What is the transition plan towards the end state? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15685) StackOverflowError (VirtualMachineError) or NoClassDefFoundError (LinkageError) should not System.exit() in local mode
[ https://issues.apache.org/jira/browse/SPARK-15685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brett Randall updated SPARK-15685: -- Description: Spark, when running in local mode, can encounter certain types of {{Error}} exceptions in developer-code or third-party libraries and call {{System.exit()}}, potentially killing a long-running JVM/service. The caller should decide on the exception-handling and whether the error should be deemed fatal. *Consider this scenario:* * Spark is being used in local master mode within a long-running JVM microservice, e.g. a Jetty instance. * A task is run. The task errors with particular types of unchecked throwables: ** a) there is some bad code and/or bad data that exposes a bug where there's unterminated recursion, leading to a {{StackOverflowError}}, or ** b) a particular, rarely used function is called - there's a packaging error with the service or a third-party library is missing some dependencies, and a {{NoClassDefFoundError}} is thrown. *Expected behaviour:* Since we are running in local mode, we might expect some unchecked exception to be thrown, to be optionally-handled by the Spark caller. In the case of Jetty, a request thread or some other background worker thread might handle the exception or not, the thread might exit or note an error. The caller should decide how the error is handled. *Actual behaviour:* {{System.exit()}} is called, the JVM exits and the Jetty microservice is down and must be restarted. *Consequence:* Any local code or third-party library might cause Spark to exit the long-running JVM/microservice, so Spark can be a problem in this architecture. I have seen this now on three separate occasions, leading to service-down bug reports. 
*Analysis:* The line of code that seems to be the problem is: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L405 {code} // Don't forcibly exit unless the exception was inherently fatal, to avoid // stopping other tasks unnecessarily. if (Utils.isFatalError(t)) { SparkUncaughtExceptionHandler.uncaughtException(t) } {code} [Utils.isFatalError()|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L1818] first excludes Scala [NonFatal|https://github.com/scala/scala/blob/2.12.x/src/library/scala/util/control/NonFatal.scala#L31], which excludes everything except {{VirtualMachineError}}, {{ThreadDeath}}, {{InterruptedException}}, {{LinkageError}} and {{ControlThrowable}}. {{Utils.isFatalError()}} further excludes {{InterruptedException}}, {{NotImplementedError}} and {{ControlThrowable}}. Remaining are {{Error}}s such as {{StackOverflowError extends VirtualMachineError}} or {{NoClassDefFoundError extends LinkageError}}, which occur in the aforementioned scenarios. {{SparkUncaughtExceptionHandler.uncaughtException()}} proceeds to call {{System.exit()}}. [Further up|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L77] in {{Executor}} we see exclusions for registering {{SparkUncaughtExceptionHandler}} if in local mode: {code} if (!isLocal) { // Setup an uncaught exception handler for non-local mode. // Make any thread terminations due to uncaught exceptions kill the entire // executor process to avoid surprising stalls. Thread.setDefaultUncaughtExceptionHandler(SparkUncaughtExceptionHandler) } {code} This same exclusion must be applied in local mode for "fatal" errors - we cannot afford to shut down the enclosing JVM (e.g. Jetty); the caller should decide. A minimal test-case is supplied. 
It installs a logging {{SecurityManager}} to confirm that {{System.exit()}} was called from {{SparkUncaughtExceptionHandler.uncaughtException}} via {{Executor}}. It also hints at the workaround - install your own {{SecurityManager}} and inspect the current stack in {{checkExit()}} to prevent Spark from exiting the JVM. Test-case: https://github.com/javabrett/SPARK-15685 . was: Spark, when running in local mode, can encounter certain types of {{Error}} exceptions in developer-code or third-party libraries and call {{System.exit()}}, potentially killing a long-running JVM/service. The caller should decide on the exception-handling and whether the error should be deemed fatal. *Consider this scenario:* * Spark is being used in local master mode within a long-running JVM microservice, e.g. a Jetty instance. * A task is run. The task errors with particular types of unchecked throwables: ** a) there some bad code and/or bad data that exposes a bug where there's unterminated recursion, leading to a {{StackOverflowError}}, or ** b) a particular not-often used function is called - there's a packaging error with the service, a third-party library is missing some dependencies, a {{NoClassDefFoundError}} is found. *Expected
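The workaround hinted at above - install your own {{SecurityManager}} and inspect the current stack in {{checkExit()}} - can be sketched as follows. This is an illustrative guard for a Java 8-era service JVM, not Spark code; the class name is hypothetical:

```scala
// Illustrative guard: veto System.exit() when the call originates from
// Spark's uncaught-exception handler, instead of letting it kill the
// enclosing long-running JVM (e.g. a Jetty microservice).
class SparkExitGuard extends SecurityManager {
  override def checkExit(status: Int): Unit = {
    val fromSparkHandler = Thread.currentThread().getStackTrace.exists(
      _.getClassName.contains("SparkUncaughtExceptionHandler"))
    if (fromSparkHandler)
      throw new SecurityException(s"blocked System.exit($status) from Spark")
  }
  // Permit everything else; the default SecurityManager semantics would deny.
  override def checkPermission(perm: java.security.Permission): Unit = ()
}

// Installed once at service startup, e.g.:
// System.setSecurityManager(new SparkExitGuard)
```

The thrown {{SecurityException}} then surfaces on the offending thread like any other unchecked exception, leaving the caller to decide how to handle it.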
[jira] [Commented] (SPARK-15530) Partitioning discovery logic HadoopFsRelation should use a higher setting of parallelism
[ https://issues.apache.org/jira/browse/SPARK-15530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309261#comment-15309261 ] Takeshi Yamamuro commented on SPARK-15530: -- Oh, I see. Okay, then I'll check the logs and make pr later. > Partitioning discovery logic HadoopFsRelation should use a higher setting of > parallelism > > > Key: SPARK-15530 > URL: https://issues.apache.org/jira/browse/SPARK-15530 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > At > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala#L418, > we launch a spark job to do parallel file listing in order to discover > partitions. However, we do not set the number of partitions at here, which > means that we are using the default parallelism of the cluster. It is better > to set the number of partitions explicitly to generate smaller tasks, which > help load balancing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15032) When we create a new JDBC session, we may need to create a new session of executionHive
[ https://issues.apache.org/jira/browse/SPARK-15032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309253#comment-15309253 ] Yin Huai commented on SPARK-15032: -- I did a quick check today. It seems our current master is fine. When we open a JDBC session, we indeed create the session state correctly. I am closing this JIRA. > When we create a new JDBC session, we may need to create a new session of > executionHive > --- > > Key: SPARK-15032 > URL: https://issues.apache.org/jira/browse/SPARK-15032 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Critical > > Right now, we only use executionHive in thriftserver. When we create a new > jdbc session, we probably need to create a new session of executionHive. I am > not sure what will break if we leave the code as is. But, I feel it will be > safer to create a new session of executionHive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15032) When we create a new JDBC session, we may need to create a new session of executionHive
[ https://issues.apache.org/jira/browse/SPARK-15032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-15032. -- Resolution: Not A Problem > When we create a new JDBC session, we may need to create a new session of > executionHive > --- > > Key: SPARK-15032 > URL: https://issues.apache.org/jira/browse/SPARK-15032 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Critical > > Right now, we only use executionHive in thriftserver. When we create a new > jdbc session, we probably need to create a new session of executionHive. I am > not sure what will break if we leave the code as is. But, I feel it will be > safer to create a new session of executionHive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15530) Partitioning discovery logic HadoopFsRelation should use a higher setting of parallelism
[ https://issues.apache.org/jira/browse/SPARK-15530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309250#comment-15309250 ] Yin Huai commented on SPARK-15530: -- The doc of that conf says "The degree of parallelism for schema merging and partition discovery of Parquet data sources.". So seems it just means parallelism. I am not sure why the variable name has a "threshold" at the end. It will be great if you can check when we added that conf. Right now, I am inclined to use that conf to configure the parallelism (we need to make that conf an internal conf though). > Partitioning discovery logic HadoopFsRelation should use a higher setting of > parallelism > > > Key: SPARK-15530 > URL: https://issues.apache.org/jira/browse/SPARK-15530 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > At > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala#L418, > we launch a spark job to do parallel file listing in order to discover > partitions. However, we do not set the number of partitions at here, which > means that we are using the default parallelism of the cluster. It is better > to set the number of partitions explicitly to generate smaller tasks, which > help load balancing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15672) R programming guide update
[ https://issues.apache.org/jira/browse/SPARK-15672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309249#comment-15309249 ] Shivaram Venkataraman commented on SPARK-15672: --- Thanks [~GayathriMurali] - This PR can handle changes other than ML to the programming guide > R programming guide update > -- > > Key: SPARK-15672 > URL: https://issues.apache.org/jira/browse/SPARK-15672 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman >Priority: Blocker > > Update the programming guide (i.e. the document at > http://spark.apache.org/docs/latest/sparkr.html) to cover the major new > features in Spark 2.0. This will include > (a) UDFs with dapply, dapplyCollect > (b) group UDFs with gapply > (c) spark.lapply for running parallel R functions > (d) others ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15530) Partitioning discovery logic HadoopFsRelation should use a higher setting of parallelism
[ https://issues.apache.org/jira/browse/SPARK-15530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309246#comment-15309246 ] Takeshi Yamamuro commented on SPARK-15530: -- You suggest we should rename the value and change the default value? > Partitioning discovery logic HadoopFsRelation should use a higher setting of > parallelism > > > Key: SPARK-15530 > URL: https://issues.apache.org/jira/browse/SPARK-15530 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > At > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala#L418, > we launch a spark job to do parallel file listing in order to discover > partitions. However, we do not set the number of partitions at here, which > means that we are using the default parallelism of the cluster. It is better > to set the number of partitions explicitly to generate smaller tasks, which > help load balancing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15530) Partitioning discovery logic HadoopFsRelation should use a higher setting of parallelism
[ https://issues.apache.org/jira/browse/SPARK-15530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309240#comment-15309240 ] Yin Huai commented on SPARK-15530: -- Your change looks reasonable. How about we just take the value of sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold? We can change the doc of this conf and increase the default value. > Partitioning discovery logic HadoopFsRelation should use a higher setting of > parallelism > > > Key: SPARK-15530 > URL: https://issues.apache.org/jira/browse/SPARK-15530 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > At > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala#L418, > we launch a spark job to do parallel file listing in order to discover > partitions. However, we do not set the number of partitions at here, which > means that we are using the default parallelism of the cluster. It is better > to set the number of partitions explicitly to generate smaller tasks, which > help load balancing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
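The direction discussed in this thread - sizing the listing job explicitly rather than inheriting the cluster's default parallelism - might look like the following sketch (local session; the path list and the cap of 32 are illustrative stand-ins, not the actual fix):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("par-listing").getOrCreate()

// Stand-in for the leaf directories discovered under a partitioned table.
val paths = (1 to 1000).map(i => s"/data/year=2016/part=$i")

// Choose the slice count explicitly so tasks stay small and balance well,
// instead of relying on sparkContext.defaultParallelism.
val numSlices = math.min(paths.length, 32)
val listed = spark.sparkContext.parallelize(paths, numSlices)
```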
[jira] [Commented] (SPARK-15683) spark sql local FS spark.sql.warehouse.dir throws on YARN
[ https://issues.apache.org/jira/browse/SPARK-15683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309139#comment-15309139 ] Saisai Shao commented on SPARK-15683: - Hi [~tgraves], please see this JIRA SPARK-15659, I submitted a patch about it. > spark sql local FS spark.sql.warehouse.dir throws on YARN > - > > Key: SPARK-15683 > URL: https://issues.apache.org/jira/browse/SPARK-15683 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Thomas Graves >Priority: Critical > > I'm trying to use dataframes with spark 2.0. It was built with hive but when > I try to run a dataframe command I get the error: > 16/05/31 20:24:21 ERROR ApplicationMaster: User class threw exception: > java.lang.IllegalArgumentException: Wrong FS: > file:/grid/2/tmp/yarn-local/usercache/tgraves/appcache/application_1464289177693_1036410/container_e14_1464289177693_1036410_01_01/spark-warehouse, > expected: hdfs://nn1.com:8020 > java.lang.IllegalArgumentException: Wrong FS: > file:/grid/2/tmp/yarn-local/usercache/tgraves/appcache/application_1464289177693_1036410/container_e14_1464289177693_1036410_01_01/spark-warehouse, > expected: hdfs://nn1.com:8020 > at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:648) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:194) > at > org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:106) > at > org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1047) > at > org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1043) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:1061) > at > org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:1036) > at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1880) > 
at > org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.liftedTree1$1(InMemoryCatalog.scala:123) > at > org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.createDatabase(InMemoryCatalog.scala:122) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:142) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.(SessionCatalog.scala:84) > at > org.apache.spark.sql.internal.SessionState.catalog$lzycompute(SessionState.scala:94) > at > org.apache.spark.sql.internal.SessionState.catalog(SessionState.scala:94) > at > org.apache.spark.sql.internal.SessionState$$anon$1.(SessionState.scala:110) > at > org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:110) > at > org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:109) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62) > at > org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:371) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:154) > at > org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:419) > at yahoo.spark.SparkFlickrLargeJoin$.main(SparkFlickrLargeJoin.scala:26) > at yahoo.spark.SparkFlickrLargeJoin.main(SparkFlickrLargeJoin.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:617) > It seems https://issues.apache.org/jira/browse/SPARK-15565 change it to have > default local fs. Even before that it didn't work either, just different > error -> https://issues.apache.org/jira/browse/SPARK-15034. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
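Until the default is sorted out, one possible workaround (an assumption, not a confirmed fix) is to point {{spark.sql.warehouse.dir}} at a fully-qualified HDFS URI when building the session, so the local-FS default path never reaches the HDFS {{FileSystem.checkPath}} call; the URI below is illustrative:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical workaround: give spark.sql.warehouse.dir a qualified HDFS URI
// up front so InMemoryCatalog.createDatabase is not handed a file:/ path.
val spark = SparkSession.builder()
  .appName("warehouse-on-hdfs")
  .config("spark.sql.warehouse.dir", "hdfs://nn1.com:8020/user/tgraves/spark-warehouse")
  .getOrCreate()
```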
[jira] [Updated] (SPARK-12988) Can't drop columns that contain dots
[ https://issues.apache.org/jira/browse/SPARK-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-12988: - Assignee: Sean Zhong > Can't drop columns that contain dots > > > Key: SPARK-12988 > URL: https://issues.apache.org/jira/browse/SPARK-12988 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Michael Armbrust >Assignee: Sean Zhong > Fix For: 2.0.0 > > > Neither of these works: > {code} > val df = Seq((1, 1)).toDF("a_b", "a.c") > df.drop("a.c").collect() > df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int] > {code} > {code} > val df = Seq((1, 1)).toDF("a_b", "a.c") > df.drop("`a.c`").collect() > df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int] > {code} > Given that you can't use drop to drop subfields, it seems to me that we > should treat the column name literally (i.e. as though it is wrapped in back > ticks). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12988) Can't drop columns that contain dots
[ https://issues.apache.org/jira/browse/SPARK-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-12988. -- Resolution: Fixed Fix Version/s: 2.0.0 > Can't drop columns that contain dots > > > Key: SPARK-12988 > URL: https://issues.apache.org/jira/browse/SPARK-12988 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Michael Armbrust >Assignee: Sean Zhong > Fix For: 2.0.0 > > > Neither of these works: > {code} > val df = Seq((1, 1)).toDF("a_b", "a.c") > df.drop("a.c").collect() > df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int] > {code} > {code} > val df = Seq((1, 1)).toDF("a_b", "a.c") > df.drop("`a.c`").collect() > df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int] > {code} > Given that you can't use drop to drop subfields, it seems to me that we > should treat the column name literally (i.e. as though it is wrapped in back > ticks). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12988) Can't drop columns that contain dots
[ https://issues.apache.org/jira/browse/SPARK-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309135#comment-15309135 ] Yin Huai commented on SPARK-12988: -- This issue has been resolved by https://github.com/apache/spark/pull/13306. > Can't drop columns that contain dots > > > Key: SPARK-12988 > URL: https://issues.apache.org/jira/browse/SPARK-12988 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Michael Armbrust >Assignee: Sean Zhong > Fix For: 2.0.0 > > > Neither of these works: > {code} > val df = Seq((1, 1)).toDF("a_b", "a.c") > df.drop("a.c").collect() > df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int] > {code} > {code} > val df = Seq((1, 1)).toDF("a_b", "a.c") > df.drop("`a.c`").collect() > df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int] > {code} > Given that you can't use drop to drop subfields, it seems to me that we > should treat the column name literally (i.e. as though it is wrapped in back > ticks). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
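On versions where {{drop}} still misparses dotted names, one workaround consistent with the backtick suggestion in the report is to re-select every other column, quoting each name so the analyzer does not parse "a.c" as a struct-field access (sketch against a local session):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("drop-dots").getOrCreate()
import spark.implicits._

val df = Seq((1, 1)).toDF("a_b", "a.c")
// Select every column except the dotted one, wrapping each name in backticks
// so `a.c` is treated as a single literal column name.
val dropped = df.select(df.columns.filter(_ != "a.c").map(c => col(s"`$c`")): _*)
```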
[jira] [Commented] (SPARK-14381) Review spark.ml parity for feature transformers
[ https://issues.apache.org/jira/browse/SPARK-14381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309112#comment-15309112 ] Gayathri Murali commented on SPARK-14381: - I will work on this > Review spark.ml parity for feature transformers > --- > > Key: SPARK-14381 > URL: https://issues.apache.org/jira/browse/SPARK-14381 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all > functionality. List all missing items. > This only covers Scala since we can compare Scala vs. Python in spark.ml > itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15672) R programming guide update
[ https://issues.apache.org/jira/browse/SPARK-15672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309100#comment-15309100 ] Gayathri Murali commented on SPARK-15672: - [~shivaram] I am working on changing R documentation to include all changes that happened with ML. Here is the link to the PR : https://github.com/apache/spark/pull/13285 > R programming guide update > -- > > Key: SPARK-15672 > URL: https://issues.apache.org/jira/browse/SPARK-15672 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman >Priority: Blocker > > Update the programming guide (i.e. the document at > http://spark.apache.org/docs/latest/sparkr.html) to cover the major new > features in Spark 2.0. This will include > (a) UDFs with dapply, dapplyCollect > (b) group UDFs with gapply > (c) spark.lapply for running parallel R functions > (d) others ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15686) Move user-facing structured streaming classes into sql.streaming
[ https://issues.apache.org/jira/browse/SPARK-15686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309075#comment-15309075 ] Apache Spark commented on SPARK-15686: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/13429 > Move user-facing structured streaming classes into sql.streaming > > > Key: SPARK-15686 > URL: https://issues.apache.org/jira/browse/SPARK-15686 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15686) Move user-facing structured streaming classes into sql.streaming
[ https://issues.apache.org/jira/browse/SPARK-15686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15686: Assignee: Reynold Xin (was: Apache Spark) > Move user-facing structured streaming classes into sql.streaming > > > Key: SPARK-15686 > URL: https://issues.apache.org/jira/browse/SPARK-15686 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15686) Move user-facing structured streaming classes into sql.streaming
[ https://issues.apache.org/jira/browse/SPARK-15686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15686: Assignee: Apache Spark (was: Reynold Xin) > Move user-facing structured streaming classes into sql.streaming > > > Key: SPARK-15686 > URL: https://issues.apache.org/jira/browse/SPARK-15686 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15686) Move user-facing structured streaming classes into sql.streaming
Reynold Xin created SPARK-15686: --- Summary: Move user-facing structured streaming classes into sql.streaming Key: SPARK-15686 URL: https://issues.apache.org/jira/browse/SPARK-15686 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin
[jira] [Created] (SPARK-15685) StackOverflowError (VirtualMachineError) or NoClassDefFoundError (LinkageError) should not System.exit() in local mode
Brett Randall created SPARK-15685: - Summary: StackOverflowError (VirtualMachineError) or NoClassDefFoundError (LinkageError) should not System.exit() in local mode Key: SPARK-15685 URL: https://issues.apache.org/jira/browse/SPARK-15685 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.6.1 Reporter: Brett Randall Priority: Critical Spark, when running in local mode, can encounter certain types of {{Error}} exceptions in developer code or third-party libraries and call {{System.exit()}}, potentially killing a long-running JVM/service. The caller should decide on the exception handling and whether the error should be deemed fatal. *Consider this scenario:* * Spark is being used in local master mode within a long-running JVM microservice, e.g. a Jetty instance. * A task is run. The task fails with particular types of unchecked throwables: ** a) there is some bad code and/or bad data that exposes a bug with unterminated recursion, leading to a {{StackOverflowError}}, or ** b) a rarely used function is called - there is a packaging error with the service or a third-party library is missing some dependencies, and a {{NoClassDefFoundError}} is thrown. *Expected behaviour:* Since we are running in local mode, we might expect an unchecked exception to be thrown and optionally handled by the Spark caller. In the case of Jetty, a request thread or some other background worker thread might handle the exception or not; the thread might exit or note an error. The caller should decide how the error is handled. *Actual behaviour:* {{System.exit()}} is called, the JVM exits, and the Jetty microservice is down and must be restarted. *Consequence:* Any local code or third-party library might cause Spark to exit the long-running JVM/microservice, so Spark can be a problem in this architecture. I have seen this on three separate occasions now, leading to service-down bug reports.
*Analysis:* The line of code that seems to be the problem is: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L405 {code}
// Don't forcibly exit unless the exception was inherently fatal, to avoid
// stopping other tasks unnecessarily.
if (Utils.isFatalError(t)) {
  SparkUncaughtExceptionHandler.uncaughtException(t)
}
{code} [Utils.isFatalError()|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L1818] first defers to Scala [NonFatal|https://github.com/scala/scala/blob/2.12.x/src/library/scala/util/control/NonFatal.scala#L31], which treats everything as non-fatal except {{VirtualMachineError}}, {{ThreadDeath}}, {{InterruptedException}}, {{LinkageError}} and {{ControlThrowable}}. {{Utils.isFatalError()}} further excludes {{InterruptedException}}, {{NotImplementedError}} and {{ControlThrowable}}. That leaves {{Error}}s such as {{StackOverflowError extends VirtualMachineError}} or {{NoClassDefFoundError extends LinkageError}}, which occur in the aforementioned scenarios. {{SparkUncaughtExceptionHandler.uncaughtException()}} proceeds to call {{System.exit()}}. [Further up|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L77] in {{Executor}} we see that registering {{SparkUncaughtExceptionHandler}} is already skipped in local mode: {code}
if (!isLocal) {
  // Setup an uncaught exception handler for non-local mode.
  // Make any thread terminations due to uncaught exceptions kill the entire
  // executor process to avoid surprising stalls.
  Thread.setDefaultUncaughtExceptionHandler(SparkUncaughtExceptionHandler)
}
{code} The same exclusion must be applied in local mode for "fatal" errors - we cannot afford to shut down the enclosing JVM (e.g. Jetty); the caller should decide. A minimal test case is supplied.
It installs a logging {{SecurityManager}} to confirm that {{System.exit()}} was called from {{SparkUncaughtExceptionHandler.uncaughtException}} via {{Executor}}. It also hints at the workaround - install your own {{SecurityManager}} and inspect the current stack in {{checkExit()}} to prevent Spark from exiting the JVM.
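The fatal/non-fatal classification described in the analysis above can be sketched as follows. This is an illustrative Python model using stand-in exception classes that mirror the JVM throwable hierarchy; the real logic lives in Scala's {{NonFatal}} and Spark's {{Utils.isFatalError}}, and every class name below is a stand-in, not a real Python type.

```python
# Stand-in classes mirroring the JVM throwable hierarchy discussed above.
class VirtualMachineError(Exception): pass
class StackOverflowError(VirtualMachineError): pass
class LinkageError(Exception): pass
class NoClassDefFoundError(LinkageError): pass
class InterruptedException(Exception): pass
class ControlThrowable(Exception): pass
class NotImplementedError_(Exception): pass  # underscore avoids shadowing the builtin

def is_fatal_error(t: Exception) -> bool:
    """Sketch of Utils.isFatalError: a throwable is fatal if NonFatal does
    not match it, minus a few extra exclusions. NonFatal matches everything
    except VirtualMachineError, ThreadDeath, InterruptedException,
    LinkageError and ControlThrowable."""
    non_fatal = not isinstance(t, (VirtualMachineError, LinkageError,
                                   InterruptedException, ControlThrowable))
    if non_fatal:
        return False
    # Utils.isFatalError additionally excludes these:
    return not isinstance(t, (InterruptedException, NotImplementedError_,
                              ControlThrowable))

# Both scenarios from the bug report are classified fatal, which is what
# routes them to SparkUncaughtExceptionHandler and its System.exit() call:
print(is_fatal_error(StackOverflowError()))    # True
print(is_fatal_error(NoClassDefFoundError()))  # True
print(is_fatal_error(InterruptedException()))  # False
```

Under this model, the proposed fix is simply to skip the exit path when running in local mode, regardless of how the throwable is classified.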
[jira] [Updated] (SPARK-15681) Allow case-insensitiveness in sc.setLogLevel
[ https://issues.apache.org/jira/browse/SPARK-15681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Wu updated SPARK-15681: --- Description: Currently the SparkContext API setLogLevel(level: String) cannot handle lower-case or mixed-case input strings, but org.apache.log4j.Level.toLevel can take lowercase or mixed case. was: Currently the SparkContext API setLogLevel(level: String) cannot handle lower-case or mixed-case input strings, but org.apache.log4j.Level.toLevel can take lowercase or mixed case. Also, a resetLogLevel to restore the original configuration could help users switch log levels for different diagnostic purposes. > Allow case-insensitiveness in sc.setLogLevel > > > Key: SPARK-15681 > URL: https://issues.apache.org/jira/browse/SPARK-15681 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Xin Wu >Priority: Minor > > Currently the SparkContext API setLogLevel(level: String) cannot handle > lower-case or mixed-case input strings, but org.apache.log4j.Level.toLevel > can take lowercase or mixed case.
[jira] [Updated] (SPARK-15681) Allow case-insensitiveness in sc.setLogLevel
[ https://issues.apache.org/jira/browse/SPARK-15681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Wu updated SPARK-15681: --- Summary: Allow case-insensitiveness in sc.setLogLevel (was: Allow case-insensitiveness in sc.setLogLevel and support sc.resetLogLevel) > Allow case-insensitiveness in sc.setLogLevel > > > Key: SPARK-15681 > URL: https://issues.apache.org/jira/browse/SPARK-15681 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Xin Wu >Priority: Minor > > Currently the SparkContext API setLogLevel(level: String) cannot handle > lower-case or mixed-case input strings, but org.apache.log4j.Level.toLevel > can take lowercase or mixed case. > Also, a resetLogLevel to restore the original configuration could help > users switch log levels for different diagnostic purposes.
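The improvement requested above amounts to normalizing the level string before validating it. A minimal sketch, assuming the standard log4j level names; the helper name `normalize_log_level` is hypothetical, not Spark's actual code:

```python
# Case-insensitive log-level validation, analogous to what sc.setLogLevel
# needs before delegating to log4j's Level.toLevel.
VALID_LEVELS = {"ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE", "WARN"}

def normalize_log_level(level: str) -> str:
    upper = level.upper()   # accept "warn", "Warn" and "WARN" alike
    if upper not in VALID_LEVELS:
        raise ValueError(
            f"invalid log level: {level} (expected one of {sorted(VALID_LEVELS)})")
    return upper

print(normalize_log_level("warn"))  # WARN
```

Normalizing first keeps the error message for genuinely invalid input while making valid levels case-insensitive.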
[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309033#comment-15309033 ] Joseph K. Bradley commented on SPARK-15581: --- The roadmap links to [SPARK-4591], which contains it. That task is also listed as Critical and targeted at 2.1, so it should be prioritized. > MLlib 2.1 Roadmap > - > > Key: SPARK-15581 > URL: https://issues.apache.org/jira/browse/SPARK-15581 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > Labels: roadmap > > This is a master list for MLlib improvements we are working on for the next > release. Please view this as a wish list rather than a definite plan, for we > don't have an accurate estimate of available resources. Due to limited review > bandwidth, features appearing on this list will get higher priority during > code review. But feel free to suggest new items to the list in comments. We > are experimenting with this process. Your feedback would be greatly > appreciated. > h1. Instructions > h2. For contributors: > * Please read > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > carefully. Code style, documentation, and unit tests are important. > * If you are a first-time Spark contributor, please always start with a > [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather > than a medium/big feature. Based on our experience, mixing the development > process with a big feature usually causes long delay in code review. > * Never work silently. Let everyone know on the corresponding JIRA page when > you start working on some features. This is to avoid duplicate work. For > small features, you don't need to wait to get JIRA assigned. > * For medium/big features or features with dependencies, please get assigned > first before coding and keep the ETA updated on the JIRA. 
If there exist no > activity on the JIRA page for a certain amount of time, the JIRA should be > released for other contributors. > * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one > after another. > * Remember to add the `@Since("VERSION")` annotation to new public APIs. > * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code > review greatly helps to improve others' code as well as yours. > h2. For committers: > * Try to break down big features into small and specific JIRA tasks and link > them properly. > * Add a "starter" label to starter tasks. > * Put a rough estimate for medium/big features and track the progress. > * If you start reviewing a PR, please add yourself to the Shepherd field on > JIRA. > * If the code looks good to you, please comment "LGTM". For non-trivial PRs, > please ping a maintainer to make a final pass. > * After merging a PR, create and link JIRAs for Python, example code, and > documentation if applicable. > h1. Roadmap (*WIP*) > This is NOT [a complete list of MLlib JIRAs for 2.1| > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority]. > We only include umbrella JIRAs and high-level tasks. > Major efforts in this release: > * Feature parity for the DataFrames-based API (`spark.ml`), relative to the > RDD-based API > * ML persistence > * Python API feature parity and test coverage > * R API expansion and improvements > * Note about new features: As usual, we expect to expand the feature set of > MLlib. However, we will prioritize API parity, bug fixes, and improvements > over new features. > Note `spark.mllib` is in maintenance mode now. 
We will accept bug fixes for > it, but new features, APIs, and improvements will only be added to `spark.ml`. > h2. Critical feature parity in DataFrame-based API > * Umbrella JIRA: [SPARK-4591] > h2. Persistence > * Complete persistence within MLlib > ** Python tuning (SPARK-13786) > * MLlib in R format: compatibility with other languages (SPARK-15572) > * Impose backwards compatibility for persistence (SPARK-15573) > h2. Python API > * Standardize unit tests for Scala and Python to improve and consolidate test > coverage for Params, persistence, and other common functionality (SPARK-15571) > * Improve Python API handling of Params, persistence (SPARK-14771) > (SPARK-14706) > ** Note: The linked JIRAs for this are incomplete. More to be created... > ** Related: Implement Python meta-algorithms in Scala (to simplify > persistence) (SPARK-15574) > * Feature parity: The main goal of the Python API is to have feature parity
[jira] [Updated] (SPARK-15581) MLlib 2.1 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-15581: -- Description: This is a master list for MLlib improvements we are working on for the next release. Please view this as a wish list rather than a definite plan, for we don't have an accurate estimate of available resources. Due to limited review bandwidth, features appearing on this list will get higher priority during code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important. * If you are a first-time Spark contributor, please always start with a [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delay in code review. * Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. For small features, you don't need to wait to get JIRA assigned. * For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there exist no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors. * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another. * Remember to add the `@Since("VERSION")` annotation to new public APIs. * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps to improve others' code as well as yours. h2. 
For committers: * Try to break down big features into small and specific JIRA tasks and link them properly. * Add a "starter" label to starter tasks. * Put a rough estimate for medium/big features and track the progress. * If you start reviewing a PR, please add yourself to the Shepherd field on JIRA. * If the code looks good to you, please comment "LGTM". For non-trivial PRs, please ping a maintainer to make a final pass. * After merging a PR, create and link JIRAs for Python, example code, and documentation if applicable. h1. Roadmap (*WIP*) This is NOT [a complete list of MLlib JIRAs for 2.1| https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority]. We only include umbrella JIRAs and high-level tasks. Major efforts in this release: * Feature parity for the DataFrames-based API (`spark.ml`), relative to the RDD-based API * ML persistence * Python API feature parity and test coverage * R API expansion and improvements * Note about new features: As usual, we expect to expand the feature set of MLlib. However, we will prioritize API parity, bug fixes, and improvements over new features. Note `spark.mllib` is in maintenance mode now. We will accept bug fixes for it, but new features, APIs, and improvements will only be added to `spark.ml`. h2. Critical feature parity in DataFrame-based API * Umbrella JIRA: [SPARK-4591] h2. Persistence * Complete persistence within MLlib ** Python tuning (SPARK-13786) * MLlib in R format: compatibility with other languages (SPARK-15572) * Impose backwards compatibility for persistence (SPARK-15573) h2. 
Python API * Standardize unit tests for Scala and Python to improve and consolidate test coverage for Params, persistence, and other common functionality (SPARK-15571) * Improve Python API handling of Params, persistence (SPARK-14771) (SPARK-14706) ** Note: The linked JIRAs for this are incomplete. More to be created... ** Related: Implement Python meta-algorithms in Scala (to simplify persistence) (SPARK-15574) * Feature parity: The main goal of the Python API is to have feature parity with the Scala/Java API. You can find a [complete list here| https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20"In%20Progress"%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20component%20in%20(PySpark)%20AND%20"Target%20Version%2Fs"%20%3D%202.1.0%20ORDER%20BY%20priority%20DESC]. The tasks fall into two major categories: ** Python API for missing methods (SPARK-14813) ** Python API for new algorithms. Committers should create a JIRA for the Python API after merging a public feature in Scala/Java. h2. SparkR * Improve R formula support and implementation (SPARK-15540) *
[jira] [Commented] (SPARK-12347) Write script to run all MLlib examples for testing
[ https://issues.apache.org/jira/browse/SPARK-12347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309031#comment-15309031 ] Joseph K. Bradley commented on SPARK-12347: --- [~junzheng] Feel free to work on it; sorry for the slow response. I'd recommend pinging [~holdenk] on [SPARK-12428] to sync and split up the work. I just targeted this for 2.1 since [SPARK-15605] uncovered a broken example. Having tests would make it much easier to find the culprit PR! > Write script to run all MLlib examples for testing > -- > > Key: SPARK-12347 > URL: https://issues.apache.org/jira/browse/SPARK-12347 > Project: Spark > Issue Type: Test > Components: ML, MLlib, PySpark, SparkR, Tests >Reporter: Joseph K. Bradley >Priority: Critical > > It would facilitate testing to have a script which runs all MLlib examples > for all languages. > Design sketch to ensure all examples are run: > * Generate a list of examples to run programmatically (not from a fixed list). > * Use a list of special examples to handle examples which require command > line arguments. > * Make sure data, etc. used are small to keep the tests quick. > This could be broken into subtasks for each language, though it would be nice > to provide a single script. > Not sure where the script should live; perhaps in {{bin/}}?
[jira] [Updated] (SPARK-12347) Write script to run all MLlib examples for testing
[ https://issues.apache.org/jira/browse/SPARK-12347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-12347: -- Target Version/s: 2.1.0 Priority: Critical (was: Major) > Write script to run all MLlib examples for testing > -- > > Key: SPARK-12347 > URL: https://issues.apache.org/jira/browse/SPARK-12347 > Project: Spark > Issue Type: Test > Components: ML, MLlib, PySpark, SparkR, Tests >Reporter: Joseph K. Bradley >Priority: Critical > > It would facilitate testing to have a script which runs all MLlib examples > for all languages. > Design sketch to ensure all examples are run: > * Generate a list of examples to run programmatically (not from a fixed list). > * Use a list of special examples to handle examples which require command > line arguments. > * Make sure data, etc. used are small to keep the tests quick. > This could be broken into subtasks for each language, though it would be nice > to provide a single script. > Not sure where the script should live; perhaps in {{bin/}}?
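The design sketch in the issue above (programmatic discovery plus a table of special cases) could look roughly like this. File names and arguments here are hypothetical placeholders, not the real Spark examples tree:

```python
# Discover example files programmatically rather than from a fixed list,
# with a small table of special cases that need command-line arguments.
import os

SPECIAL_ARGS = {
    # example file -> extra command-line arguments it requires (hypothetical)
    "als_example.py": ["--rank", "5"],
}

def discover_examples(root):
    """Yield (path, extra_args) for every Python example under `root`."""
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(".py"):
                yield os.path.join(dirpath, name), SPECIAL_ARGS.get(name, [])
```

Each discovered pair could then be launched through spark-submit against small datasets, keeping the whole run quick as the issue suggests.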
[jira] [Updated] (SPARK-15605) ML JavaDeveloperApiExample is broken
[ https://issues.apache.org/jira/browse/SPARK-15605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-15605: -- Priority: Major (was: Minor) > ML JavaDeveloperApiExample is broken > > > Key: SPARK-15605 > URL: https://issues.apache.org/jira/browse/SPARK-15605 > Project: Spark > Issue Type: Bug > Components: Examples, ML >Affects Versions: 2.0.0 >Reporter: Yanbo Liang > > This bug is reported by > http://apache-spark-developers-list.1001551.n3.nabble.com/Creation-of-SparkML-Estimators-in-Java-broken-td17710.html > . > To fix the issue mentioned in the mailing list, we can implement maxIter as > follows: > {code}
> private IntParam maxIter_;
> public IntParam maxIter() {
>   return maxIter_;
> }
> public int getMaxIter() {
>   return (Integer) getOrDefault(maxIter_);
> }
> public MyJavaLogisticRegression setMaxIter(int value) {
>   return (MyJavaLogisticRegression) set(maxIter_, value);
> }
> private void init() {
>   maxIter_ = new IntParam(this, "maxIter", "max number of iterations");
>   setDefault(maxIter(), 100);
> }
> {code}
> But this does not solve all problems; it will throw new exceptions: > {code} > Exception in thread "main" java.lang.IllegalArgumentException: requirement > failed: Param null__featuresCol does not belong to myJavaLogReg_fbcf0d015036. > {code} > The shared params such as "featuresCol" should also have Java-friendly > wrappers. > Looking through the code base, I found we only implement JavaParams, which is > the wrapper for Scala Params. > We still need Java-friendly wrappers for the other traits that extend Scala > Params. > For example, in Scala we have: > {code} > trait HasLabelCol extends Params > {code} > We should have a Java-friendly wrapper as follows: > {code} > class JavaHasLabelCol extends Params > {code} > Otherwise, "Param.parent" will be null for each param and it will throw > exceptions when calling "Param.hasParam()".
> I think the Java compatibility for Params/Param should be further defined in > the next release cycle, and it is better to remove the > "JavaDeveloperApiExample" for now. Since we currently cannot properly support > users implementing their own algorithms with Estimator/Transformer in Java, > the example may mislead users.
[jira] [Updated] (SPARK-15605) ML JavaDeveloperApiExample is broken
[ https://issues.apache.org/jira/browse/SPARK-15605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-15605: -- Affects Version/s: 2.0.0 > ML JavaDeveloperApiExample is broken > > > Key: SPARK-15605 > URL: https://issues.apache.org/jira/browse/SPARK-15605 > Project: Spark > Issue Type: Bug > Components: Examples, ML >Affects Versions: 2.0.0 >Reporter: Yanbo Liang > > This bug is reported by > http://apache-spark-developers-list.1001551.n3.nabble.com/Creation-of-SparkML-Estimators-in-Java-broken-td17710.html > . > To fix the issue mentioned in the mailing list, we can implement maxIter as > follows: > {code}
> private IntParam maxIter_;
> public IntParam maxIter() {
>   return maxIter_;
> }
> public int getMaxIter() {
>   return (Integer) getOrDefault(maxIter_);
> }
> public MyJavaLogisticRegression setMaxIter(int value) {
>   return (MyJavaLogisticRegression) set(maxIter_, value);
> }
> private void init() {
>   maxIter_ = new IntParam(this, "maxIter", "max number of iterations");
>   setDefault(maxIter(), 100);
> }
> {code}
> But this does not solve all problems; it will throw new exceptions: > {code} > Exception in thread "main" java.lang.IllegalArgumentException: requirement > failed: Param null__featuresCol does not belong to myJavaLogReg_fbcf0d015036. > {code} > The shared params such as "featuresCol" should also have Java-friendly > wrappers. > Looking through the code base, I found we only implement JavaParams, which is > the wrapper for Scala Params. > We still need Java-friendly wrappers for the other traits that extend Scala > Params. > For example, in Scala we have: > {code} > trait HasLabelCol extends Params > {code} > We should have a Java-friendly wrapper as follows: > {code} > class JavaHasLabelCol extends Params > {code} > Otherwise, "Param.parent" will be null for each param and it will throw > exceptions when calling "Param.hasParam()".
> I think the Java compatibility for Params/Param should be further defined in > the next release cycle, and it is better to remove the > "JavaDeveloperApiExample" for now. Since we currently cannot properly support > users implementing their own algorithms with Estimator/Transformer in Java, > the example may mislead users.
[jira] [Updated] (SPARK-15678) Not use cache on appends and overwrites
[ https://issues.apache.org/jira/browse/SPARK-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sameer Agarwal updated SPARK-15678: --- Summary: Not use cache on appends and overwrites (was: Drop cache on appends and overwrites) > Not use cache on appends and overwrites > --- > > Key: SPARK-15678 > URL: https://issues.apache.org/jira/browse/SPARK-15678 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Sameer Agarwal > > SparkSQL currently doesn't drop caches if the underlying data is overwritten. > {code}
> val dir = "/tmp/test"
> sqlContext.range(1000).write.mode("overwrite").parquet(dir)
> val df = sqlContext.read.parquet(dir).cache()
> df.count() // outputs 1000
> sqlContext.range(10).write.mode("overwrite").parquet(dir)
> sqlContext.read.parquet(dir).count() // outputs 1000 instead of 10 <-- We are still using the cached dataset
> {code}
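The staleness in the report above can be illustrated with a toy path-keyed cache. This is a deliberately simplified model, not Spark's actual cache manager; `ToyTableCache` and its methods are hypothetical:

```python
# A toy read cache keyed by path that is never invalidated on write,
# mirroring the behaviour described in SPARK-15678.
class ToyTableCache:
    def __init__(self, storage):
        self.storage = storage      # path -> list of rows (the "files")
        self.cache = {}             # path -> cached rows

    def read(self, path):
        if path not in self.cache:              # first read populates the cache
            self.cache[path] = list(self.storage[path])
        return self.cache[path]

    def write_overwrite(self, path, rows):
        self.storage[path] = list(rows)
        # BUG (mirrors the issue): the cache entry for `path` is kept.
        # The fix is to invalidate here, e.g.:
        # self.cache.pop(path, None)

storage = {}
t = ToyTableCache(storage)
t.write_overwrite("/tmp/test", range(1000))
print(len(t.read("/tmp/test")))   # 1000
t.write_overwrite("/tmp/test", range(10))
print(len(t.read("/tmp/test")))   # still 1000: the stale cached result
```

Dropping (or refreshing) the cache entry inside `write_overwrite` makes the second read return 10, which is the behaviour the issue asks for.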
[jira] [Commented] (SPARK-15675) Implicit resolution doesn't work in multiple Statements in Spark Repl
[ https://issues.apache.org/jira/browse/SPARK-15675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308998#comment-15308998 ] Prashant Sharma commented on SPARK-15675: - It looks like https://github.com/scala/scala/pull/5084/files addresses this issue. Just to verify, I am going to run Spark with a special build of Scala to see if this upstream fix nails it. > Implicit resolution doesn't work in multiple Statements in Spark Repl > -- > > Key: SPARK-15675 > URL: https://issues.apache.org/jira/browse/SPARK-15675 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Reporter: Shixiong Zhu > > Here is the reproducer: > {code}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> case class Y(i: Int)
> Seq(Y(1)).toDS()
> // Exiting paste mode, now interpreting.
> :14: error: value toDS is not a member of Seq[Y]
>    Seq(Y(1)).toDS()
>    ^
> scala> case class Y(i: Int); Seq(Y(1)).toDS()
> :13: error: value toDS is not a member of Seq[Y]
>    Seq(Y(1)).toDS()
>    ^
> {code}
[jira] [Updated] (SPARK-14343) Dataframe operations on a partitioned dataset (using partition discovery) return invalid results
[ https://issues.apache.org/jira/browse/SPARK-14343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-14343: - Issue Type: Bug (was: Sub-task) Parent: (was: SPARK-15631) > Dataframe operations on a partitioned dataset (using partition discovery) > return invalid results > > > Key: SPARK-14343 > URL: https://issues.apache.org/jira/browse/SPARK-14343 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 > Environment: Mac OS X 10.11.4 / Ubuntu 16.04 LTS >Reporter: Jurriaan Pruis >Priority: Critical > > When reading a dataset using {{sqlContext.read.text()}} queries on the > partitioned column return invalid results. > h2. How to reproduce: > h3. Generate datasets > {code:title=repro.sh}
> #!/bin/sh
> mkdir -p dataset/year=2014
> mkdir -p dataset/year=2015
> echo "data from 2014" > dataset/year=2014/part01.txt
> echo "data from 2015" > dataset/year=2015/part01.txt
> {code}
> {code:title=repro2.sh}
> #!/bin/sh
> mkdir -p dataset2/month=june
> mkdir -p dataset2/month=july
> echo "data from june" > dataset2/month=june/part01.txt
> echo "data from july" > dataset2/month=july/part01.txt
> {code}
> h3. using first dataset
> {code:none}
> >>> df = sqlContext.read.text('dataset')
> ...
> >>> df
> DataFrame[value: string, year: int]
> >>> df.show()
> +--------------+----+
> |         value|year|
> +--------------+----+
> |data from 2014|2014|
> |data from 2015|2015|
> +--------------+----+
> >>> df.select('year').show()
> +----+
> |year|
> +----+
> |  14|
> |  14|
> +----+
> {code}
> This is clearly wrong. Seems like it returns the length of the value column?
> h3. using second dataset
> With another dataset it looks like this:
> {code:none}
> >>> df = sqlContext.read.text('dataset2')
> >>> df
> DataFrame[value: string, month: string]
> >>> df.show()
> +--------------+-----+
> |         value|month|
> +--------------+-----+
> |data from june| june|
> |data from july| july|
> +--------------+-----+
> >>> df.select('month').show()
> +--------------+
> |         month|
> +--------------+
> |data from june|
> |data from july|
> +--------------+
> {code}
> Here it returns the value of the value column instead of the month partition.
> h3. Workaround
> When I convert the dataframe to an RDD and back to a DataFrame I get the
> following result (which is the expected behaviour):
> {code:none}
> >>> df.rdd.toDF().select('month').show()
> +-----+
> |month|
> +-----+
> | june|
> | july|
> +-----+
> {code}
[jira] [Updated] (SPARK-14343) Dataframe operations on a partitioned dataset (using partition discovery) return invalid results
[ https://issues.apache.org/jira/browse/SPARK-14343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-14343: - Target Version/s: 2.0.0 Priority: Critical (was: Blocker) > Dataframe operations on a partitioned dataset (using partition discovery) > return invalid results > > > Key: SPARK-14343 > URL: https://issues.apache.org/jira/browse/SPARK-14343 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.6.1, 2.0.0 > Environment: Mac OS X 10.11.4 / Ubuntu 16.04 LTS >Reporter: Jurriaan Pruis >Priority: Critical > > When reading a dataset using {{sqlContext.read.text()}} queries on the > partitioned column return invalid results. > h2. How to reproduce: > h3. Generate datasets > {code:title=repro.sh}
> #!/bin/sh
> mkdir -p dataset/year=2014
> mkdir -p dataset/year=2015
> echo "data from 2014" > dataset/year=2014/part01.txt
> echo "data from 2015" > dataset/year=2015/part01.txt
> {code}
> {code:title=repro2.sh}
> #!/bin/sh
> mkdir -p dataset2/month=june
> mkdir -p dataset2/month=july
> echo "data from june" > dataset2/month=june/part01.txt
> echo "data from july" > dataset2/month=july/part01.txt
> {code}
> h3. using first dataset
> {code:none}
> >>> df = sqlContext.read.text('dataset')
> ...
> >>> df
> DataFrame[value: string, year: int]
> >>> df.show()
> +--------------+----+
> |         value|year|
> +--------------+----+
> |data from 2014|2014|
> |data from 2015|2015|
> +--------------+----+
> >>> df.select('year').show()
> +----+
> |year|
> +----+
> |  14|
> |  14|
> +----+
> {code}
> This is clearly wrong. Seems like it returns the length of the value column?
> h3. using second dataset
> With another dataset it looks like this:
> {code:none}
> >>> df = sqlContext.read.text('dataset2')
> >>> df
> DataFrame[value: string, month: string]
> >>> df.show()
> +--------------+-----+
> |         value|month|
> +--------------+-----+
> |data from june| june|
> |data from july| july|
> +--------------+-----+
> >>> df.select('month').show()
> +--------------+
> |         month|
> +--------------+
> |data from june|
> |data from july|
> +--------------+
> {code}
> Here it returns the value of the value column instead of the month partition.
> h3. Workaround
> When I convert the dataframe to an RDD and back to a DataFrame I get the
> following result (which is the expected behaviour):
> {code:none}
> >>> df.rdd.toDF().select('month').show()
> +-----+
> |month|
> +-----+
> | june|
> | july|
> +-----+
> {code}
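The repro above relies on the `key=value` directory-naming convention that partition discovery parses. The expected behaviour is easy to state as a sketch: each `key=value` path component contributes one partition column whose value comes from the directory name, never from the file contents. This is a simplified illustration (`parse_partition_columns` is a hypothetical helper; Spark's real implementation also infers column types):

```python
# Each directory component of the form key=value contributes a partition
# column, e.g. 'year=2014/part01.txt' -> {'year': '2014'}.
def parse_partition_columns(relative_path):
    parts = {}
    for component in relative_path.split("/")[:-1]:   # skip the file name
        if "=" in component:
            key, _, value = component.partition("=")
            parts[key] = value
    return parts

print(parse_partition_columns("year=2014/part01.txt"))   # {'year': '2014'}
print(parse_partition_columns("month=june/part01.txt"))  # {'month': 'june'}
```

Selecting the partition column should therefore yield `2014`/`2015` and `june`/`july`; the reported `14` (the length of the value column) and the value-column strings are both wrong answers.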
[jira] [Resolved] (SPARK-15601) CircularBuffer's toString() to print only the contents written if buffer isn't full
[ https://issues.apache.org/jira/browse/SPARK-15601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15601. --- Resolution: Fixed Fix Version/s: 1.6.2 2.0.0 Issue resolved by pull request 13351 [https://github.com/apache/spark/pull/13351] > CircularBuffer's toString() to print only the contents written if buffer > isn't full > --- > > Key: SPARK-15601 > URL: https://issues.apache.org/jira/browse/SPARK-15601 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Tejas Patil >Priority: Trivial > Fix For: 2.0.0, 1.6.2 > > > [~sameerag] suggested this over a PR discussion: > https://github.com/apache/spark/pull/12194#discussion_r64495331 > If CircularBuffer isn't full, currently toString() will print some garbage > chars along with the content written, as it tries to print the entire array > allocated for the buffer.
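For readers unfamiliar with the bug: a toy circular buffer (a Python sketch, not Spark's actual org.apache.spark.util.CircularBuffer) shows why printing the whole backing array is wrong before the buffer wraps, and what the fixed toString() behaviour looks like:

```python
class CircularBuffer:
    """Toy fixed-size circular character buffer (illustration only)."""
    def __init__(self, size=8):
        self.buf = [None] * size  # None marks slots never written
        self.pos = 0              # next write position
        self.filled = False       # True once we have wrapped around

    def write(self, text):
        for ch in text:
            self.buf[self.pos] = ch
            self.pos = (self.pos + 1) % len(self.buf)
            if self.pos == 0:
                self.filled = True

    def __str__(self):
        if not self.filled:
            # Only the slots actually written; dumping the whole array here
            # is exactly the "garbage chars" bug described in the ticket.
            return "".join(self.buf[:self.pos])
        # Once wrapped, the oldest data starts at self.pos.
        return "".join(self.buf[self.pos:] + self.buf[:self.pos])

b = CircularBuffer(8)
b.write("abc")
print(str(b))  # "abc", not "abc" plus five uninitialized slots
```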
[jira] [Commented] (SPARK-15675) Implicit resolution doesn't work in multiple Statements in Spark Repl
[ https://issues.apache.org/jira/browse/SPARK-15675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308989#comment-15308989 ] Prashant Sharma commented on SPARK-15675: - Interesting, I am looking into it. I will have to add tests in scala/scala, to guard against changes that break these things. I am assuming something has changed around how repl handles multiline statements. Because, I think this did not happen earlier. > Implicit resolution doesn't work in multiple Statements in Spark Repl > -- > > Key: SPARK-15675 > URL: https://issues.apache.org/jira/browse/SPARK-15675 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Reporter: Shixiong Zhu > > Here is the reproducer: > {code} > scala> :paste > // Entering paste mode (ctrl-D to finish) > case class Y(i: Int) > Seq(Y(1)).toDS() > // Exiting paste mode, now interpreting. > :14: error: value toDS is not a member of Seq[Y] >Seq(Y(1)).toDS() > ^ > scala> case class Y(i: Int); Seq(Y(1)).toDS() > :13: error: value toDS is not a member of Seq[Y] > Seq(Y(1)).toDS() > ^ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15236) No way to disable Hive support in REPL
[ https://issues.apache.org/jira/browse/SPARK-15236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15236. --- Resolution: Fixed Fix Version/s: 2.0.0 > No way to disable Hive support in REPL > -- > > Key: SPARK-15236 > URL: https://issues.apache.org/jira/browse/SPARK-15236 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Xin Wu > Fix For: 2.0.0 > > > If you built Spark with Hive classes, there's no switch to flip to start a > new `spark-shell` using the InMemoryCatalog. The only thing you can do now is > to rebuild Spark again. That is quite inconvenient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
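The fix merged for this ticket exposes a configuration switch. The flag name below comes from the Spark 2.0 codebase ({{spark.sql.catalogImplementation}}, an internal option, so treat it as an assumption to verify against your build); the toy function merely illustrates the selection logic the switch enables:

```python
# Toy sketch of catalog selection. Hedged: option name and default value
# are assumptions taken from the Spark 2.0 sources, not a public API.
def choose_catalog(conf):
    impl = conf.get("spark.sql.catalogImplementation", "in-memory")
    if impl == "hive":
        return "HiveExternalCatalog"
    return "InMemoryCatalog"

print(choose_catalog({}))                                           # InMemoryCatalog
print(choose_catalog({"spark.sql.catalogImplementation": "hive"}))  # HiveExternalCatalog
```

With a switch like this, a Hive-enabled build can still start `spark-shell` against the InMemoryCatalog without rebuilding Spark.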
[jira] [Resolved] (SPARK-15618) Use SparkSession.builder.sparkContext(...) in tests where possible
[ https://issues.apache.org/jira/browse/SPARK-15618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15618. --- Resolution: Fixed Fix Version/s: 2.0.0 > Use SparkSession.builder.sparkContext(...) in tests where possible > -- > > Key: SPARK-15618 > URL: https://issues.apache.org/jira/browse/SPARK-15618 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.0.0 > > > There are many places where we could be more explicit about the particular > underlying SparkContext we want, but we just do > `SparkSession.builder.getOrCreate()` anyway. It's better to be clearer in the > code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15236) No way to disable Hive support in REPL
[ https://issues.apache.org/jira/browse/SPARK-15236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15236: -- Assignee: Xin Wu > No way to disable Hive support in REPL > -- > > Key: SPARK-15236 > URL: https://issues.apache.org/jira/browse/SPARK-15236 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Xin Wu > Fix For: 2.0.0 > > > If you built Spark with Hive classes, there's no switch to flip to start a > new `spark-shell` using the InMemoryCatalog. The only thing you can do now is > to rebuild Spark again. That is quite inconvenient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12666) spark-shell --packages cannot load artifacts which are publishLocal'd by SBT
[ https://issues.apache.org/jira/browse/SPARK-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12666: Assignee: Apache Spark > spark-shell --packages cannot load artifacts which are publishLocal'd by SBT > > > Key: SPARK-12666 > URL: https://issues.apache.org/jira/browse/SPARK-12666 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.5.1, 1.6.0 >Reporter: Josh Rosen >Assignee: Apache Spark > > Symptom: > I cloned the latest master of {{spark-redshift}}, then used {{sbt > publishLocal}} to publish it to my Ivy cache. When I tried running > {{./bin/spark-shell --packages > com.databricks:spark-redshift_2.10:0.5.3-SNAPSHOT}} to load this dependency > into {{spark-shell}}, I received the following cryptic error: > {code} > Exception in thread "main" java.lang.RuntimeException: [unresolved > dependency: com.databricks#spark-redshift_2.10;0.5.3-SNAPSHOT: configuration > not found in com.databricks#spark-redshift_2.10;0.5.3-SNAPSHOT: 'default'. It > was required from org.apache.spark#spark-submit-parent;1.0 default] > at > org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1009) > at > org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} > I think the problem here is that Spark is declaring a dependency on the > spark-redshift artifact using the {{default}} Ivy configuration. Based on my > admittedly limited understanding of Ivy, the default configuration will be > the only configuration defined in an Ivy artifact if that artifact defines no > other configurations. Thus, for Maven artifacts I think the default > configuration will end up mapping to Maven's regular JAR dependency (i.e. 
> Maven artifacts don't declare Ivy configurations so they implicitly have the > {{default}} configuration) but for Ivy artifacts I think we can run into > trouble when loading artifacts which explicitly define their own > configurations, since those artifacts might not have a configuration named > {{default}}. > I spent a bit of time playing around with the SparkSubmit code to see if I > could fix this but wasn't able to completely resolve the issue. > /cc [~brkyvz] (ping me offline and I can walk you through the repo in person, > if you'd like) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12666) spark-shell --packages cannot load artifacts which are publishLocal'd by SBT
[ https://issues.apache.org/jira/browse/SPARK-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308965#comment-15308965 ] Apache Spark commented on SPARK-12666: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/13428 > spark-shell --packages cannot load artifacts which are publishLocal'd by SBT > > > Key: SPARK-12666 > URL: https://issues.apache.org/jira/browse/SPARK-12666 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.5.1, 1.6.0 >Reporter: Josh Rosen > > Symptom: > I cloned the latest master of {{spark-redshift}}, then used {{sbt > publishLocal}} to publish it to my Ivy cache. When I tried running > {{./bin/spark-shell --packages > com.databricks:spark-redshift_2.10:0.5.3-SNAPSHOT}} to load this dependency > into {{spark-shell}}, I received the following cryptic error: > {code} > Exception in thread "main" java.lang.RuntimeException: [unresolved > dependency: com.databricks#spark-redshift_2.10;0.5.3-SNAPSHOT: configuration > not found in com.databricks#spark-redshift_2.10;0.5.3-SNAPSHOT: 'default'. It > was required from org.apache.spark#spark-submit-parent;1.0 default] > at > org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1009) > at > org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} > I think the problem here is that Spark is declaring a dependency on the > spark-redshift artifact using the {{default}} Ivy configuration. Based on my > admittedly limited understanding of Ivy, the default configuration will be > the only configuration defined in an Ivy artifact if that artifact defines no > other configurations. 
Thus, for Maven artifacts I think the default > configuration will end up mapping to Maven's regular JAR dependency (i.e. > Maven artifacts don't declare Ivy configurations so they implicitly have the > {{default}} configuration) but for Ivy artifacts I think we can run into > trouble when loading artifacts which explicitly define their own > configurations, since those artifacts might not have a configuration named > {{default}}. > I spent a bit of time playing around with the SparkSubmit code to see if I > could fix this but wasn't able to completely resolve the issue. > /cc [~brkyvz] (ping me offline and I can walk you through the repo in person, > if you'd like) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12666) spark-shell --packages cannot load artifacts which are publishLocal'd by SBT
[ https://issues.apache.org/jira/browse/SPARK-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12666: Assignee: (was: Apache Spark) > spark-shell --packages cannot load artifacts which are publishLocal'd by SBT > > > Key: SPARK-12666 > URL: https://issues.apache.org/jira/browse/SPARK-12666 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.5.1, 1.6.0 >Reporter: Josh Rosen > > Symptom: > I cloned the latest master of {{spark-redshift}}, then used {{sbt > publishLocal}} to publish it to my Ivy cache. When I tried running > {{./bin/spark-shell --packages > com.databricks:spark-redshift_2.10:0.5.3-SNAPSHOT}} to load this dependency > into {{spark-shell}}, I received the following cryptic error: > {code} > Exception in thread "main" java.lang.RuntimeException: [unresolved > dependency: com.databricks#spark-redshift_2.10;0.5.3-SNAPSHOT: configuration > not found in com.databricks#spark-redshift_2.10;0.5.3-SNAPSHOT: 'default'. It > was required from org.apache.spark#spark-submit-parent;1.0 default] > at > org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1009) > at > org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} > I think the problem here is that Spark is declaring a dependency on the > spark-redshift artifact using the {{default}} Ivy configuration. Based on my > admittedly limited understanding of Ivy, the default configuration will be > the only configuration defined in an Ivy artifact if that artifact defines no > other configurations. Thus, for Maven artifacts I think the default > configuration will end up mapping to Maven's regular JAR dependency (i.e. 
> Maven artifacts don't declare Ivy configurations so they implicitly have the > {{default}} configuration) but for Ivy artifacts I think we can run into > trouble when loading artifacts which explicitly define their own > configurations, since those artifacts might not have a configuration named > {{default}}. > I spent a bit of time playing around with the SparkSubmit code to see if I > could fix this but wasn't able to completely resolve the issue. > /cc [~brkyvz] (ping me offline and I can walk you through the repo in person, > if you'd like) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15670) Add deprecated annotation for accumulator V1 interface in JavaSparkContext class
[ https://issues.apache.org/jira/browse/SPARK-15670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15670: -- Assignee: Weichen Xu > Add deprecated annotation for accumulator V1 interface in JavaSparkContext class > -- > > Key: SPARK-15670 > URL: https://issues.apache.org/jira/browse/SPARK-15670 > Project: Spark > Issue Type: Improvement > Components: Java API, Spark Core >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Minor > Fix For: 2.0.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > Add deprecated annotation for accumulator V1 interface in JavaSparkContext class
[jira] [Resolved] (SPARK-15670) Add deprecated annotation for accumulator V1 interface in JavaSparkContext class
[ https://issues.apache.org/jira/browse/SPARK-15670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15670. --- Resolution: Fixed Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Add deprecated annotation for accumulator V1 interface in JavaSparkContext class > -- > > Key: SPARK-15670 > URL: https://issues.apache.org/jira/browse/SPARK-15670 > Project: Spark > Issue Type: Improvement > Components: Java API, Spark Core >Reporter: Weichen Xu >Priority: Minor > Fix For: 2.0.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > Add deprecated annotation for accumulator V1 interface in JavaSparkContext class
[jira] [Resolved] (SPARK-15680) Disable comments in generated code in order to avoid performance issues
[ https://issues.apache.org/jira/browse/SPARK-15680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15680. - Resolution: Fixed Fix Version/s: 2.0.0 > Disable comments in generated code in order to avoid performance issues > --- > > Key: SPARK-15680 > URL: https://issues.apache.org/jira/browse/SPARK-15680 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.0.0 > > Attachments: pasted_image_at_2016_05_29_03_02_pm.png, > pasted_image_at_2016_05_29_03_03_pm.png > > > In benchmarks involving tables with very wide and complex schemas (thousands > of columns, deep nesting), I noticed that significant amounts of time (order > of tens of seconds per task) were being spent generating comments during the > code generation phase. > The root cause of the performance problem stems from the fact that calling > {{toString()}} on a complex expression can involve thousands of string > concatenations, resulting in huge amounts (tens of gigabytes) of character > array allocation and copying (see attached profiler screenshots) > In the long term, we can avoid this problem by passing StringBuilders down > the tree and using them to accumulate output. In the short term, however, I > think that we should just disable comments in the generated code by default > since very long comments are typically not useful debugging aids (since > they're truncated for display anyways). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
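The "pass StringBuilders down the tree" idea from the description can be illustrated with a toy expression tree (a Python sketch, not Spark's codegen): the first function rebuilds strings at every level, causing the quadratic allocation and copying described in the ticket, while the second threads a single accumulator through the recursion and joins once:

```python
# Two ways to stringify a binary expression tree, e.g. ("+", left, right).
def to_string_concat(node):
    # Naive: every level allocates brand-new strings (the slow path).
    if isinstance(node, str):
        return node
    op, left, right = node
    return "(" + to_string_concat(left) + " " + op + " " + to_string_concat(right) + ")"

def to_string_builder(node, out=None):
    # Builder-style: pass one buffer down the tree; join exactly once.
    top = out is None
    if top:
        out = []
    if isinstance(node, str):
        out.append(node)
    else:
        op, left, right = node
        out.append("(")
        to_string_builder(left, out)
        out.append(" " + op + " ")
        to_string_builder(right, out)
        out.append(")")
    return "".join(out) if top else None

expr = ("+", ("*", "a", "b"), "c")
assert to_string_concat(expr) == to_string_builder(expr) == "((a * b) + c)"
```

Both produce the same text; only the allocation pattern differs, which is why the short-term fix of simply omitting comments was acceptable while the builder refactoring waited.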
[jira] [Resolved] (SPARK-15662) Add since annotation for classes in sql.catalog
[ https://issues.apache.org/jira/browse/SPARK-15662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15662. --- Resolution: Fixed Fix Version/s: 2.0.0 > Add since annotation for classes in sql.catalog > --- > > Key: SPARK-15662 > URL: https://issues.apache.org/jira/browse/SPARK-15662 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15684) Not mask startsWith and endsWith in R
Miao Wang created SPARK-15684: - Summary: Not mask startsWith and endsWith in R Key: SPARK-15684 URL: https://issues.apache.org/jira/browse/SPARK-15684 Project: Spark Issue Type: Improvement Reporter: Miao Wang Priority: Minor In R 3.3.0, it has startsWith and endsWith. We should not mask these two methods in Spark. Actually, Spark R has startsWith and endsWith working for column. But making them work for both column and string is not easy. I created this JIRA for discussion.
[jira] [Commented] (SPARK-12666) spark-shell --packages cannot load artifacts which are publishLocal'd by SBT
[ https://issues.apache.org/jira/browse/SPARK-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308950#comment-15308950 ] Bryan Cutler commented on SPARK-12666: -- This seems like it is more of an issue with SBT publishLocal, as it publishes an ivy.xml file without a "default" conf. Like this, for example {noformat} http://ant.apache.org/ivy/extra;> ... ... {noformat} If this file had a conf name="default", or if there were no confs defined an implicit "default" would be used. But since it is missing, there is an error when resolving. I think it's reasonable for Spark to expect a "default" conf to exist, but we can make it a bit more robust to handle cases like this where it doesn't exist. I'll post a PR for this. > spark-shell --packages cannot load artifacts which are publishLocal'd by SBT > > > Key: SPARK-12666 > URL: https://issues.apache.org/jira/browse/SPARK-12666 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.5.1, 1.6.0 >Reporter: Josh Rosen > > Symptom: > I cloned the latest master of {{spark-redshift}}, then used {{sbt > publishLocal}} to publish it to my Ivy cache. When I tried running > {{./bin/spark-shell --packages > com.databricks:spark-redshift_2.10:0.5.3-SNAPSHOT}} to load this dependency > into {{spark-shell}}, I received the following cryptic error: > {code} > Exception in thread "main" java.lang.RuntimeException: [unresolved > dependency: com.databricks#spark-redshift_2.10;0.5.3-SNAPSHOT: configuration > not found in com.databricks#spark-redshift_2.10;0.5.3-SNAPSHOT: 'default'. 
It > was required from org.apache.spark#spark-submit-parent;1.0 default] > at > org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1009) > at > org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} > I think the problem here is that Spark is declaring a dependency on the > spark-redshift artifact using the {{default}} Ivy configuration. Based on my > admittedly limited understanding of Ivy, the default configuration will be > the only configuration defined in an Ivy artifact if that artifact defines no > other configurations. Thus, for Maven artifacts I think the default > configuration will end up mapping to Maven's regular JAR dependency (i.e. > Maven artifacts don't declare Ivy configurations so they implicitly have the > {{default}} configuration) but for Ivy artifacts I think we can run into > trouble when loading artifacts which explicitly define their own > configurations, since those artifacts might not have a configuration named > {{default}}. > I spent a bit of time playing around with the SparkSubmit code to see if I > could fix this but wasn't able to completely resolve the issue. > /cc [~brkyvz] (ping me offline and I can walk you through the repo in person, > if you'd like) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
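Bryan's proposed robustness can be sketched as a small fallback rule (hypothetical helper, not the actual SparkSubmitUtils code in the linked PR): treat a missing conf list as the implicit {{default}}, prefer an explicit {{default}} when present, and otherwise fall back to one of the declared confs instead of failing resolution outright:

```python
# Sketch of a more forgiving Ivy configuration pick. Hedged: the function
# name and fallback choice are illustrative assumptions, not Spark's code.
def pick_configuration(declared_confs):
    if not declared_confs:          # no confs at all -> implicit "default"
        return "default"
    if "default" in declared_confs:
        return "default"
    # "default" missing (e.g. an SBT publishLocal ivy.xml): fall back to the
    # first declared conf instead of raising "configuration not found".
    return declared_confs[0]

print(pick_configuration([]))                      # default
print(pick_configuration(["compile", "runtime"]))  # compile
```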
[jira] [Commented] (SPARK-15684) Not mask startsWith and endsWith in R
[ https://issues.apache.org/jira/browse/SPARK-15684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308940#comment-15308940 ] Miao Wang commented on SPARK-15684: --- [~yanboliang] Per our discussion, this one is similar to implementing read_csv in Spark R, and there is no nice solution for this type of issue. Can you comment on this JIRA? Thanks! > Not mask startsWith and endsWith in R > - > > Key: SPARK-15684 > URL: https://issues.apache.org/jira/browse/SPARK-15684 > Project: Spark > Issue Type: Improvement >Reporter: Miao Wang >Priority: Minor > > In R 3.3.0, it has startsWith and endsWith. We should not mask these two > methods in Spark. Actually, Spark R has startsWith and endsWith working for > column. But making them work for both column and string is not easy. I created > this JIRA for discussion.
[jira] [Resolved] (SPARK-15451) Spark PR builder should fail if code doesn't compile against JDK 7
[ https://issues.apache.org/jira/browse/SPARK-15451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-15451. Resolution: Fixed Assignee: Marcelo Vanzin Fix Version/s: 2.0.0 > Spark PR builder should fail if code doesn't compile against JDK 7 > -- > > Key: SPARK-15451 > URL: https://issues.apache.org/jira/browse/SPARK-15451 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Fix For: 2.0.0 > > > We need to compile certain parts of the build using jdk8, so that we test > things like lambdas. But when possible, we should either compile using jdk7, > or provide jdk7's rt.jar to javac. Otherwise it's way too easy to slip in > jdk8-specific library calls. > I'll take a look at fixing the maven / sbt files, but I'm not sure how to > update the PR builders since this will most probably require at least a new > env variable (to say where jdk7 is). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11474) Options to jdbc load are lower cased
[ https://issues.apache.org/jira/browse/SPARK-11474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-11474: Component/s: (was: Input/Output) SQL > Options to jdbc load are lower cased > > > Key: SPARK-11474 > URL: https://issues.apache.org/jira/browse/SPARK-11474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 > Environment: Linux & Mac >Reporter: Stephen Samuel >Assignee: Huaxin Gao >Priority: Minor > Fix For: 1.5.3, 1.6.0 > > > We recently upgraded from spark 1.3.0 to 1.5.1 and one of the features we > wanted to take advantage of was the fetchSize added to the jdbc data frame > reader. > In 1.5.1 there appears to be a bug or regression, whereby an options map has > its keys lowercased. This means the existing properties from prior to 1.4 are > ok, such as dbtable, url and driver, but the newer fetchSize gets converted > to fetchsize. > To re-produce: > val conf = new SparkConf(true).setMaster("local").setAppName("fetchtest") > val sc = new SparkContext(conf) > val sql = new SQLContext(sc) > val options = Map("url" -> , "driver" -> , "fetchSize" -> ) > val df = sql.load("jdbc", options) > Breakpoint at line 371 in JDBCRDD and you'll see the options are all > lowercased, so: > val fetchSize = properties.getProperty("fetchSize", "0").toInt > results in 0 > Now I know sql.load is deprecated, but this might be occuring on other > methods too. The workaround is to use the java.util.Properties overload, > which keeps the case sensitive keys. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
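The behaviour users expect here can be sketched with a small case-insensitive options wrapper (hypothetical class, illustrative only, not Spark's internal option-map implementation): lookups ignore case, but the caller's original keys such as {{fetchSize}} are preserved rather than destructively lowercased:

```python
# Sketch: resolve JDBC-style options case-insensitively while keeping the
# original keys intact, so "fetchSize" no longer silently becomes "fetchsize".
class CaseInsensitiveOptions:
    def __init__(self, options):
        self._orig = dict(options)                       # keys kept as given
        self._lower = {k.lower(): v for k, v in options.items()}

    def get(self, key, default=None):
        return self._lower.get(key.lower(), default)

opts = CaseInsensitiveOptions({"url": "jdbc:...", "fetchSize": "100"})
print(opts.get("fetchSize"))  # "100", regardless of the casing used to look it up
```

This mirrors the workaround noted in the report: the java.util.Properties overload kept case-sensitive keys, so it avoided the lowercasing regression.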
[jira] [Commented] (SPARK-15545) R remove non-exported unused methods, like jsonRDD
[ https://issues.apache.org/jira/browse/SPARK-15545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308892#comment-15308892 ] Miao Wang commented on SPARK-15545: --- I will give a try. Thanks > R remove non-exported unused methods, like jsonRDD > -- > > Key: SPARK-15545 > URL: https://issues.apache.org/jira/browse/SPARK-15545 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Felix Cheung >Priority: Minor > > Need to review what should be removed. > one reason to not remove this right away is because we have been talking > about calling internal methods via `SparkR:::jsonRDD` for this and other RDD > methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15654) Reading gzipped files results in duplicate rows
[ https://issues.apache.org/jira/browse/SPARK-15654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15307043#comment-15307043 ] Takeshi Yamamuro edited comment on SPARK-15654 at 5/31/16 11:10 PM: Seems a root cause is that LineRecordReader cannot split files compressed by some codecs. hadoop-v2.8+ throws an exception if these kinds of files are passed into LineRecordReader (https://github.com/apache/hadoop/blame/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java#L103) This is fixed in MAPREDUCE-2094. One solution is to set Long.MaxValue at defaultMaxSplitBytes in FileSourceStrategy if unsplittable files detected there. https://github.com/apache/spark/compare/master...maropu:SPARK-15654 was (Author: maropu): Seems a root cause is that LineRecordReader cannot split files compressed by some codecs. hadoop-v2.8+ throws an exception if these kinds of files are passed into LineRecordReader (https://github.com/apache/hadoop/blame/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java#L103) This is fixed in MAPREDUCE-2094. One solution is to set Long.MaxValue at defaultMaxSplitBytes in FileSourceStrategy if splittable files detected there. https://github.com/apache/spark/compare/master...maropu:SPARK-15654 > Reading gzipped files results in duplicate rows > --- > > Key: SPARK-15654 > URL: https://issues.apache.org/jira/browse/SPARK-15654 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Jurriaan Pruis >Priority: Blocker > > When gzipped files are larger then {{spark.sql.files.maxPartitionBytes}} > reading the file will result in duplicate rows in the dataframe. 
> Given an example gzipped wordlist (of 740K bytes): > {code} > $ gzcat words.gz |wc -l > 235886 > {code} > Reading it using spark results in the following output: > {code} > >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '1000') > >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").count() > 81244093 > >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '1') > >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").count() > 8348566 > >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '10') > >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").count() > 1051469 > >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '100') > >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").count() > 235886 > {code} > You can clearly see how the number of rows scales with the number of > partitions. > Somehow the data is duplicated when the number of partitions exceeds one > (which as seen above approximately scales with the partition size). > When using distinct you'll get the correct answer: > {code} > >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '1') > >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").distinct().count() > 235886 > {code} > This looks like a pretty serious bug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
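The proposed fix in the comment thread is to never split a file whose codec is unsplittable, regardless of {{spark.sql.files.maxPartitionBytes}}. A toy partition planner (Python sketch; the file-suffix check stands in for Hadoop's real codec-splittability test, which is an assumption here) illustrates the idea:

```python
# Sketch: plan read partitions, giving unsplittable files one whole-file
# slice. Hedged: suffix matching is a stand-in for a proper codec check.
UNSPLITTABLE_SUFFIXES = (".gz",)

def plan_partitions(files, max_partition_bytes):
    partitions = []  # (file name, start offset, length)
    for name, size in files:
        if name.endswith(UNSPLITTABLE_SUFFIXES):
            partitions.append((name, 0, size))  # one slice = whole file
        else:
            offset = 0
            while offset < size:
                length = min(max_partition_bytes, size - offset)
                partitions.append((name, offset, length))
                offset += length
    return partitions

print(plan_partitions([("words.gz", 740000)], 1000))     # a single partition
print(len(plan_partitions([("words.txt", 3000)], 1000))) # 3
```

Splitting a gzip stream at arbitrary byte offsets cannot work, so each split re-reading from the start is what produced the duplicated rows above.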
[jira] [Commented] (SPARK-6320) Adding new query plan strategy to SQLContext
[ https://issues.apache.org/jira/browse/SPARK-6320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308824#comment-15308824 ] Apache Spark commented on SPARK-6320: - User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/13426 > Adding new query plan strategy to SQLContext > > > Key: SPARK-6320 > URL: https://issues.apache.org/jira/browse/SPARK-6320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 >Reporter: Youssef Hatem >Priority: Minor > > Hi, > I would like to add a new strategy to {{SQLContext}}. To do this I created a > new class which extends {{Strategy}}. In my new class I need to call > {{planLater}} function. However this method is defined in {{SparkPlanner}} > (which itself inherits the method from {{QueryPlanner}}). > To my knowledge the only way to make {{planLater}} function visible to my new > strategy is to define my strategy inside another class that extends > {{SparkPlanner}} and inherits {{planLater}} as a result, by doing so I will > have to extend the {{SQLContext}} such that I can override the {{planner}} > field with the new {{Planner}} class I created. > It seems that this is a design problem because adding a new strategy seems to > require extending {{SQLContext}} (unless I am doing it wrong and there is a > better way to do it). > Thanks a lot, > Youssef -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15441) dataset outer join seems to return incorrect result
[ https://issues.apache.org/jira/browse/SPARK-15441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308815#comment-15308815 ] Apache Spark commented on SPARK-15441: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/13425 > dataset outer join seems to return incorrect result > --- > > Key: SPARK-15441 > URL: https://issues.apache.org/jira/browse/SPARK-15441 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Wenchen Fan >Priority: Critical > > See notebook > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/2836020637783173/5382278320999420/latest.html > {code} > import org.apache.spark.sql.functions > val left = List(("a", 1), ("a", 2), ("b", 3), ("c", 4)).toDS() > val right = List(("a", "x"), ("b", "y"), ("d", "z")).toDS() > // The last row _1 should be null, rather than (null, -1) > left.toDF("k", "v").as[(String, Int)].alias("left") > .joinWith(right.toDF("k", "u").as[(String, String)].alias("right"), > functions.col("left.k") === functions.col("right.k"), "right_outer") > .show() > {code} > The returned result currently is > {code} > +---------+-----+ > |       _1|   _2| > +---------+-----+ > |    (a,2)|(a,x)| > |    (a,1)|(a,x)| > |    (b,3)|(b,y)| > |(null,-1)|(d,z)| > +---------+-----+ > {code}
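For reference, the expected right-outer-join semantics: an unmatched right row should pair with a true null on the left, not with a default-valued tuple like (null, -1). A minimal pure-Python model of the intended behavior (illustrative only, not Spark's join implementation):

```python
def right_outer_join(left, right, key_left, key_right):
    # Pair every right row with its matching left rows; when no left row
    # matches, emit (None, right_row) -- the left side is a real null,
    # not a tuple of default values.
    out = []
    for r in right:
        matches = [l for l in left if key_left(l) == key_right(r)]
        if matches:
            out.extend((l, r) for l in matches)
        else:
            out.append((None, r))  # unmatched right row
    return out

left = [("a", 1), ("a", 2), ("b", 3), ("c", 4)]
right = [("a", "x"), ("b", "y"), ("d", "z")]
result = right_outer_join(left, right, lambda t: t[0], lambda t: t[0])
# ("d", "z") has no left-side match, so its left element is None
```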
[jira] [Commented] (SPARK-15447) Performance test for ALS in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-15447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308797#comment-15308797 ] Nick Pentreath commented on SPARK-15447: Created a Google sheet with initial results: https://docs.google.com/spreadsheets/d/1iX5LisfXcZSTCHp8VPoo5z-eCO85A5VsZDtZ5e475ks/edit?usp=sharing So far for SPARK-6717 I've just used {{spark-perf}} to compare the RDD-based APIs (as the checkpointing only impacts the RDD-based {{train}} method). From these results no red flags, and 2.0 is actually faster in general relative to 1.6. Checkpointing does add a minor overhead (but this overhead is consistent across the versions and again better in 2.0). There is something a little weird about the 1.6 results for 10m ratings case, but not sure what's going on there - I've rerun a few times with the same result. Also, haven't managed to get to 1b ratings yet due to cluster size, will keep working on it. > Performance test for ALS in Spark 2.0 > - > > Key: SPARK-15447 > URL: https://issues.apache.org/jira/browse/SPARK-15447 > Project: Spark > Issue Type: Task > Components: ML >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Nick Pentreath >Priority: Critical > Labels: QA > > We made several changes to ALS in 2.0. It is necessary to run some tests to > avoid performance regression. We should test (synthetic) datasets from 1 > million ratings to 1 billion ratings. > cc [~mlnick] [~holdenk] Do you have time to run some large-scale performance > tests? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15557) expression ((cast(99 as decimal) + '3') * '2.3' ) return null
[ https://issues.apache.org/jira/browse/SPARK-15557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-15557: --- Assignee: Dilip Biswal > expression ((cast(99 as decimal) + '3') * '2.3' ) return null > - > > Key: SPARK-15557 > URL: https://issues.apache.org/jira/browse/SPARK-15557 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 > Environment: spark1.6.1 >Reporter: cen yuhai >Assignee: Dilip Biswal > Fix For: 2.0.0 > > > expression "select (cast(99 as decimal(19,6))+ '3')*'2.3' " will return null > expression "select (cast(40 as decimal(19,6))+ '3')*'2.3' " is OK > I find that maybe it will be null if the result is more than 100 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15557) expression ((cast(99 as decimal) + '3') * '2.3' ) return null
[ https://issues.apache.org/jira/browse/SPARK-15557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-15557. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13368 [https://github.com/apache/spark/pull/13368] > expression ((cast(99 as decimal) + '3') * '2.3' ) return null > - > > Key: SPARK-15557 > URL: https://issues.apache.org/jira/browse/SPARK-15557 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 > Environment: spark1.6.1 >Reporter: cen yuhai > Fix For: 2.0.0 > > > expression "select (cast(99 as decimal(19,6))+ '3')*'2.3' " will return null > expression "select (cast(40 as decimal(19,6))+ '3')*'2.3' " is OK > I find that maybe it will be null if the result is more than 100 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
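The symptom matches a precision overflow in the inferred result type: {{(99 + 3) * 2.3 = 234.6}} needs four significant digits while {{(40 + 3) * 2.3 = 98.9}} needs only three, so if the analyzer allots too few digits to the result, only the larger value overflows, and Spark represents decimal overflow as null. Python's {{decimal}} module can mimic the effect by trapping results that need more digits than the context allows (the precision values below are illustrative, not Spark's actual inference rules):

```python
from decimal import Context, Decimal, Inexact

# With 4 significant digits, both products fit exactly.
wide = Context(prec=4, traps=[Inexact])
assert wide.multiply(Decimal("102"), Decimal("2.3")) == Decimal("234.6")

# With only 3 significant digits, 98.9 still fits but 234.6 does not:
# the Inexact trap fires, which is the analogue of Spark returning null
# for the cast(99 ...) expression while the cast(40 ...) one works.
narrow = Context(prec=3, traps=[Inexact])
assert narrow.multiply(Decimal("43"), Decimal("2.3")) == Decimal("98.9")
try:
    narrow.multiply(Decimal("102"), Decimal("2.3"))
    overflowed = False
except Inexact:
    overflowed = True
```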
[jira] [Assigned] (SPARK-15489) Dataset kryo encoder won't load custom user settings
[ https://issues.apache.org/jira/browse/SPARK-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15489: Assignee: Apache Spark > Dataset kryo encoder won't load custom user settings > - > > Key: SPARK-15489 > URL: https://issues.apache.org/jira/browse/SPARK-15489 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Amit Sela >Assignee: Apache Spark > > When setting a custom "spark.kryo.registrator" (or any other configuration > for that matter) through the API, this configuration will not propagate to > the encoder that uses a KryoSerializer since it instantiates with "new > SparkConf()". > See: > https://github.com/apache/spark/blob/07c36a2f07fcf5da6fb395f830ebbfc10eb27dcc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala#L554 > This could be hacked by providing those configurations as System properties, > but this probably should be passed to the encoder and set in the > SerializerInstance after creation. > Example: > When using Encoders with kryo to encode generically typed Objects in the > following manner: > public static Encoder encoder() { > return Encoders.kryo((Class) Object.class); > } > I get a decoding exception when trying to decode > `java.util.Collections$UnmodifiableCollection`, which probably comes from > Guava's `ImmutableList`. > This happens when running with master = local[1]. Same code had no problems > with RDD api. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15489) Dataset kryo encoder won't load custom user settings
[ https://issues.apache.org/jira/browse/SPARK-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15489: Assignee: (was: Apache Spark) > Dataset kryo encoder won't load custom user settings > - > > Key: SPARK-15489 > URL: https://issues.apache.org/jira/browse/SPARK-15489 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Amit Sela > > When setting a custom "spark.kryo.registrator" (or any other configuration > for that matter) through the API, this configuration will not propagate to > the encoder that uses a KryoSerializer since it instantiates with "new > SparkConf()". > See: > https://github.com/apache/spark/blob/07c36a2f07fcf5da6fb395f830ebbfc10eb27dcc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala#L554 > This could be hacked by providing those configurations as System properties, > but this probably should be passed to the encoder and set in the > SerializerInstance after creation. > Example: > When using Encoders with kryo to encode generically typed Objects in the > following manner: > public static Encoder encoder() { > return Encoders.kryo((Class) Object.class); > } > I get a decoding exception when trying to decode > `java.util.Collections$UnmodifiableCollection`, which probably comes from > Guava's `ImmutableList`. > This happens when running with master = local[1]. Same code had no problems > with RDD api. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15489) Dataset kryo encoder won't load custom user settings
[ https://issues.apache.org/jira/browse/SPARK-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308794#comment-15308794 ] Apache Spark commented on SPARK-15489: -- User 'amitsela' has created a pull request for this issue: https://github.com/apache/spark/pull/13424 > Dataset kryo encoder won't load custom user settings > - > > Key: SPARK-15489 > URL: https://issues.apache.org/jira/browse/SPARK-15489 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Amit Sela > > When setting a custom "spark.kryo.registrator" (or any other configuration > for that matter) through the API, this configuration will not propagate to > the encoder that uses a KryoSerializer since it instantiates with "new > SparkConf()". > See: > https://github.com/apache/spark/blob/07c36a2f07fcf5da6fb395f830ebbfc10eb27dcc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala#L554 > This could be hacked by providing those configurations as System properties, > but this probably should be passed to the encoder and set in the > SerializerInstance after creation. > Example: > When using Encoders with kryo to encode generically typed Objects in the > following manner: > public static Encoder encoder() { > return Encoders.kryo((Class) Object.class); > } > I get a decoding exception when trying to decode > `java.util.Collections$UnmodifiableCollection`, which probably comes from > Guava's `ImmutableList`. > This happens when running with master = local[1]. Same code had no problems > with RDD api. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
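The proposed fix direction — pass the active configuration to the encoder's serializer instead of instantiating defaults — can be sketched generically. Plain dicts stand in for SparkConf here; the helper name is hypothetical, not the real API:

```python
# The encoder currently does the equivalent of `new SparkConf()`, so user
# settings such as spark.kryo.registrator never reach the serializer.
DEFAULTS = {"spark.kryo.registrator": None}

def make_serializer_conf(active_conf=None):
    # Proposed behavior: overlay the session's active conf on the defaults
    # instead of ignoring it.
    return {**DEFAULTS, **(active_conf or {})}

session_conf = {"spark.kryo.registrator": "com.example.MyRegistrator"}
broken = make_serializer_conf()             # what `new SparkConf()` yields
fixed = make_serializer_conf(session_conf)  # what the encoder should use
```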
[jira] [Updated] (SPARK-15076) Add ReorderAssociativeOperator optimizer
[ https://issues.apache.org/jira/browse/SPARK-15076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-15076: -- Description: This issue adds a new optimizer `ReorderAssociativeOperator` by taking advantage of integral associative property. Currently, Spark works like the following. 1) Can optimize `1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + a` into `45 + a`. 2) Cannot optimize `a + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9`. This issue can handle Case 2 for *Add/Multiply* expression whose data types are `ByteType`, `ShortType`, `IntegerType`, and `LongType`. The following is the plan comparison between `before` and `after` this issue. **Before** {code} scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) a)").explain == Physical Plan == WholeStageCodegen : +- Project [(a#7 + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9) AS (a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8] : +- INPUT +- Generate explode([1]), false, false, [a#7] +- Scan OneRowRelation[] {code} **After** {code} scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) a)").explain == Physical Plan == WholeStageCodegen : +- Project [(a#7 + 45) AS (a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8] : +- INPUT +- Generate explode([1]), false, false, [a#7] +- Scan OneRowRelation[] {code} was: This issue adds a new optimizer `ConstantFolding` by taking advantage of integral associative property. Currently, `ConstantFolding` works like the following. 1) Can optimize `1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + a` into `45 + a`. 2) Cannot optimize `a + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9`. This issue enables `ConstantFolding` to handle Case 2 for *Add/Multiply* expression whose data types are `ByteType`, `ShortType`, `IntegerType`, and `LongType`. The following is the plan comparison between `before` and `after` this issue. 
**Before** {code} scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) a)").explain == Physical Plan == WholeStageCodegen : +- Project [(a#7 + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9) AS (a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8] : +- INPUT +- Generate explode([1]), false, false, [a#7] +- Scan OneRowRelation[] {code} **After** {code} scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) a)").explain == Physical Plan == WholeStageCodegen : +- Project [(a#7 + 45) AS (a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8] : +- INPUT +- Generate explode([1]), false, false, [a#7] +- Scan OneRowRelation[] {code} > Add ReorderAssociativeOperator optimizer > > > Key: SPARK-15076 > URL: https://issues.apache.org/jira/browse/SPARK-15076 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Reporter: Dongjoon Hyun > > This issue adds a new optimizer `ReorderAssociativeOperator` by taking > advantage of integral associative property. Currently, Spark works like the > following. > 1) Can optimize `1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + a` into `45 + a`. > 2) Cannot optimize `a + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9`. > This issue can handle Case 2 for *Add/Multiply* expression whose data types > are `ByteType`, `ShortType`, `IntegerType`, and `LongType`. The following > is the plan comparison between `before` and `after` this issue. 
> **Before** > {code} > scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) > a)").explain > == Physical Plan == > WholeStageCodegen > : +- Project [(a#7 + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9) AS > (a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8] > : +- INPUT > +- Generate explode([1]), false, false, [a#7] >+- Scan OneRowRelation[] > {code} > **After** > {code} > scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) > a)").explain > == Physical Plan == > WholeStageCodegen > : +- Project [(a#7 + 45) AS (a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + > 8) + 9)#8] > : +- INPUT > +- Generate explode([1]), false, false, [a#7] >+- Scan OneRowRelation[] > {code}
[jira] [Updated] (SPARK-15076) Add ReorderAssociativeOperator optimizer
[ https://issues.apache.org/jira/browse/SPARK-15076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-15076: -- Summary: Add ReorderAssociativeOperator optimizer (was: Improve ConstantFolding optimizer to use integral associative property) > Add ReorderAssociativeOperator optimizer > > > Key: SPARK-15076 > URL: https://issues.apache.org/jira/browse/SPARK-15076 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Reporter: Dongjoon Hyun > > This issue improves the `ConstantFolding` optimizer by taking advantage of > integral associative property. Currently, `ConstantFolding` works like the > following. > 1) Can optimize `1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + a` into `45 + a`. > 2) Cannot optimize `a + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9`. > This issue enables `ConstantFolding` to handle Case 2 for *Add/Multiply* > expression whose data types are `ByteType`, `ShortType`, `IntegerType`, and > `LongType`. The following is the plan comparison between `before` and > `after` this issue. > **Before** > {code} > scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) > a)").explain > == Physical Plan == > WholeStageCodegen > : +- Project [(a#7 + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9) AS > (a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8] > : +- INPUT > +- Generate explode([1]), false, false, [a#7] >+- Scan OneRowRelation[] > {code} > **After** > {code} > scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) > a)").explain > == Physical Plan == > WholeStageCodegen > : +- Project [(a#7 + 45) AS (a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + > 8) + 9)#8] > : +- INPUT > +- Generate explode([1]), false, false, [a#7] >+- Scan OneRowRelation[] > {code}
[jira] [Updated] (SPARK-15076) Add ReorderAssociativeOperator optimizer
[ https://issues.apache.org/jira/browse/SPARK-15076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-15076: -- Description: This issue adds a new optimizer `ConstantFolding` by taking advantage of integral associative property. Currently, `ConstantFolding` works like the following. 1) Can optimize `1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + a` into `45 + a`. 2) Cannot optimize `a + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9`. This issue enables `ConstantFolding` to handle Case 2 for *Add/Multiply* expression whose data types are `ByteType`, `ShortType`, `IntegerType`, and `LongType`. The following is the plan comparison between `before` and `after` this issue. **Before** {code} scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) a)").explain == Physical Plan == WholeStageCodegen : +- Project [(a#7 + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9) AS (a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8] : +- INPUT +- Generate explode([1]), false, false, [a#7] +- Scan OneRowRelation[] {code} **After** {code} scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) a)").explain == Physical Plan == WholeStageCodegen : +- Project [(a#7 + 45) AS (a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8] : +- INPUT +- Generate explode([1]), false, false, [a#7] +- Scan OneRowRelation[] {code} was: This issue improves the `ConstantFolding` optimizer by taking advantage of integral associative property. Currently, `ConstantFolding` works like the following. 1) Can optimize `1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + a` into `45 + a`. 2) Cannot optimize `a + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9`. This issue enables `ConstantFolding` to handle Case 2 for *Add/Multiply* expression whose data types are `ByteType`, `ShortType`, `IntegerType`, and `LongType`. The following is the plan comparison between `before` and `after` this issue. 
**Before** {code} scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) a)").explain == Physical Plan == WholeStageCodegen : +- Project [(a#7 + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9) AS (a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8] : +- INPUT +- Generate explode([1]), false, false, [a#7] +- Scan OneRowRelation[] {code} **After** {code} scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) a)").explain == Physical Plan == WholeStageCodegen : +- Project [(a#7 + 45) AS (a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8] : +- INPUT +- Generate explode([1]), false, false, [a#7] +- Scan OneRowRelation[] {code} > Add ReorderAssociativeOperator optimizer > > > Key: SPARK-15076 > URL: https://issues.apache.org/jira/browse/SPARK-15076 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Reporter: Dongjoon Hyun > > This issue adds a new optimizer `ConstantFolding` by taking advantage of > integral associative property. Currently, `ConstantFolding` works like the > following. > 1) Can optimize `1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + a` into `45 + a`. > 2) Cannot optimize `a + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9`. > This issue enables `ConstantFolding` to handle Case 2 for *Add/Multiply* > expression whose data types are `ByteType`, `ShortType`, `IntegerType`, and > `LongType`. The following > is the plan comparison between `before` and > `after` this issue. 
> **Before** > {code} > scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) > a)").explain > == Physical Plan == > WholeStageCodegen > : +- Project [(a#7 + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9) AS > (a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8] > : +- INPUT > +- Generate explode([1]), false, false, [a#7] >+- Scan OneRowRelation[] > {code} > **After** > {code} > scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) > a)").explain > == Physical Plan == > WholeStageCodegen > : +- Project [(a#7 + 45) AS (a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + > 8) + 9)#8] > : +- INPUT > +- Generate explode([1]), false, false, [a#7] >+- Scan OneRowRelation[] > {code}
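The transformation the optimizer performs — flatten an associative chain of additions, fold the literal terms into one constant, and rebuild the expression — can be sketched over plain tuples (this models the idea only, not Catalyst's actual expression trees):

```python
def flatten_add(expr):
    # expr is an int literal, a str column name, or ("+", left, right)
    if isinstance(expr, tuple) and expr[0] == "+":
        return flatten_add(expr[1]) + flatten_add(expr[2])
    return [expr]

def reorder_and_fold(expr):
    # Associativity lets us gather all literal terms regardless of where
    # they sit in the chain, then fold them into a single constant.
    terms = flatten_add(expr)
    const = sum(t for t in terms if isinstance(t, int))
    cols = [t for t in terms if isinstance(t, str)]
    if not cols:
        return const
    result = cols[0]
    for c in cols[1:]:
        result = ("+", result, c)
    if const != 0:
        result = ("+", result, const)
    return result

# Build a + 1 + 2 + ... + 9, the case Spark previously could not fold
expr = "a"
for i in range(1, 10):
    expr = ("+", expr, i)
# reorder_and_fold(expr) collapses the nine literals into a single 45
```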
[jira] [Resolved] (SPARK-15327) Catalyst code generation fails with complex data structure
[ https://issues.apache.org/jira/browse/SPARK-15327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-15327. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13235 [https://github.com/apache/spark/pull/13235] > Catalyst code generation fails with complex data structure > -- > > Key: SPARK-15327 > URL: https://issues.apache.org/jira/browse/SPARK-15327 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Jurriaan Pruis >Assignee: Davies Liu > Fix For: 2.0.0 > > Attachments: full_exception.txt > > > Spark code generation fails with the following error when loading parquet > files with a complex structure: > {code} > : java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 158, Column 16: Expression "scan_isNull" is not an > rvalue > {code} > The generated code on line 158 looks like: > {code} > /* 153 */ this.scan_arrayWriter23 = new > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter(); > /* 154 */ this.scan_rowWriter40 = new > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(scan_holder, > 1); > /* 155 */ } > /* 156 */ > /* 157 */ private void scan_apply_0(InternalRow scan_row) { > /* 158 */ if (scan_isNull) { > /* 159 */ scan_rowWriter.setNullAt(0); > /* 160 */ } else { > /* 161 */ // Remember the current cursor so that we can calculate how > many bytes are > /* 162 */ // written later. 
> /* 163 */ final int scan_tmpCursor = scan_holder.cursor; > /* 164 */ > {code} > How to reproduce (Pyspark): > {code} > # Some complex structure > json = '{"h": {"b": {"c": [{"e": "adfgd"}], "a": [{"e": "testing", "count": > 3}], "b": [{"e": "test", "count": 1}]}}, "d": {"b": {"c": [{"e": "adfgd"}], > "a": [{"e": "testing", "count": 3}], "b": [{"e": "test", "count": 1}]}}, "c": > {"b": {"c": [{"e": "adfgd"}], "a": [{"count": 3}], "b": [{"e": "test", > "count": 1}]}}, "a": {"b": {"c": [{"e": "adfgd"}], "a": [{"count": 3}], "b": > [{"e": "test", "count": 1}]}}, "e": {"b": {"c": [{"e": "adfgd"}], "a": [{"e": > "testing", "count": 3}], "b": [{"e": "test", "count": 1}]}}, "g": {"b": {"c": > [{"e": "adfgd"}], "a": [{"e": "testing", "count": 3}], "b": [{"e": "test", > "count": 1}]}}, "f": {"b": {"c": [{"e": "adfgd"}], "a": [{"e": "testing", > "count": 3}], "b": [{"e": "test", "count": 1}]}}, "b": {"b": {"c": [{"e": > "adfgd"}], "a": [{"count": 3}], "b": [{"e": "test", "count": 1}]}}}' > # Write to parquet file > sqlContext.read.json(sparkContext.parallelize([json])).write.mode('overwrite').parquet('test') > # Try to read from parquet file (this generates an exception) > sqlContext.read.parquet('test').collect() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows
[ https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-6859. --- Resolution: Fixed Assignee: Ryan Blue Fix Version/s: 2.0.0 Fixed by upgrading parquet-mr to 1.8.1. > Parquet File Binary column statistics error when reuse byte[] among rows > > > Key: SPARK-6859 > URL: https://issues.apache.org/jira/browse/SPARK-6859 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0, 1.3.0, 1.4.0 >Reporter: Yijie Shen >Assignee: Ryan Blue >Priority: Minor > Fix For: 2.0.0 > > > Suppose I create a dataRDD which extends RDD\[Row\], and each row is > GenericMutableRow(Array(Int, Array\[Byte\])). A same Array\[Byte\] object is > reused among rows but has different content each time. When I convert it to a > dataFrame and save it as Parquet File, the file's row group statistic(max & > min) of Binary column would be wrong. > \\ > \\ > Here is the reason: In Parquet, BinaryStatistic just keep max & min as > parquet.io.api.Binary references, Spark sql would generate a new Binary > backed by the same Array\[Byte\] passed from row. > > | |reference| |backed| | > |max: Binary|-->|ByteArrayBackedBinary|-->|Array\[Byte\]| > Therefore, each time parquet updating row group's statistic, max & min would > always refer to the same Array\[Byte\], which has new content each time. When > parquet decides to save it into file, the last row's content would be saved > as both max & min. > \\ > \\ > It seems it is a parquet bug because it's parquet's responsibility to update > statistics correctly. > But not quite sure. Should I report it as a bug in parquet JIRA? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows
[ https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308754#comment-15308754 ] Cheng Lian commented on SPARK-6859: --- Yea, thanks. I'm closing it. > Parquet File Binary column statistics error when reuse byte[] among rows > > > Key: SPARK-6859 > URL: https://issues.apache.org/jira/browse/SPARK-6859 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0, 1.3.0, 1.4.0 >Reporter: Yijie Shen >Priority: Minor > Fix For: 2.0.0 > > > Suppose I create a dataRDD which extends RDD\[Row\], and each row is > GenericMutableRow(Array(Int, Array\[Byte\])). A same Array\[Byte\] object is > reused among rows but has different content each time. When I convert it to a > dataFrame and save it as Parquet File, the file's row group statistic(max & > min) of Binary column would be wrong. > \\ > \\ > Here is the reason: In Parquet, BinaryStatistic just keep max & min as > parquet.io.api.Binary references, Spark sql would generate a new Binary > backed by the same Array\[Byte\] passed from row. > > | |reference| |backed| | > |max: Binary|-->|ByteArrayBackedBinary|-->|Array\[Byte\]| > Therefore, each time parquet updating row group's statistic, max & min would > always refer to the same Array\[Byte\], which has new content each time. When > parquet decides to save it into file, the last row's content would be saved > as both max & min. > \\ > \\ > It seems it is a parquet bug because it's parquet's responsibility to update > statistics correctly. > But not quite sure. Should I report it as a bug in parquet JIRA? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
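The aliasing failure the reporter describes — statistics that hold a reference to the row's reused buffer see every later mutation, so min and max both collapse to the last row written — can be reproduced in a few lines of Python. This is a conceptual model, not Parquet's actual statistics code:

```python
class Stats:
    """Tracks min/max, either by copying values or by aliasing the buffer."""
    def __init__(self, copy_on_update):
        self.copy = copy_on_update
        self.min = self.max = None

    def update(self, buf: bytearray):
        value = bytes(buf) if self.copy else buf  # copy vs. keep a reference
        if self.min is None or bytes(value) < bytes(self.min):
            self.min = value
        if self.max is None or bytes(value) > bytes(self.max):
            self.max = value

def write_rows(stats):
    buf = bytearray(3)              # one buffer reused for every row
    for word in (b"foo", b"bar", b"zap", b"mid"):
        buf[:] = word               # overwrite contents in place
        stats.update(buf)
    return bytes(stats.min), bytes(stats.max)

# Aliasing: min and max both end up as the last row written (b"mid").
# Copying: min is b"bar" and max is b"zap", as expected.
```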
[jira] [Commented] (SPARK-15451) Spark PR builder should fail if code doesn't compile against JDK 7
[ https://issues.apache.org/jira/browse/SPARK-15451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308703#comment-15308703 ] Sean Owen commented on SPARK-15451: --- Yeah this has come up a few times in Spark -- there's a duplicate that I won't even bother to find and link here. Classic, weird cross compilation issue. > Spark PR builder should fail if code doesn't compile against JDK 7 > -- > > Key: SPARK-15451 > URL: https://issues.apache.org/jira/browse/SPARK-15451 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > > We need to compile certain parts of the build using jdk8, so that we test > things like lambdas. But when possible, we should either compile using jdk7, > or provide jdk7's rt.jar to javac. Otherwise it's way too easy to slip in > jdk8-specific library calls. > I'll take a look at fixing the maven / sbt files, but I'm not sure how to > update the PR builders since this will most probably require at least a new > env variable (to say where jdk7 is). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11293) ExternalSorter and ExternalAppendOnlyMap should free shuffle memory in their stop() methods
[ https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-11293: --- Summary: ExternalSorter and ExternalAppendOnlyMap should free shuffle memory in their stop() methods (was: Spillable collections leak shuffle memory) > ExternalSorter and ExternalAppendOnlyMap should free shuffle memory in their > stop() methods > --- > > Key: SPARK-11293 > URL: https://issues.apache.org/jira/browse/SPARK-11293 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.1, 1.4.1, 1.5.1, 1.6.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Critical > Fix For: 1.6.0 > > > I discovered multiple leaks of shuffle memory while working on my memory > manager consolidation patch, which added the ability to do strict memory leak > detection for the bookkeeping that used to be performed by the > ShuffleMemoryManager. This uncovered a handful of places where tasks can > acquire execution/shuffle memory but never release it, starving themselves of > memory. > Problems that I found: > * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution > memory. > * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a > {{CompletionIterator}}. > * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing > its resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13484) Filter outer joined result using a non-nullable column from the right table
[ https://issues.apache.org/jira/browse/SPARK-13484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308734#comment-15308734 ] Apache Spark commented on SPARK-13484: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/13290 > Filter outer joined result using a non-nullable column from the right table > --- > > Key: SPARK-13484 > URL: https://issues.apache.org/jira/browse/SPARK-13484 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.0, 2.0.0 >Reporter: Xiangrui Meng > > Technically speaking, this is not a bug. But > {code} > val a = sqlContext.range(10).select(col("id"), lit(0).as("count")) > val b = sqlContext.range(10).select((col("id") % > 3).as("id")).groupBy("id").count() > a.join(b, a("id") === b("id"), "left_outer").filter(b("count").isNull).show() > {code} > returns nothing. This is because `b("count")` is not nullable and the filter > condition is always false by static analysis. However, it is common for users > to use `a(...)` and `b(...)` to filter the joined result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11293) Spillable collections leak shuffle memory
[ https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-11293:
-------------------------------
    Affects Version/s:     (was: 1.6.1)
[jira] [Resolved] (SPARK-11293) Spillable collections leak shuffle memory
[ https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen resolved SPARK-11293.
--------------------------------
       Resolution: Fixed
    Fix Version/s: 1.6.0

This issue was fixed in 1.6.0 as part of the patch for SPARK-10984 (my original PR for that issue was subsumed by the PR for SPARK-10984). We will not backport this patch to 1.5.x or 1.4.x because we do not plan to have further non-security-fix releases for those branches.
[jira] [Assigned] (SPARK-11293) Spillable collections leak shuffle memory
[ https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen reassigned SPARK-11293:
----------------------------------
    Assignee: Josh Rosen  (was: Davies Liu)