[jira] [Assigned] (SPARK-18425) Test `CompactibleFileStreamLog` directly
[ https://issues.apache.org/jira/browse/SPARK-18425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18425:
------------------------------------

    Assignee:     (was: Apache Spark)

> Test `CompactibleFileStreamLog` directly
> ----------------------------------------
>
>                 Key: SPARK-18425
>                 URL: https://issues.apache.org/jira/browse/SPARK-18425
>             Project: Spark
>          Issue Type: Test
>          Components: Structured Streaming, Tests
>    Affects Versions: 2.0.1
>            Reporter: Liwei Lin
>            Priority: Minor
>
> Right now we are testing {{CompactibleFileStreamLog}} in
> {{FileStreamSinkLogSuite}} (because {{FileStreamSinkLog}} was once the only
> subclass of {{CompactibleFileStreamLog}}, but that is no longer the case).
> Let's do some refactoring so that {{CompactibleFileStreamLog}} is tested
> directly, making future changes to {{CompactibleFileStreamLog}} much easier
> to test and much easier to review.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
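The refactoring this issue asks for usually follows a standard pattern: give the test suite its own minimal concrete subclass of the abstract class, so base-class behavior can be exercised without going through a production subclass such as {{FileStreamSinkLog}}. A hedged Scala sketch of that pattern; every name and the compaction rule below are illustrative, not Spark's actual `CompactibleFileStreamLog` API:

```scala
// Stand-in for the abstract base class under test (illustrative, not Spark's).
abstract class CompactibleLog {
  def compactInterval: Int

  // Example of base-class logic worth testing directly: every
  // `compactInterval`-th batch is a compaction batch.
  def isCompactionBatch(batchId: Long): Boolean =
    (batchId + 1) % compactInterval == 0
}

// A throwaway subclass defined only for the test suite.
class FakeCompactibleLog(override val compactInterval: Int) extends CompactibleLog

object CompactibleLogSuite extends App {
  val log = new FakeCompactibleLog(compactInterval = 3)
  assert(!log.isCompactionBatch(0))
  assert(!log.isCompactionBatch(1))
  assert(log.isCompactionBatch(2)) // every 3rd batch compacts
}
```

The point of the fake subclass is that tests no longer break when `FileStreamSinkLog` evolves: only the base-class contract is pinned down.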
[jira] [Commented] (SPARK-18425) Test `CompactibleFileStreamLog` directly
[ https://issues.apache.org/jira/browse/SPARK-18425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15660880#comment-15660880 ]

Apache Spark commented on SPARK-18425:
--------------------------------------

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/15870

> Test `CompactibleFileStreamLog` directly
[jira] [Assigned] (SPARK-18425) Test `CompactibleFileStreamLog` directly
[ https://issues.apache.org/jira/browse/SPARK-18425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18425:
------------------------------------

    Assignee: Apache Spark

> Test `CompactibleFileStreamLog` directly
[jira] [Created] (SPARK-18425) Test `CompactibleFileStreamLog` directly
Liwei Lin created SPARK-18425:
---------------------------------

             Summary: Test `CompactibleFileStreamLog` directly
                 Key: SPARK-18425
                 URL: https://issues.apache.org/jira/browse/SPARK-18425
             Project: Spark
          Issue Type: Test
          Components: Structured Streaming, Tests
    Affects Versions: 2.0.1
            Reporter: Liwei Lin
            Priority: Minor
[jira] [Commented] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd
[ https://issues.apache.org/jira/browse/SPARK-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15660837#comment-15660837 ]

Apache Spark commented on SPARK-18413:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/15868

> Add a property to control the number of partitions when save a jdbc rdd
> ------------------------------------------------------------------------
>
>                 Key: SPARK-18413
>                 URL: https://issues.apache.org/jira/browse/SPARK-18413
>             Project: Spark
>          Issue Type: Wish
>          Components: SQL
>    Affects Versions: 2.0.1
>            Reporter: lichenglin
>
> {code}
> CREATE or replace TEMPORARY VIEW resultview
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB",
>   dbtable "result",
>   user "HIVE",
>   password "HIVE"
> );
> -- set spark.sql.shuffle.partitions=200
> insert overwrite table resultview select g, count(1) as count from
> tnet.DT_LIVE_INFO group by g
> {code}
> I'm trying to save a Spark SQL result to Oracle, and I found that Spark
> creates a JDBC connection for each partition. If the SQL produces too many
> partitions, the database can't hold that many connections and throws an
> exception. In the situation above it is 200 partitions, because of the
> "group by" and "spark.sql.shuffle.partitions".
> The relevant Spark source (JdbcUtils) is:
> {code}
> def saveTable(
>     df: DataFrame,
>     url: String,
>     table: String,
>     properties: Properties) {
>   val dialect = JdbcDialects.get(url)
>   val nullTypes: Array[Int] = df.schema.fields.map { field =>
>     getJdbcType(field.dataType, dialect).jdbcNullType
>   }
>   val rddSchema = df.schema
>   val getConnection: () => Connection = createConnectionFactory(url, properties)
>   val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, "1000").toInt
>   df.foreachPartition { iterator =>
>     savePartition(getConnection, table, iterator, rddSchema, nullTypes,
>       batchSize, dialect)
>   }
> }
> {code}
> Maybe we can add a property for df.repartition(num).foreachPartition?
> In fact I got the exception "ORA-12519, TNS:no appropriate service handler
> found".
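Until such a property exists, the usual workaround is to cap the partition count on the caller's side before the JDBC write, since each partition opens one connection in `savePartition`. A Scala sketch of that workaround; the helper name and `maxConnections` parameter are made up for illustration, not a Spark configuration key:

```scala
import java.util.Properties
import org.apache.spark.sql.DataFrame

// Bound the number of simultaneous JDBC connections by bounding partitions.
def saveWithBoundedConnections(
    df: DataFrame,
    url: String,
    table: String,
    props: Properties,
    maxConnections: Int): Unit = {
  // coalesce (rather than repartition) avoids a full shuffle when we are
  // only reducing the partition count.
  val bounded =
    if (df.rdd.getNumPartitions > maxConnections) df.coalesce(maxConnections)
    else df
  bounded.write.mode("append").jdbc(url, table, props)
}
```

With `maxConnections` set below Oracle's session limit, the "group by" can still shuffle into 200 partitions internally while the final write stays within what the database will accept.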
[jira] [Assigned] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd
[ https://issues.apache.org/jira/browse/SPARK-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18413:
------------------------------------

    Assignee: Apache Spark

> Add a property to control the number of partitions when save a jdbc rdd
[jira] [Assigned] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd
[ https://issues.apache.org/jira/browse/SPARK-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18413:
------------------------------------

    Assignee:     (was: Apache Spark)

> Add a property to control the number of partitions when save a jdbc rdd
[jira] [Updated] (SPARK-18419) Fix JDBCOptions and DataSource to be case-insensitive for JDBCOptions keys
[ https://issues.apache.org/jira/browse/SPARK-18419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-18419:
----------------------------------
    Summary: Fix JDBCOptions and DataSource to be case-insensitive for JDBCOptions keys  (was: Fix JDBCOptions.asConnectionProperties to be case-insensitive)

> Fix JDBCOptions and DataSource to be case-insensitive for JDBCOptions keys
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-18419
>                 URL: https://issues.apache.org/jira/browse/SPARK-18419
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Dongjoon Hyun
>            Priority: Minor
>
> `JDBCOptions.asConnectionProperties` fails to filter a `CaseInsensitiveMap`
> correctly. For the following case, it wrongly returns
> `Map('numpartitions' -> "10")`.
> {code}
> val options = new JDBCOptions(new CaseInsensitiveMap(Map(
>   "url" -> "jdbc:mysql://localhost:3306/temp",
>   "dbtable" -> "t1",
>   "numPartitions" -> "10")))
> assert(options.asConnectionProperties.isEmpty)
> {code}
[jira] [Updated] (SPARK-18419) Fix JDBCOptions and DataSource to be case-insensitive for JDBCOptions keys
[ https://issues.apache.org/jira/browse/SPARK-18419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-18419:
----------------------------------
    Description:

This issue aims to fix the following.

**A. Fix `JDBCOptions.asConnectionProperties` to be case-insensitive.**

`JDBCOptions.asConnectionProperties` is designed to filter JDBC options out,
but it fails to handle a `CaseInsensitiveMap` correctly. For the following
example, it wrongly returns `Map('numpartitions' -> "10")` and the assertion
fails.

{code}
val options = new JDBCOptions(new CaseInsensitiveMap(Map(
  "url" -> "jdbc:mysql://localhost:3306/temp",
  "dbtable" -> "t1",
  "numPartitions" -> "10")))
assert(options.asConnectionProperties.isEmpty)
{code}

**B. Fix `DataSource` to use `CaseInsensitiveMap` consistently.**

`DataSource` only partially uses `CaseInsensitiveMap` in its code path. For
example, the following fails to find `url`.

{code}
val df = spark.createDataFrame(sparkContext.parallelize(arr2x2), schema2)
df.write.format("jdbc")
  .option("URL", url1)
  .option("dbtable", "TEST.SAVETEST")
  .options(properties.asScala)
  .save()
{code}

  was:
`JDBCOptions.asConnectionProperties` fails to filter `CaseInsensitiveMap`
correctly. For the following case, it returns `Map('numpartitions' -> "10")`
as a wrong result.
{code}
val options = new JDBCOptions(new CaseInsensitiveMap(Map(
  "url" -> "jdbc:mysql://localhost:3306/temp",
  "dbtable" -> "t1",
  "numPartitions" -> "10")))
assert(options.asConnectionProperties.isEmpty)
{code}

> Fix JDBCOptions and DataSource to be case-insensitive for JDBCOptions keys
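The `asConnectionProperties` bug boils down to filtering a lower-casing map using mixed-case key names. A self-contained Scala sketch of the pitfall and the fix; this is a toy stand-in, not Spark's actual `CaseInsensitiveMap` or `JDBCOptions`:

```scala
// Toy case-insensitive map: keys are stored lower-cased.
class ToyCaseInsensitiveMap(base: Map[String, String]) {
  private val lower = base.map { case (k, v) => k.toLowerCase -> v }
  def filterKeys(p: String => Boolean): Map[String, String] =
    lower.filter { case (k, _) => p(k) }
}

object Demo extends App {
  val options = new ToyCaseInsensitiveMap(Map(
    "url" -> "jdbc:mysql://localhost:3306/temp",
    "dbtable" -> "t1",
    "numPartitions" -> "10"))

  // Buggy: comparing against the mixed-case name never matches, because the
  // map stores keys lower-cased, so "numpartitions" leaks through the filter.
  val buggy = options.filterKeys(_ != "numPartitions")
  assert(buggy.contains("numpartitions"))

  // Fixed: normalize the excluded names the same way the map normalizes keys.
  val fixed = options.filterKeys(_ != "numPartitions".toLowerCase)
  assert(!fixed.contains("numpartitions"))
}
```

The same normalization discipline addresses part B: as long as every lookup path wraps the user's options in the case-insensitive map before reading keys like `url`, `.option("URL", ...)` and `.option("url", ...)` behave identically.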
[jira] [Updated] (SPARK-18418) Make release script hadoop profiles aren't correctly specified.
[ https://issues.apache.org/jira/browse/SPARK-18418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-18418:
-------------------------------
    Fix Version/s: 2.2.0

> Make release script hadoop profiles aren't correctly specified.
> ----------------------------------------------------------------
>
>                 Key: SPARK-18418
>                 URL: https://issues.apache.org/jira/browse/SPARK-18418
>             Project: Spark
>          Issue Type: Bug
>          Components: Build, Project Infra
>    Affects Versions: 2.1.0
>            Reporter: holdenk
>            Assignee: holdenk
>            Priority: Critical
>             Fix For: 2.1.0, 2.2.0
>
> Split from https://github.com/apache/spark/pull/15659/files
[jira] [Resolved] (SPARK-18418) Make release script hadoop profiles aren't correctly specified.
[ https://issues.apache.org/jira/browse/SPARK-18418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen resolved SPARK-18418.
--------------------------------
       Resolution: Fixed
    Fix Version/s: 2.1.0

Issue resolved by pull request 15860
[https://github.com/apache/spark/pull/15860]

> Make release script hadoop profiles aren't correctly specified.
[jira] [Updated] (SPARK-18418) Make release script hadoop profiles aren't correctly specified.
[ https://issues.apache.org/jira/browse/SPARK-18418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-18418:
-------------------------------
             Assignee: holdenk
    Affects Version/s: 2.1.0
     Target Version/s: 2.1.0, 2.2.0
             Priority: Critical  (was: Major)

> Make release script hadoop profiles aren't correctly specified.
[jira] [Commented] (SPARK-18418) Make release script hadoop profiles aren't correctly specified.
[ https://issues.apache.org/jira/browse/SPARK-18418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15660416#comment-15660416 ]

Josh Rosen commented on SPARK-18418:
------------------------------------

For reference: this patch fixes a bug which was introduced in SPARK-16967 and
affects both {{master}} and {{branch-2.1}}.

> Make release script hadoop profiles aren't correctly specified.
[jira] [Updated] (SPARK-18418) Make release script hadoop profiles aren't correctly specified.
[ https://issues.apache.org/jira/browse/SPARK-18418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-18418:
-------------------------------
    Component/s: Project Infra

> Make release script hadoop profiles aren't correctly specified.
[jira] [Comment Edited] (SPARK-18424) Improve Date Parsing Functionality
[ https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15660291#comment-15660291 ]

Bill Chambers edited comment on SPARK-18424 at 11/12/16 10:09 PM:
------------------------------------------------------------------

For the record, I would like to work on this one.

Define the function here:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala

Register the function here:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala

Add tests here:
https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala
and here:
https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala

was (Author: bill_chambers): the same list, with "Register Function here: ?"

> Improve Date Parsing Functionality
> ----------------------------------
>
>                 Key: SPARK-18424
>                 URL: https://issues.apache.org/jira/browse/SPARK-18424
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Bill Chambers
>            Priority: Minor
>
> I've found it quite cumbersome to work with dates in Spark so far; it can
> be hard to reason about the time format and what type you're working with.
> For instance, say that I have a date in the format
> {code}
> 2017-20-12
> // Y-D-M
> {code}
> In order to parse that into a Date, I have to perform several conversions:
> {code}
> to_date(
>   unix_timestamp(col("date"), dateFormat)
>     .cast("timestamp"))
>   .alias("date")
> {code}
> I propose simplifying this by keeping the existing to_date function but
> adding a variant that accepts a format for the date. I also propose a
> to_timestamp function that likewise supports a format, so that the
> conversion above can be avoided entirely.
> It's also worth mentioning that many other databases support this: for
> instance, MySQL has the STR_TO_DATE function, and Netezza supports the
> to_timestamp semantics.
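Until a format-aware `to_date` lands, the multi-step conversion from the description can at least be packaged once on the user side. A small Scala sketch built only from functions the issue already uses; `toDateWithFormat` is our name for illustration, not a Spark API:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, to_date, unix_timestamp}

// Wrap the parse-then-cast dance behind the ergonomics the issue proposes:
// parse the string with an explicit format, cast to timestamp, truncate to date.
def toDateWithFormat(c: Column, fmt: String): Column =
  to_date(unix_timestamp(c, fmt).cast("timestamp"))

// Usage, for the Y-D-M example from the description:
// df.select(toDateWithFormat(col("date"), "yyyy-dd-MM").alias("date"))
```

This keeps call sites readable today and makes the eventual migration to a built-in `to_date(col, format)` a mechanical rename.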
[jira] [Comment Edited] (SPARK-18424) Improve Date Parsing Functionality
[ https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15660291#comment-15660291 ]

Bill Chambers edited comment on SPARK-18424 at 11/12/16 9:31 PM:
-----------------------------------------------------------------

For the record, I would like to work on this one. Define the function in
functions.scala. Register the function here: ? Add tests in
DateFunctionsSuite.scala and DateExpressionsSuite.scala.

was (Author: bill_chambers): For the record I would like to work on this one.
It seems that I will have to add some tests in DateFunctionsSuite.scala and
DateExpressionsSuite.scala.

> Improve Date Parsing Functionality
[jira] [Comment Edited] (SPARK-18424) Improve Date Parsing Functionality
[ https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15660291#comment-15660291 ]

Bill Chambers edited comment on SPARK-18424 at 11/12/16 9:30 PM:
-----------------------------------------------------------------

For the record, I would like to work on this one. It seems that I will have
to add some tests here:
https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala
and here:
https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala

was (Author: bill_chambers): For the record I would like to work on this one.

> Improve Date Parsing Functionality
[jira] [Commented] (SPARK-18424) Improve Date Parsing Functionality
[ https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15660291#comment-15660291 ]

Bill Chambers commented on SPARK-18424:
---------------------------------------

For the record, I would like to work on this one.

> Improve Date Parsing Functionality
[jira] [Updated] (SPARK-18424) Improve Date Parsing Functionality
[ https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bill Chambers updated SPARK-18424:
----------------------------------
    Summary: Improve Date Parsing Functionality  (was: Cumbersome Date Manipulation)

> Improve Date Parsing Functionality
[jira] [Updated] (SPARK-18424) Improve Date Parsing Functionality
[ https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bill Chambers updated SPARK-18424:
----------------------------------
    Description: updated to additionally propose a to_timestamp function that
also supports a format, and to note that many other databases support this
(for instance, MySQL has the STR_TO_DATE function, and Netezza supports the
to_timestamp semantics); otherwise unchanged from the original description.

> Improve Date Parsing Functionality
[jira] [Created] (SPARK-18424) Cumbersome Date Manipulation
Bill Chambers created SPARK-18424: - Summary: Cumbersome Date Manipulation Key: SPARK-18424 URL: https://issues.apache.org/jira/browse/SPARK-18424 Project: Spark Issue Type: Improvement Reporter: Bill Chambers Priority: Minor I've found it quite cumbersome to work with dates thus far in Spark; it can be hard to reason about the time format and what type you're working with. For instance, say that I have a date in the format {code} 2017-20-12 // Y-D-M {code} In order to parse that into a Date, I have to perform several conversions. {code} to_date( unix_timestamp(col("date"), dateFormat) .cast("timestamp")) .alias("date") {code} I propose simplifying this by keeping the existing to_date function but adding a variant that accepts a format for that date, so that the above conversion can be avoided entirely. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
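The conversion chain in the report can be illustrated outside Spark. Below is a minimal Python sketch of the format-driven parsing the issue proposes; `to_date_with_format` is a hypothetical helper (the proposed `to_date(column, format)` overload did not exist at the time of this issue), using Python's strptime directives in place of Spark's date-format strings.

```python
from datetime import datetime, date

# Hypothetical helper mirroring the proposed to_date-with-format variant:
# parse a string into a date with an explicit format, instead of chaining
# unix_timestamp -> cast("timestamp") -> to_date.
def to_date_with_format(value: str, fmt: str) -> date:
    return datetime.strptime(value, fmt).date()

# The Y-D-M example from the report: year, then day, then month.
parsed = to_date_with_format("2017-20-12", "%Y-%d-%m")
print(parsed)  # 2017-12-20
```

A single call with an explicit format replaces the three-step conversion, which is the simplification being requested.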
[jira] [Assigned] (SPARK-18423) ReceiverTracker should close checkpoint dir when stopped even if it was not started
[ https://issues.apache.org/jira/browse/SPARK-18423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18423: Assignee: (was: Apache Spark) > ReceiverTracker should close checkpoint dir when stopped even if it was not > started > --- > > Key: SPARK-18423 > URL: https://issues.apache.org/jira/browse/SPARK-18423 > Project: Spark > Issue Type: Sub-task > Components: DStreams >Reporter: Hyukjin Kwon > > {code} > Running org.apache.spark.streaming.JavaAPISuite > Tests run: 53, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 22.325 sec > <<< FAILURE! - in org.apache.spark.streaming.JavaAPISuite > testCheckpointMasterRecovery(org.apache.spark.streaming.JavaAPISuite) Time > elapsed: 3.418 sec <<< ERROR! > java.io.IOException: Failed to delete: > C:\projects\spark\streaming\target\tmp\1474255953021-0 > at > org.apache.spark.streaming.JavaAPISuite.testCheckpointMasterRecovery(JavaAPISuite.java:1808) > {code} > {code} > mapWithState - basic operations with simple API (7 seconds, 203 milliseconds) > [info] Exception encountered when attempting to run a suite with class name: > org.apache.spark.streaming.MapWithStateSuite *** ABORTED *** (12 seconds, 469 > milliseconds) > [info] java.io.IOException: Failed to delete: > C:\projects\spark\streaming\checkpoint\spark-226c0e37-8c46-4b2a-9c0f-2317cde31d40 > {code} > These tests seem to be caused by unclosed files in {{ReceiverTracker}}. Please > refer to the discussion in > https://github.com/apache/spark/pull/15618#issuecomment-259660817 > The root cause is that the tracker is created and stopped without ever being > started. In this case, `ReceiverTracker` does not close the checkpoint dir. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18423) ReceiverTracker should close checkpoint dir when stopped even if it was not started
[ https://issues.apache.org/jira/browse/SPARK-18423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15660245#comment-15660245 ] Apache Spark commented on SPARK-18423: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/15867 > ReceiverTracker should close checkpoint dir when stopped even if it was not > started > --- > > Key: SPARK-18423 > URL: https://issues.apache.org/jira/browse/SPARK-18423 > Project: Spark > Issue Type: Sub-task > Components: DStreams >Reporter: Hyukjin Kwon > > {code} > Running org.apache.spark.streaming.JavaAPISuite > Tests run: 53, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 22.325 sec > <<< FAILURE! - in org.apache.spark.streaming.JavaAPISuite > testCheckpointMasterRecovery(org.apache.spark.streaming.JavaAPISuite) Time > elapsed: 3.418 sec <<< ERROR! > java.io.IOException: Failed to delete: > C:\projects\spark\streaming\target\tmp\1474255953021-0 > at > org.apache.spark.streaming.JavaAPISuite.testCheckpointMasterRecovery(JavaAPISuite.java:1808) > {code} > {code} > mapWithState - basic operations with simple API (7 seconds, 203 milliseconds) > [info] Exception encountered when attempting to run a suite with class name: > org.apache.spark.streaming.MapWithStateSuite *** ABORTED *** (12 seconds, 469 > milliseconds) > [info] java.io.IOException: Failed to delete: > C:\projects\spark\streaming\checkpoint\spark-226c0e37-8c46-4b2a-9c0f-2317cde31d40 > {code} > These tests seem to be caused by unclosed files in {{ReceiverTracker}}. Please > refer to the discussion in > https://github.com/apache/spark/pull/15618#issuecomment-259660817 > The root cause is that the tracker is created and stopped without ever being > started. In this case, `ReceiverTracker` does not close the checkpoint dir. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18423) ReceiverTracker should close checkpoint dir when stopped even if it was not started
[ https://issues.apache.org/jira/browse/SPARK-18423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18423: Assignee: Apache Spark > ReceiverTracker should close checkpoint dir when stopped even if it was not > started > --- > > Key: SPARK-18423 > URL: https://issues.apache.org/jira/browse/SPARK-18423 > Project: Spark > Issue Type: Sub-task > Components: DStreams >Reporter: Hyukjin Kwon >Assignee: Apache Spark > > {code} > Running org.apache.spark.streaming.JavaAPISuite > Tests run: 53, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 22.325 sec > <<< FAILURE! - in org.apache.spark.streaming.JavaAPISuite > testCheckpointMasterRecovery(org.apache.spark.streaming.JavaAPISuite) Time > elapsed: 3.418 sec <<< ERROR! > java.io.IOException: Failed to delete: > C:\projects\spark\streaming\target\tmp\1474255953021-0 > at > org.apache.spark.streaming.JavaAPISuite.testCheckpointMasterRecovery(JavaAPISuite.java:1808) > {code} > {code} > mapWithState - basic operations with simple API (7 seconds, 203 milliseconds) > [info] Exception encountered when attempting to run a suite with class name: > org.apache.spark.streaming.MapWithStateSuite *** ABORTED *** (12 seconds, 469 > milliseconds) > [info] java.io.IOException: Failed to delete: > C:\projects\spark\streaming\checkpoint\spark-226c0e37-8c46-4b2a-9c0f-2317cde31d40 > {code} > These tests seem to be caused by unclosed files in {{ReceiverTracker}}. Please > refer to the discussion in > https://github.com/apache/spark/pull/15618#issuecomment-259660817 > The root cause is that the tracker is created and stopped without ever being > started. In this case, `ReceiverTracker` does not close the checkpoint dir. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15660237#comment-15660237 ] Saikat Kanjilal commented on SPARK-9487: Understood. I wanted a fresh look at this from a different dev environment, so on my MacBook Pro I tried changing the setting to local[2] and local[4] for JavaAPISuite; both fail, so yes, mimicking the real Jenkins failure will be hard. Should I close this pull request until this is fixed and resubmit a new one? I have no idea at this point how long debugging or even replicating this will take. Thoughts on a suitable set of next steps? > Use the same num. worker threads in Scala/Python unit tests > --- > > Key: SPARK-9487 > URL: https://issues.apache.org/jira/browse/SPARK-9487 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL, Tests >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng > Labels: starter > Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults > > > In Python we use `local[4]` for unit tests, while in Scala/Java we use > `local[2]` and `local` for some unit tests in SQL, MLLib, and other > components. If the operation depends on partition IDs, e.g., random number > generators, this will lead to different results in Python and Scala/Java. It > would be nice to use the same number in all unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
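The issue's point that partition-ID-dependent operations give different results under `local[2]` versus `local[4]` can be sketched without Spark. The following plain-Python simulation (an illustrative stand-in, not Spark's actual execution model) seeds a random generator per partition index, the pattern the issue calls out:

```python
import random

def run_job(data, num_partitions):
    """Simulate a computation whose per-element result depends on which
    partition the element lands in: each partition seeds its own RNG
    with the partition index, as partition-ID-seeded Spark code does."""
    size = -(-len(data) // num_partitions)  # ceiling division
    out = []
    for pid in range(num_partitions):
        rng = random.Random(pid)  # seed depends on partition ID
        for x in data[pid * size:(pid + 1) * size]:
            out.append(x + rng.random())
    return out

data = list(range(8))
# Same input, different partition counts -> different outputs, which is why
# the Scala/Java and Python suites should agree on the worker-thread count.
print(run_job(data, 2) == run_job(data, 4))  # False
```

This is the behavior the ticket wants to rule out by standardizing the number of worker threads across test suites.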
[jira] [Updated] (SPARK-18423) ReceiverTracker should close checkpoint dir when stopped even if it was not started
[ https://issues.apache.org/jira/browse/SPARK-18423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-18423: - Description: {code} Running org.apache.spark.streaming.JavaAPISuite Tests run: 53, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 22.325 sec <<< FAILURE! - in org.apache.spark.streaming.JavaAPISuite testCheckpointMasterRecovery(org.apache.spark.streaming.JavaAPISuite) Time elapsed: 3.418 sec <<< ERROR! java.io.IOException: Failed to delete: C:\projects\spark\streaming\target\tmp\1474255953021-0 at org.apache.spark.streaming.JavaAPISuite.testCheckpointMasterRecovery(JavaAPISuite.java:1808) {code} {code} mapWithState - basic operations with simple API (7 seconds, 203 milliseconds) [info] Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.MapWithStateSuite *** ABORTED *** (12 seconds, 469 milliseconds) [info] java.io.IOException: Failed to delete: C:\projects\spark\streaming\checkpoint\spark-226c0e37-8c46-4b2a-9c0f-2317cde31d40 {code} These tests seem to be caused by unclosed files in {{ReceiverTracker}}. Please refer to the discussion in https://github.com/apache/spark/pull/15618#issuecomment-259660817 The root cause is that the tracker is created and stopped without ever being started. In this case, `ReceiverTracker` does not close the checkpoint dir. was: {code} Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.062 sec <<< FAILURE! - in org.apache.spark.launcher.SparkLauncherSuite testChildProcLauncher(org.apache.spark.launcher.SparkLauncherSuite) Time elapsed: 0.047 sec <<< FAILURE! java.lang.AssertionError: expected:<0> but was:<1> at org.apache.spark.launcher.SparkLauncherSuite.testChildProcLauncher(SparkLauncherSuite.java:177) {code} {code} Tests run: 53, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 22.325 sec <<< FAILURE! - in org.apache.spark.streaming.JavaAPISuite testCheckpointMasterRecovery(org.apache.spark.streaming.JavaAPISuite) Time elapsed: 3.418 sec <<< ERROR! 
java.io.IOException: Failed to delete: C:\projects\spark\streaming\target\tmp\1474255953021-0 at org.apache.spark.streaming.JavaAPISuite.testCheckpointMasterRecovery(JavaAPISuite.java:1808) Running org.apache.spark.streaming.JavaDurationSuite {code} {code} Running org.apache.spark.streaming.JavaAPISuite Tests run: 53, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 22.325 sec <<< FAILURE! - in org.apache.spark.streaming.JavaAPISuite testCheckpointMasterRecovery(org.apache.spark.streaming.JavaAPISuite) Time elapsed: 3.418 sec <<< ERROR! java.io.IOException: Failed to delete: C:\projects\spark\streaming\target\tmp\1474255953021-0 at org.apache.spark.streaming.JavaAPISuite.testCheckpointMasterRecovery(JavaAPISuite.java:1808) {code} {code} mapWithState - basic operations with simple API (7 seconds, 203 milliseconds) [info] Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.MapWithStateSuite *** ABORTED *** (12 seconds, 469 milliseconds) [info] java.io.IOException: Failed to delete: C:\projects\spark\streaming\checkpoint\spark-226c0e37-8c46-4b2a-9c0f-2317cde31d40 {code} These tests seem to be caused by unclosed files in {{ReceiverTracker}}. Please refer to the discussion in https://github.com/apache/spark/pull/15618#issuecomment-259660817 The root cause is that the tracker is created and stopped without ever being started. In this case, `ReceiverTracker` does not close the checkpoint dir. > ReceiverTracker should close checkpoint dir when stopped even if it was not > started > --- > > Key: SPARK-18423 > URL: https://issues.apache.org/jira/browse/SPARK-18423 > Project: Spark > Issue Type: Sub-task > Components: DStreams >Reporter: Hyukjin Kwon > > {code} > Running org.apache.spark.streaming.JavaAPISuite > Tests run: 53, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 22.325 sec > <<< FAILURE! - in org.apache.spark.streaming.JavaAPISuite > testCheckpointMasterRecovery(org.apache.spark.streaming.JavaAPISuite) Time > elapsed: 3.418 sec <<< ERROR! 
> java.io.IOException: Failed to delete: > C:\projects\spark\streaming\target\tmp\1474255953021-0 > at > org.apache.spark.streaming.JavaAPISuite.testCheckpointMasterRecovery(JavaAPISuite.java:1808) > {code} > {code} > mapWithState - basic operations with simple API (7 seconds, 203 milliseconds) > [info] Exception encountered when attempting to run a suite with class name: > org.apache.spark.streaming.MapWithStateSuite *** ABORTED *** (12 seconds, 469 > milliseconds) > [info] java.io.IOException: Failed to delete: > C:\projects\spark\streaming\checkpoint\spark-226c0e37-8c46-4b2a-9c0f-2317cde31d40 > {code} > These tests seem to be caused by unclosed files in {{ReceiverTracker}}. Please > refer to the discussion in > https://github.com/apache/spark/pull/15618#issuecomment-259660817 > The root cause is that the tracker is created and stopped without ever being > started. In this case, `ReceiverTracker` does not close the checkpoint dir. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18423) ReceiverTracker should close checkpoint dir when stopped even if it was not started
Hyukjin Kwon created SPARK-18423: Summary: ReceiverTracker should close checkpoint dir when stopped even if it was not started Key: SPARK-18423 URL: https://issues.apache.org/jira/browse/SPARK-18423 Project: Spark Issue Type: Sub-task Components: DStreams Reporter: Hyukjin Kwon {code} Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.062 sec <<< FAILURE! - in org.apache.spark.launcher.SparkLauncherSuite testChildProcLauncher(org.apache.spark.launcher.SparkLauncherSuite) Time elapsed: 0.047 sec <<< FAILURE! java.lang.AssertionError: expected:<0> but was:<1> at org.apache.spark.launcher.SparkLauncherSuite.testChildProcLauncher(SparkLauncherSuite.java:177) {code} {code} Tests run: 53, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 22.325 sec <<< FAILURE! - in org.apache.spark.streaming.JavaAPISuite testCheckpointMasterRecovery(org.apache.spark.streaming.JavaAPISuite) Time elapsed: 3.418 sec <<< ERROR! java.io.IOException: Failed to delete: C:\projects\spark\streaming\target\tmp\1474255953021-0 at org.apache.spark.streaming.JavaAPISuite.testCheckpointMasterRecovery(JavaAPISuite.java:1808) Running org.apache.spark.streaming.JavaDurationSuite {code} {code} Running org.apache.spark.streaming.JavaAPISuite Tests run: 53, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 22.325 sec <<< FAILURE! - in org.apache.spark.streaming.JavaAPISuite testCheckpointMasterRecovery(org.apache.spark.streaming.JavaAPISuite) Time elapsed: 3.418 sec <<< ERROR! 
java.io.IOException: Failed to delete: C:\projects\spark\streaming\target\tmp\1474255953021-0 at org.apache.spark.streaming.JavaAPISuite.testCheckpointMasterRecovery(JavaAPISuite.java:1808) {code} {code} mapWithState - basic operations with simple API (7 seconds, 203 milliseconds) [info] Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.MapWithStateSuite *** ABORTED *** (12 seconds, 469 milliseconds) [info] java.io.IOException: Failed to delete: C:\projects\spark\streaming\checkpoint\spark-226c0e37-8c46-4b2a-9c0f-2317cde31d40 {code} These tests seem to be caused by unclosed files in {{ReceiverTracker}}. Please refer to the discussion in https://github.com/apache/spark/pull/15618#issuecomment-259660817 The root cause is that the tracker is created and stopped without ever being started. In this case, `ReceiverTracker` does not close the checkpoint dir. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
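The lifecycle fix described above — release resources on stop() even when start() was never called — can be sketched in a few lines. This is an illustrative Python stand-in with hypothetical class names, not Spark's actual ReceiverTracker code:

```python
class CheckpointDir:
    """Illustrative resource opened by the tracker at construction time."""
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

class ReceiverTrackerSketch:
    """Minimal stand-in for ReceiverTracker's lifecycle."""
    def __init__(self):
        self.checkpoint = CheckpointDir()  # opened eagerly, before start()
        self.started = False
    def start(self):
        self.started = True
    def stop(self):
        # The fix: close resources unconditionally, not only if started,
        # so a create-then-stop sequence leaves no open file handles.
        self.checkpoint.close()

tracker = ReceiverTrackerSketch()
tracker.stop()  # created and stopped without ever starting
print(tracker.checkpoint.closed)  # True
```

Guarding the close behind `if self.started:` would reproduce the Windows failure above: the open handle keeps the temp directory from being deleted.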
[jira] [Assigned] (SPARK-18422) Fix wholeTextFiles test to pass on Windows in JavaAPISuite
[ https://issues.apache.org/jira/browse/SPARK-18422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18422: Assignee: Apache Spark > Fix wholeTextFiles test to pass on Windows in JavaAPISuite > -- > > Key: SPARK-18422 > URL: https://issues.apache.org/jira/browse/SPARK-18422 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Minor > > {code} > Tests run: 90, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 17.53 sec > <<< FAILURE! - in org.apache.spark.JavaAPISuite > wholeTextFiles(org.apache.spark.JavaAPISuite) Time elapsed: 0.313 sec <<< > FAILURE! > java.lang.AssertionError: > expected: > but was: > at org.apache.spark.JavaAPISuite.wholeTextFiles(JavaAPISuite.java:1089) > {code} > The test failure in {{JavaAPISuite}} was due to different path format on > Windows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18422) Fix wholeTextFiles test to pass on Windows in JavaAPISuite
[ https://issues.apache.org/jira/browse/SPARK-18422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15660163#comment-15660163 ] Apache Spark commented on SPARK-18422: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/15866 > Fix wholeTextFiles test to pass on Windows in JavaAPISuite > -- > > Key: SPARK-18422 > URL: https://issues.apache.org/jira/browse/SPARK-18422 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests >Reporter: Hyukjin Kwon >Priority: Minor > > {code} > Tests run: 90, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 17.53 sec > <<< FAILURE! - in org.apache.spark.JavaAPISuite > wholeTextFiles(org.apache.spark.JavaAPISuite) Time elapsed: 0.313 sec <<< > FAILURE! > java.lang.AssertionError: > expected: > but was: > at org.apache.spark.JavaAPISuite.wholeTextFiles(JavaAPISuite.java:1089) > {code} > The test failure in {{JavaAPISuite}} was due to different path format on > Windows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18422) Fix wholeTextFiles test to pass on Windows in JavaAPISuite
[ https://issues.apache.org/jira/browse/SPARK-18422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18422: Assignee: (was: Apache Spark) > Fix wholeTextFiles test to pass on Windows in JavaAPISuite > -- > > Key: SPARK-18422 > URL: https://issues.apache.org/jira/browse/SPARK-18422 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests >Reporter: Hyukjin Kwon >Priority: Minor > > {code} > Tests run: 90, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 17.53 sec > <<< FAILURE! - in org.apache.spark.JavaAPISuite > wholeTextFiles(org.apache.spark.JavaAPISuite) Time elapsed: 0.313 sec <<< > FAILURE! > java.lang.AssertionError: > expected: > but was: > at org.apache.spark.JavaAPISuite.wholeTextFiles(JavaAPISuite.java:1089) > {code} > The test failure in {{JavaAPISuite}} was due to different path format on > Windows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18422) Fix wholeTextFiles test to pass on Windows in JavaAPISuite
[ https://issues.apache.org/jira/browse/SPARK-18422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-18422: - Component/s: Spark Core > Fix wholeTextFiles test to pass on Windows in JavaAPISuite > -- > > Key: SPARK-18422 > URL: https://issues.apache.org/jira/browse/SPARK-18422 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests >Reporter: Hyukjin Kwon >Priority: Minor > > {code} > Tests run: 90, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 17.53 sec > <<< FAILURE! - in org.apache.spark.JavaAPISuite > wholeTextFiles(org.apache.spark.JavaAPISuite) Time elapsed: 0.313 sec <<< > FAILURE! > java.lang.AssertionError: > expected: > but was: > at org.apache.spark.JavaAPISuite.wholeTextFiles(JavaAPISuite.java:1089) > {code} > The test failure in {{JavaAPISuite}} was due to different path format on > Windows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18422) Fix wholeTextFiles test to pass on Windows in JavaAPISuite
Hyukjin Kwon created SPARK-18422: Summary: Fix wholeTextFiles test to pass on Windows in JavaAPISuite Key: SPARK-18422 URL: https://issues.apache.org/jira/browse/SPARK-18422 Project: Spark Issue Type: Sub-task Components: Tests Reporter: Hyukjin Kwon Priority: Minor {code} Tests run: 90, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 17.53 sec <<< FAILURE! - in org.apache.spark.JavaAPISuite wholeTextFiles(org.apache.spark.JavaAPISuite) Time elapsed: 0.313 sec <<< FAILURE! java.lang.AssertionError: expected: but was: at org.apache.spark.JavaAPISuite.wholeTextFiles(JavaAPISuite.java:1089) {code} The test failure in {{JavaAPISuite}} was due to different path format on Windows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
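The fix described — a test failing only because Windows renders paths with backslashes — usually comes down to normalizing paths before comparison. A minimal Python sketch of that idea (illustrative helper, not the actual patch in JavaAPISuite):

```python
import posixpath

def normalize_path(p: str) -> str:
    """Compare paths in a platform-neutral form by converting Windows
    backslashes to forward slashes before normalizing."""
    return posixpath.normpath(p.replace("\\", "/"))

# Hypothetical example paths: the same location as a test on Windows vs.
# Linux might see it.
windows_style = "C:\\projects\\spark\\target\\tmp\\wholefiles"
portable = "C:/projects/spark/target/tmp/wholefiles"
print(normalize_path(windows_style) == normalize_path(portable))  # True
```

Asserting on normalized forms makes the expected/actual comparison in the test independent of the host OS's separator.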
[jira] [Updated] (SPARK-18365) Improve Documentation for Sample Methods
[ https://issues.apache.org/jira/browse/SPARK-18365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Chambers updated SPARK-18365: -- Description: The documentation for sample is a little unintuitive. It was difficult to understand why I wasn't getting exactly the specified fraction of my total DataFrame rows. The PR clarifies the documentation for Scala, Python, and R to explain that this is expected behavior. (was: The parameter documentation is switched. PR coming shortly.) > Improve Documentation for Sample Methods > > > Key: SPARK-18365 > URL: https://issues.apache.org/jira/browse/SPARK-18365 > Project: Spark > Issue Type: Bug >Reporter: Bill Chambers > > The documentation for sample is a little unintuitive. It was difficult to > understand why I wasn't getting exactly the specified fraction of my total > DataFrame rows. The PR clarifies the documentation for Scala, Python, and R > to explain that this is expected behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
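The "not exactly the fraction" behavior comes from per-row (Bernoulli) sampling: each row is kept independently with the given probability, so the returned count is only approximately fraction × row count. A small Python sketch of that sampling model (illustrative, not Spark's sampler implementation):

```python
import random

def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`.
    The result size is binomially distributed around fraction * len(rows),
    not an exact slice of that size."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fraction]

rows = list(range(10_000))
sampled = bernoulli_sample(rows, 0.1, seed=42)
# len(sampled) is close to 1000 but generally not exactly 1000.
print(len(sampled))
```

This is the expected behavior the updated documentation explains: the fraction is a per-row probability, not an exact output size.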
[jira] [Updated] (SPARK-18365) Improve Documentation for Sample Methods
[ https://issues.apache.org/jira/browse/SPARK-18365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Chambers updated SPARK-18365: -- Summary: Improve Documentation for Sample Methods (was: Improve Documentation for Sample Method) > Improve Documentation for Sample Methods > > > Key: SPARK-18365 > URL: https://issues.apache.org/jira/browse/SPARK-18365 > Project: Spark > Issue Type: Bug >Reporter: Bill Chambers > > The parameter documentation is switched. > PR coming shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18207) class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-18207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15659978#comment-15659978 ] Don Drake commented on SPARK-18207: --- Hi, I was able to download a nightly SNAPSHOT release and verify that this resolves the issue for my project. Thanks to everyone who contributed to this fix and getting it merged in a timely manner. > class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" > grows beyond 64 KB > > > Key: SPARK-18207 > URL: https://issues.apache.org/jira/browse/SPARK-18207 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.0.1, 2.1.0 >Reporter: Don Drake >Assignee: Kazuaki Ishizaki > Fix For: 2.1.0 > > Attachments: spark-18207.txt > > > I have 2 wide dataframes that contain nested data structures, when I explode > one of the dataframes, it doesn't include records with an empty nested > structure (outer explode not supported). So, I create a similar dataframe > with null values and union them together. See SPARK-13721 for more details > as to why I have to do this. > I was hoping that SPARK-16845 was going to address my issue, but it does not. > I was asked by [~lwlin] to open this JIRA. > I will attach a code snippet that can be pasted into spark-shell that > duplicates my code and the exception. This worked just fine in Spark 1.6.x. 
> {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 35 in > stage 5.0 failed 4 times, most recent failure: Lost task 35.3 in stage 5.0 > (TID 812, somehost.mydomain.com, executor 8): > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.janino.JaninoRuntimeException: Code of method > "apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;" > of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" > grows beyond 64 KB > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18421) Dynamic disk allocation
Aniket Bhatnagar created SPARK-18421: Summary: Dynamic disk allocation Key: SPARK-18421 URL: https://issues.apache.org/jira/browse/SPARK-18421 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 2.0.1 Reporter: Aniket Bhatnagar Priority: Minor The dynamic allocation feature allows you to add executors and scale computation power. This is great; however, I feel we also need a way to dynamically scale storage. Currently, if the disk is not able to hold the spilled/shuffle data, the job is aborted (in YARN, the node manager kills the container), causing frustration and loss of time. In deployments like AWS EMR, it is possible to run an agent that adds disks on the fly if it sees that the disks are running out of space, and it would be great if Spark could immediately start using the added disks just as it does when new executors are added. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16827) Stop reporting spill metrics as shuffle metrics
[ https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15659972#comment-15659972 ] Gaoxiang Liu commented on SPARK-16827: -- ping ping.. > Stop reporting spill metrics as shuffle metrics > --- > > Key: SPARK-16827 > URL: https://issues.apache.org/jira/browse/SPARK-16827 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 2.0.0 >Reporter: Sital Kedia >Assignee: Brian Cho > Labels: performance > > One of our Hive jobs, which looks like this - > {code} > SELECT userid > FROM table1 a > JOIN table2 b > ON a.ds = '2016-07-15' > AND b.ds = '2016-07-15' > AND a.source_id = b.id > {code} > After upgrading to Spark 2.0, the job is significantly slower. Digging a little > into it, we found out that one of the stages produces an excessive amount of > shuffle data. Please note that this is a regression from Spark 1.6. Stage 2 > of the job, which used to produce 32KB of shuffle data with 1.6, now produces > more than 400GB with Spark 2.0. We also tried turning off whole-stage code > generation but that did not help. > PS - Even if the intermediate shuffle data size is huge, the job still > produces accurate output. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18420) Fix the compile errors caused by checkstyle
[ https://issues.apache.org/jira/browse/SPARK-18420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18420: Assignee: (was: Apache Spark) > Fix the compile errors caused by checkstyle > --- > > Key: SPARK-18420 > URL: https://issues.apache.org/jira/browse/SPARK-18420 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.1 >Reporter: coneyliu >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18420) Fix the compile errors caused by checkstyle
[ https://issues.apache.org/jira/browse/SPARK-18420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15659794#comment-15659794 ] Apache Spark commented on SPARK-18420: -- User 'ConeyLiu' has created a pull request for this issue: https://github.com/apache/spark/pull/15865 > Fix the compile errors caused by checkstyle > --- > > Key: SPARK-18420 > URL: https://issues.apache.org/jira/browse/SPARK-18420 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.1 >Reporter: coneyliu >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18420) Fix the compile errors caused by checkstyle
[ https://issues.apache.org/jira/browse/SPARK-18420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15659792#comment-15659792 ] Apache Spark commented on SPARK-18420: -- User 'ConeyLiu' has created a pull request for this issue: https://github.com/apache/spark/pull/15864 > Fix the compile errors caused by checkstyle > --- > > Key: SPARK-18420 > URL: https://issues.apache.org/jira/browse/SPARK-18420 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.1 >Reporter: coneyliu >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18420) Fix the compile errors caused by checkstyle
[ https://issues.apache.org/jira/browse/SPARK-18420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18420: Assignee: Apache Spark > Fix the compile errors caused by checkstyle > --- > > Key: SPARK-18420 > URL: https://issues.apache.org/jira/browse/SPARK-18420 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.1 >Reporter: coneyliu >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18420) Fix the compile errors caused by checkstyle
[ https://issues.apache.org/jira/browse/SPARK-18420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15659788#comment-15659788 ] coneyliu commented on SPARK-18420: -- Fix the compile errors caused by checkstyle > Fix the compile errors caused by checkstyle > --- > > Key: SPARK-18420 > URL: https://issues.apache.org/jira/browse/SPARK-18420 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.1 >Reporter: coneyliu >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18420) Fix the compile errors caused by checkstyle
coneyliu created SPARK-18420: Summary: Fix the compile errors caused by checkstyle Key: SPARK-18420 URL: https://issues.apache.org/jira/browse/SPARK-18420 Project: Spark Issue Type: Improvement Affects Versions: 2.0.1 Reporter: coneyliu Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18419) Fix JDBCOptions.asConnectionProperties to be case-insensitive
[ https://issues.apache.org/jira/browse/SPARK-18419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15659688#comment-15659688 ] Apache Spark commented on SPARK-18419: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/15863 > Fix JDBCOptions.asConnectionProperties to be case-insensitive > -- > > Key: SPARK-18419 > URL: https://issues.apache.org/jira/browse/SPARK-18419 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Dongjoon Hyun >Priority: Minor > > `JDBCOptions.asConnectionProperties` fails to filter `CaseInsensitiveMap` > correctly. For the following case, it returns `Map('numpartitions' -> "10")` > as a wrong result. > {code} > val options = new JDBCOptions(new CaseInsensitiveMap(Map( > "url" -> "jdbc:mysql://localhost:3306/temp", > "dbtable" -> "t1", > "numPartitions" -> "10"))) > assert(options.asConnectionProperties.isEmpty) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18419) Fix JDBCOptions.asConnectionProperties to be case-insensitive
[ https://issues.apache.org/jira/browse/SPARK-18419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18419: Assignee: (was: Apache Spark) > Fix JDBCOptions.asConnectionProperties to be case-insensitive > -- > > Key: SPARK-18419 > URL: https://issues.apache.org/jira/browse/SPARK-18419 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Dongjoon Hyun >Priority: Minor > > `JDBCOptions.asConnectionProperties` fails to filter `CaseInsensitiveMap` > correctly. For the following case, it returns `Map('numpartitions' -> "10")` > as a wrong result. > {code} > val options = new JDBCOptions(new CaseInsensitiveMap(Map( > "url" -> "jdbc:mysql://localhost:3306/temp", > "dbtable" -> "t1", > "numPartitions" -> "10"))) > assert(options.asConnectionProperties.isEmpty) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18419) Fix JDBCOptions.asConnectionProperties to be case-insensitive
[ https://issues.apache.org/jira/browse/SPARK-18419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18419: Assignee: Apache Spark > Fix JDBCOptions.asConnectionProperties to be case-insensitive > -- > > Key: SPARK-18419 > URL: https://issues.apache.org/jira/browse/SPARK-18419 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Minor > > `JDBCOptions.asConnectionProperties` fails to filter `CaseInsensitiveMap` > correctly. For the following case, it returns `Map('numpartitions' -> "10")` > as a wrong result. > {code} > val options = new JDBCOptions(new CaseInsensitiveMap(Map( > "url" -> "jdbc:mysql://localhost:3306/temp", > "dbtable" -> "t1", > "numPartitions" -> "10"))) > assert(options.asConnectionProperties.isEmpty) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18363) Connected component for large graph result is wrong
[ https://issues.apache.org/jira/browse/SPARK-18363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15659686#comment-15659686 ] Philip Adetiloye edited comment on SPARK-18363 at 11/12/16 1:47 PM: duplicate graph vertex ID causes this issue was (Author: pkadetiloye): duplicated graph vertex ID causes this issue > Connected component for large graph result is wrong > --- > > Key: SPARK-18363 > URL: https://issues.apache.org/jira/browse/SPARK-18363 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 2.0.1 >Reporter: Philip Adetiloye > > The clustering done by the GraphX connected component doesn't seem to work > correctly on large graphs. > It only works correctly on a small graph
[jira] [Closed] (SPARK-18363) Connected component for large graph result is wrong
[ https://issues.apache.org/jira/browse/SPARK-18363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Adetiloye closed SPARK-18363. Resolution: Resolved Duplicated graph vertex IDs cause this issue > Connected component for large graph result is wrong > --- > > Key: SPARK-18363 > URL: https://issues.apache.org/jira/browse/SPARK-18363 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 2.0.1 >Reporter: Philip Adetiloye > > The clustering done by the GraphX connected component doesn't seem to work > correctly on large graphs. > It only works correctly on a small graph
[jira] [Created] (SPARK-18419) Fix JDBCOptions.asConnectionProperties to be case-insensitive
Dongjoon Hyun created SPARK-18419: - Summary: Fix JDBCOptions.asConnectionProperties to be case-insensitive Key: SPARK-18419 URL: https://issues.apache.org/jira/browse/SPARK-18419 Project: Spark Issue Type: Bug Components: SQL Reporter: Dongjoon Hyun Priority: Minor `JDBCOptions.asConnectionProperties` fails to filter `CaseInsensitiveMap` correctly. For the following case, it returns `Map('numpartitions' -> "10")` as a wrong result. {code} val options = new JDBCOptions(new CaseInsensitiveMap(Map( "url" -> "jdbc:mysql://localhost:3306/temp", "dbtable" -> "t1", "numPartitions" -> "10"))) assert(options.asConnectionProperties.isEmpty) {code}
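The bug can be reproduced in miniature outside of Spark: a case-insensitive map normalizes keys (here, by lowercasing) on insertion, so filtering against the original mixed-case option names misses `numPartitions`. A minimal Python sketch of the failure mode and the fix; all names below are illustrative stand-ins, not Spark's actual implementation:

```python
class CaseInsensitiveMap(dict):
    """Toy case-insensitive map: keys are lowercased on insertion."""
    def __init__(self, data):
        super().__init__({k.lower(): v for k, v in data.items()})

# Option names the data source itself consumes and should exclude
# from the JDBC connection properties (hypothetical subset).
JDBC_OPTION_NAMES = {"url", "dbtable", "numPartitions"}

def as_connection_properties_buggy(options):
    # Bug: compares the lowercased stored keys against mixed-case names,
    # so "numpartitions" != "numPartitions" and it leaks through.
    return {k: v for k, v in options.items() if k not in JDBC_OPTION_NAMES}

def as_connection_properties_fixed(options):
    # Fix: normalize the exclusion set the same way the map normalizes keys.
    excluded = {name.lower() for name in JDBC_OPTION_NAMES}
    return {k: v for k, v in options.items() if k not in excluded}

opts = CaseInsensitiveMap({
    "url": "jdbc:mysql://localhost:3306/temp",
    "dbtable": "t1",
    "numPartitions": "10",
})
```

With the buggy filter, `numpartitions` survives (matching the `Map('numpartitions' -> "10")` result described above); with the normalized filter the result is empty, as the assertion in the issue expects.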
[jira] [Commented] (SPARK-17116) Allow params to be a {string, value} dict at fit time
[ https://issues.apache.org/jira/browse/SPARK-17116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15659590#comment-15659590 ] Aditya commented on SPARK-17116: I don't get any error when I try to use a string as the key. Here is my code: lr = LogisticRegression(maxIter=10) model = lr.fit(final, {"maxIter": 5}) Is the issue solved? > Allow params to be a {string, value} dict at fit time > - > > Key: SPARK-17116 > URL: https://issues.apache.org/jira/browse/SPARK-17116 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Manoj Kumar >Priority: Minor > > Currently, it is possible to override the default params set at constructor > time by supplying a ParamMap, which is essentially a (Param: value) dict. > Looking at the codebase, it should be trivial to extend this to a (string, > value) representation. > {code} > # This hints that the maxIter param of the lr instance is modified in-place > lr = LogisticRegression(maxIter=10, regParam=0.01) > lr.fit(dataset, {lr.maxIter: 20}) > # This seems more natural. > lr.fit(dataset, {"maxIter": 20}) > {code}
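The requested extension boils down to resolving string keys against the estimator's declared params before merging, so that both `{lr.maxIter: 20}` and `{"maxIter": 20}` work. A hedged sketch with toy classes, not PySpark's real `Params` machinery:

```python
class Param:
    """Stand-in for a declared estimator parameter."""
    def __init__(self, name):
        self.name = name

class Estimator:
    def __init__(self, **defaults):
        # Declare one Param per constructor keyword, e.g. maxIter.
        self.params = {name: Param(name) for name in defaults}
        self.values = dict(defaults)

    def _resolve(self, key):
        # Accept either a Param object or its string name.
        if isinstance(key, Param):
            return key
        if key in self.params:
            return self.params[key]
        raise KeyError(f"unknown param: {key}")

    def fit(self, dataset, param_map=None):
        # Merge overrides on top of defaults; return the merged
        # settings as a stand-in for a fitted model.
        merged = dict(self.values)
        for key, value in (param_map or {}).items():
            merged[self._resolve(key).name] = value
        return merged

lr = Estimator(maxIter=10, regParam=0.01)
```

Both spellings of the override resolve to the same Param, so `lr.fit(data, {"maxIter": 20})` and `lr.fit(data, {lr.params["maxIter"]: 20})` behave identically in this sketch.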
[jira] [Commented] (SPARK-18400) NPE when resharding Kinesis Stream
[ https://issues.apache.org/jira/browse/SPARK-18400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15659421#comment-15659421 ] Sean Owen commented on SPARK-18400: --- OK, open a pull request with that change? > NPE when resharding Kinesis Stream > -- > > Key: SPARK-18400 > URL: https://issues.apache.org/jira/browse/SPARK-18400 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.2 > Environment: Spark 1.6 streaming from AWS Kinesis >Reporter: Brian ONeill >Priority: Minor > > Occasionally, we see an NPE when we reshard our streams: > {code} > java.lang.NullPointerException > at > java.util.concurrent.ConcurrentHashMap.replaceNode(ConcurrentHashMap.java:1106) > ~[?:1.8.0_60] > at > java.util.concurrent.ConcurrentHashMap.remove(ConcurrentHashMap.java:1097) > ~[?:1.8.0_60] > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer.removeCheckpointer(KinesisCheckpointer.scala:66) > ~[spark-streaming-kinesis-asl_2.11-1.6.4-SNAPSHOT.jar:1.6.4-SNAPSHOT] > at > org.apache.spark.streaming.kinesis.KinesisReceiver.removeCheckpointer(KinesisReceiver.scala:245) > ~[spark-streaming-kinesis-asl_2.11-1.6.4-SNAPSHOT.jar:1.6.4-SNAPSHOT] > at > org.apache.spark.streaming.kinesis.KinesisRecordProcessor.shutdown(KinesisRecordProcessor.scala:124) > ~[spark-streaming-kinesis-asl_2.11-1.6.4-SNAPSHOT.jar:1.6.4-SNAPSHOT] > at > com.amazonaws.services.kinesis.clientlibrary.lib.worker.V1ToV2RecordProcessorAdapter.shutdown(V1ToV2RecordProcessorAdapter.java:48) > ~[amazon-kinesis-client-1.6.2.jar:?] > at > com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask.call(ShutdownTask.java:100) > [amazon-kinesis-client-1.6.2.jar:?] > at > com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49) > [amazon-kinesis-client-1.6.2.jar:?] 
> at > com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:24) > [amazon-kinesis-client-1.6.2.jar:?] > at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_60] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > [?:1.8.0_60] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > [?:1.8.0_60] > at java.lang.Thread.run(Thread.java:745) [?:1.8.0_60] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18213) Syntactic sugar over Pipeline API
[ https://issues.apache.org/jira/browse/SPARK-18213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18213. --- Resolution: Won't Fix > Syntactic sugar over Pipeline API > - > > Key: SPARK-18213 > URL: https://issues.apache.org/jira/browse/SPARK-18213 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.1 >Reporter: Wojciech Szymanski >Priority: Minor > > Currently, creating an ML Pipeline is based on the very verbose setStages method as > below: > {code} > val tokenizer = new RegexTokenizer() > val stopWordsRemover = new StopWordsRemover() > val countVectorizer = new CountVectorizer() > val pipeline = new Pipeline().setStages(Array(tokenizer, > stopWordsRemover, countVectorizer)) > {code} > What about a bit of syntactic sugar over the Pipeline API? > {code} > val tokenizer = new RegexTokenizer() > val stopWordsRemover = new StopWordsRemover() > val countVectorizer = new CountVectorizer() > val pipeline = tokenizer + stopWordsRemover + countVectorizer > {code} > Production code changes in > mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala: > https://github.com/apache/spark/commit/181df64bf50081f3af5a84b567b677178c88524f#diff-5226e84dea43423760dc6300ddafb01b > Scala example: > https://github.com/apache/spark/commit/181df64bf50081f3af5a84b567b677178c88524f#diff-798e85dd9107565fabab1126f57e3d6e > Java example: > https://github.com/apache/spark/commit/181df64bf50081f3af5a84b567b677178c88524f#diff-69ac857220f21b5e168d80d6dffe > Thanks in advance for your feedback.
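The proposed sugar is essentially operator overloading: `+` on a stage accumulates stages into a flat pipeline. A rough Python sketch of the idea; the class names are illustrative, not Spark's ML API:

```python
class Stage:
    """A single pipeline stage, identified by name for this sketch."""
    def __init__(self, name):
        self.name = name

    def __add__(self, other):
        # stage + stage (or pipeline + stage) builds one flat Pipeline,
        # so chaining never produces nested pipelines.
        return Pipeline(self._stages() + other._stages())

    def _stages(self):
        return [self]

class Pipeline(Stage):
    def __init__(self, stages):
        self.stages = list(stages)

    def _stages(self):
        # A pipeline contributes its contained stages, keeping order.
        return list(self.stages)

# Mirrors the proposed `tokenizer + stopWordsRemover + countVectorizer` form.
pipeline = Stage("tokenizer") + Stage("stopWordsRemover") + Stage("countVectorizer")
```

The design choice worth noting is that `Pipeline` participates in `+` itself via `_stages`, which is what keeps a long chain flat rather than left-nested.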
[jira] [Resolved] (SPARK-18402) spark: SAXParseException while writing from json to parquet on s3
[ https://issues.apache.org/jira/browse/SPARK-18402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18402. --- Resolution: Not A Problem OK, closing it on this end until there's a Spark-side action to take. > spark: SAXParseException while writing from json to parquet on s3 > - > > Key: SPARK-18402 > URL: https://issues.apache.org/jira/browse/SPARK-18402 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Submit >Affects Versions: 1.6.2, 2.0.1 > Environment: spark 2.0.1 hadoop 2.7.1 > hadoop aws 2.7.1 > ubuntu 14.04.5 on aws > mesos 1.0.1 > Java 1.7.0_111, openjdk >Reporter: Luke Miner > > I'm trying to read in some json, infer a schema, and write it out again as > parquet to s3 (s3a). For some reason, about a third of the way through the > writing portion of the run, spark always errors out with the error included > below. > I can't find any obvious reasons for the issue: > - it isn't out of memory and I have tried increasing the overhead memory > - there are no long GC pauses. > - There don't seem to be any additional error messages in the logs of the > individual executors. > - This does not appear to be a problem with badly formed json or corrupted > files. I have unzipped and read in each file individually with no error. > The script runs fine on another set of data that I have, which is of a very > similar structure, but several orders of magnitude smaller. > I am using the FileOutputCommitter. The algorithm version doesn't seem to > matter. 
> Here's a simplified version of the script: > {code} > object Foo { > def parseJson(json: String): Option[Map[String, Any]] = { > if (json == null) > Some(Map()) > else > parseOpt(json).map((j: JValue) => j.values.asInstanceOf[Map[String, > Any]]) > } > } > // read in as text and parse json using json4s > val jsonRDD: RDD[String] = sc.textFile(inputPath) > .map(row => Foo.parseJson(row)) > // infer a schema that will encapsulate the most rows in a sample of size > sampleRowNum > val schema: StructType = Infer.getMostCommonSchema(sc, jsonRDD, > sampleRowNum) > // get documents compatibility with schema > val jsonWithCompatibilityRDD: RDD[(String, Boolean)] = jsonRDD > .map(js => (js, Infer.getSchemaCompatibility(schema, > Infer.inferSchema(js)).toBoolean)) > .repartition(partitions) > val jsonCompatibleRDD: RDD[String] = jsonWithCompatibilityRDD > .filter { case (js: String, compatible: Boolean) => compatible } > .map { case (js: String, _: Boolean) => js } > // create a dataframe from documents with compatible schema > val dataFrame: DataFrame = > spark.read.schema(schema).json(jsonCompatibleRDD) > dataFrame.write.parquet("s3a://foo/foo") > {code} > It completes the earlier schema inferring steps successfully. 
The error > itself occurs on the last line, but I suppose that could encompass at least > the immediately preceding statement, if not earlier: > {code} > org.apache.spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.RuntimeException: Failed to commit task > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.org$apache$spark$sql$execution$datasources$DefaultWriterContainer$$commitTask$1(WriterContainer.scala:275) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:257) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at >
[jira] [Resolved] (SPARK-18354) Memory Leak in SQLListener and JobProgressListener
[ https://issues.apache.org/jira/browse/SPARK-18354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18354. --- Resolution: Not A Problem > Memory Leak in SQLListener and JobProgressListener > -- > > Key: SPARK-18354 > URL: https://issues.apache.org/jira/browse/SPARK-18354 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Cong Tam > Attachments: Leak_Suspects.zip, screenshot-1.png > > > There might be a memory leak in the SQLListener and JobProgressListener classes > while running Spark SQL. > Please find the attached leak suspect report.
[jira] [Commented] (SPARK-18356) Issue + Resolution: Kmeans Spark Performances (ML package)
[ https://issues.apache.org/jira/browse/SPARK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15659416#comment-15659416 ] Sean Owen commented on SPARK-18356: --- CC [~josephkb] as this was a follow-up to your comment at http://apache-spark-developers-list.1001551.n3.nabble.com/Issue-Resolution-Kmeans-Spark-Performances-ML-package-td19775.html [~zahili] are you interested in investigating quieting the warning in the case you describe? > Issue + Resolution: Kmeans Spark Performances (ML package) > -- > > Key: SPARK-18356 > URL: https://issues.apache.org/jira/browse/SPARK-18356 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.0, 2.0.1 >Reporter: zakaria hili >Priority: Minor > Labels: easyfix > > Hello, > I'm a newbie in Spark, but I think that I found a small problem that can affect > Spark KMeans performance. > Before explaining the problem, I want to describe the warning that I > faced. > I tried to use Spark KMeans with DataFrames to cluster my data > df_Part = assembler.transform(df_Part) > df_Part.cache() > while (k<=max_cluster) and (wssse > seuilStop): > kmeans = KMeans().setK(k) > model = kmeans.fit(df_Part) > wssse = model.computeCost(df_Part) > k=k+1 > but when I run the code I receive the warning: > WARN KMeans: The input data is not directly cached, which may hurt > performance if its parent RDDs are also uncached. > I searched the Spark source code to find the cause of this warning, and realized > there are two classes responsible for it: > (mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ) > (mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ) > > When my DataFrame is cached, the fit method transforms my DataFrame into an > internal RDD which is not cached. 
> Dataframe -> rdd -> run Training Kmeans Algo(rdd) > -> The first class (ml package) is responsible for converting the DataFrame into an > RDD and then calling the KMeans algorithm > -> The second class (mllib package) implements the KMeans algorithm, and there > Spark verifies whether the RDD is cached; if not, a warning is generated. > So, the solution to this problem is to cache the RDD before running the KMeans > algorithm. > https://github.com/ZakariaHili/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala > All we need is to add two lines: > cache the RDD just after the DataFrame transformation, then uncache it after > the training algorithm. > I hope that I was clear. > If you think that I was wrong, please let me know. > Sincerely, > Zakaria HILI
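The proposed fix is a cache-around-training pattern: persist the derived dataset before the iterative algorithm runs and unpersist it afterwards. A generic Python sketch of that pattern with toy objects (illustrative names, not Spark's API):

```python
class Dataset:
    """Toy dataset with cache/unpersist flags mimicking an RDD."""
    def __init__(self, rows):
        self.rows = rows
        self.cached = False

    def cache(self):
        self.cached = True
        return self

    def unpersist(self):
        self.cached = False
        return self

def train_with_caching(derived, algorithm):
    """Persist the derived dataset for the iterative algorithm, then release it."""
    derived.cache()
    try:
        # The iterative algorithm reads `derived` many times; caching avoids
        # recomputing it from the (possibly cached) parent on every pass.
        return algorithm(derived)
    finally:
        # Matches the "then uncache it after the training algorithm" step.
        derived.unpersist()
```

The `try/finally` ensures the temporary cache is always released, even if training fails, which is the behavior the two added lines in the issue's proposal aim for.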
[jira] [Assigned] (SPARK-18382) "run at null:-1" in UI when no file/line info in call site info
[ https://issues.apache.org/jira/browse/SPARK-18382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18382: Assignee: Apache Spark (was: Sean Owen) > "run at null:-1" in UI when no file/line info in call site info > --- > > Key: SPARK-18382 > URL: https://issues.apache.org/jira/browse/SPARK-18382 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.0.0 > Environment: Windows 10, Scala Eclipse Luna, Intel i3, 6gb RAM >Reporter: Emiliano Amendola >Assignee: Apache Spark >Priority: Trivial > > In my Apache Spark Web UI dashboard, I've seen a lot of these "run at null:-1" jobs, > several actually, in my particular project, which basically comprises: > connecting to a JDBC PostgreSQL server, fetching some tables, creating some temp > tables and doing some aggregations with the org.apache.spark.sql.Cube() method. > Link to image: http://i.stack.imgur.com/UEfgM.png
[jira] [Assigned] (SPARK-18382) "run at null:-1" in UI when no file/line info in call site info
[ https://issues.apache.org/jira/browse/SPARK-18382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18382: Assignee: Sean Owen (was: Apache Spark) > "run at null:-1" in UI when no file/line info in call site info > --- > > Key: SPARK-18382 > URL: https://issues.apache.org/jira/browse/SPARK-18382 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.0.0 > Environment: Windows 10, Scala Eclipse Luna, Intel i3, 6gb RAM >Reporter: Emiliano Amendola >Assignee: Sean Owen >Priority: Trivial > > In my Apache Spark Web UI dashboard, I've seen a lot of these "run at null:-1" jobs, > several actually, in my particular project, which basically comprises: > connecting to a JDBC PostgreSQL server, fetching some tables, creating some temp > tables and doing some aggregations with the org.apache.spark.sql.Cube() method. > Link to image: http://i.stack.imgur.com/UEfgM.png
[jira] [Commented] (SPARK-18382) "run at null:-1" in UI when no file/line info in call site info
[ https://issues.apache.org/jira/browse/SPARK-18382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15659413#comment-15659413 ] Apache Spark commented on SPARK-18382: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/15862 > "run at null:-1" in UI when no file/line info in call site info > --- > > Key: SPARK-18382 > URL: https://issues.apache.org/jira/browse/SPARK-18382 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.0.0 > Environment: Windows 10, Scala Eclipse Luna, Intel i3, 6gb RAM >Reporter: Emiliano Amendola >Assignee: Sean Owen >Priority: Trivial > > In my Apache Spark Web UI dashboard, I've seen a lot of these "run at null:-1" jobs, > several actually, in my particular project, which basically comprises: > connecting to a JDBC PostgreSQL server, fetching some tables, creating some temp > tables and doing some aggregations with the org.apache.spark.sql.Cube() method. > Link to image: http://i.stack.imgur.com/UEfgM.png
[jira] [Updated] (SPARK-18382) "run at null:-1" in UI when no file/line info in call site info
[ https://issues.apache.org/jira/browse/SPARK-18382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18382: -- Assignee: Sean Owen Priority: Trivial (was: Minor) Summary: "run at null:-1" in UI when no file/line info in call site info (was: What does “run at null:-1” mean in Apache Spark WEB UI?) This is easy to touch up cosmetically so it shows what it's "supposed" to, the default of ":0" instead of "null:-1". It looks like it happens when there are no debug symbols. Do you build Spark yourself and maybe strip these with flags like '-optimize'? > "run at null:-1" in UI when no file/line info in call site info > --- > > Key: SPARK-18382 > URL: https://issues.apache.org/jira/browse/SPARK-18382 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.0.0 > Environment: Windows 10, Scala Eclipse Luna, Intel i3, 6gb RAM >Reporter: Emiliano Amendola >Assignee: Sean Owen >Priority: Trivial > > In my Apache Spark Web UI dashboard, I've seen a lot of these "run at null:-1" jobs, > several actually, in my particular project, which basically comprises: > connecting to a JDBC PostgreSQL server, fetching some tables, creating some temp > tables and doing some aggregations with the org.apache.spark.sql.Cube() method. > Link to image: http://i.stack.imgur.com/UEfgM.png
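The cosmetic fix Sean describes amounts to substituting defaults when file/line call-site info is missing. A small Python sketch of that fallback; the `<unknown>` label and function name are assumptions for illustration, not Spark's exact output or code:

```python
def format_call_site(file, line):
    # Fall back to sane defaults when debug info is missing,
    # instead of rendering "null:-1" in the UI.
    if file is None or line is None or line < 0:
        return "<unknown>:0"
    return f"{file}:{line}"
```

With missing debug symbols this renders the placeholder rather than the confusing `null:-1` label reported in the issue.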
[jira] [Updated] (SPARK-18375) Upgrade netty to 4.0.42.Final
[ https://issues.apache.org/jira/browse/SPARK-18375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18375: -- Assignee: Guoqiang Li > Upgrade netty to 4.0.42.Final > -- > > Key: SPARK-18375 > URL: https://issues.apache.org/jira/browse/SPARK-18375 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Affects Versions: 1.6.2, 2.0.1 >Reporter: Guoqiang Li >Assignee: Guoqiang Li >Priority: Minor > Fix For: 2.1.0 > > > One of the important changes in 4.0.42.Final is "Support any FileRegion > implementation when using epoll transport > [#5825|https://github.com/netty/netty/pull/5825]". > In > 4.0.42.Final, [MessageWithHeader|https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageWithHeader.java] > can work properly when {{spark.(shuffle, rpc).io.mode}} is set to epoll
[jira] [Updated] (SPARK-18383) Utils.isBindCollision does not properly handle all possible address-port collisions when binding
[ https://issues.apache.org/jira/browse/SPARK-18383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18383: -- Assignee: Guoqiang Li > Utils.isBindCollision does not properly handle all possible address-port > collisions when binding > > > Key: SPARK-18383 > URL: https://issues.apache.org/jira/browse/SPARK-18383 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.1 >Reporter: Guoqiang Li >Assignee: Guoqiang Li >Priority: Minor > Fix For: 2.1.0 > > > When the IO mode is set to epoll, Netty uses {{io.netty.channel.unix.Socket}} > class, and {{Socket.bind}} throws an exception that is a > {{io.netty.channel.unix.Errors.NativeIoException}} instead of a > {{java.net.BindException}} instance -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
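The general pattern behind the fix, recognizing an address-in-use collision by its underlying error condition rather than by a single exception class, can be illustrated in Python, where the analogous check is on `errno` instead of the concrete `OSError` subtype (a sketch of the idea, not Spark's Scala code):

```python
import errno
import socket

def is_bind_collision(exc):
    # Match on the error code rather than the concrete exception class,
    # since different transports may surface EADDRINUSE under
    # different exception types (the issue's NativeIoException analogue).
    return isinstance(exc, OSError) and exc.errno == errno.EADDRINUSE

def try_bind(port):
    """Bind a TCP socket on localhost; return None on a port collision."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("127.0.0.1", port))
        return s
    except OSError as e:
        s.close()
        if is_bind_collision(e):
            return None  # caller may retry on the next port
        raise
```

A retry loop built on `try_bind` can then walk successive ports, which mirrors how `Utils.isBindCollision` is used to decide whether a bind failure is retryable.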
[jira] [Updated] (SPARK-18383) Utils.isBindCollision does not properly handle all possible address-port collisions when binding
[ https://issues.apache.org/jira/browse/SPARK-18383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18383: -- Priority: Minor (was: Major) > Utils.isBindCollision does not properly handle all possible address-port > collisions when binding > > > Key: SPARK-18383 > URL: https://issues.apache.org/jira/browse/SPARK-18383 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.1 >Reporter: Guoqiang Li >Priority: Minor > Fix For: 2.1.0 > > > When the IO mode is set to epoll, Netty uses {{io.netty.channel.unix.Socket}} > class, and {{Socket.bind}} throws an exception that is a > {{io.netty.channel.unix.Errors.NativeIoException}} instead of a > {{java.net.BindException}} instance -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18375) Upgrade netty to 4.0.42.Final
[ https://issues.apache.org/jira/browse/SPARK-18375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18375. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15830 [https://github.com/apache/spark/pull/15830] > Upgrade netty to 4.0.42.Final > -- > > Key: SPARK-18375 > URL: https://issues.apache.org/jira/browse/SPARK-18375 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Affects Versions: 1.6.2, 2.0.1 >Reporter: Guoqiang Li >Priority: Minor > Fix For: 2.1.0 > > > One of the important changes in 4.0.42.Final is "Support any FileRegion > implementation when using epoll transport > [#5825|https://github.com/netty/netty/pull/5825]". > In > 4.0.42.Final, [MessageWithHeader|https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageWithHeader.java] > can work properly when {{spark.(shuffle, rpc).io.mode}} is set to epoll
[jira] [Resolved] (SPARK-18383) Utils.isBindCollision does not properly handle all possible address-port collisions when binding
[ https://issues.apache.org/jira/browse/SPARK-18383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18383. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15830 [https://github.com/apache/spark/pull/15830] > Utils.isBindCollision does not properly handle all possible address-port > collisions when binding > > > Key: SPARK-18383 > URL: https://issues.apache.org/jira/browse/SPARK-18383 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.1 >Reporter: Guoqiang Li > Fix For: 2.1.0 > > > When the IO mode is set to epoll, Netty uses {{io.netty.channel.unix.Socket}} > class, and {{Socket.bind}} throws an exception that is a > {{io.netty.channel.unix.Errors.NativeIoException}} instead of a > {{java.net.BindException}} instance -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15659318#comment-15659318 ]

Sean Owen commented on SPARK-9487:
----------------------------------

No, it's almost certain that your changes introduced the test failure. It
keeps failing. JavaAPISuite does not fail on Jenkins in master. The problem
is that it's not 100% certain that a (real) failure in Jenkins is
reproducible in your different, local environment. This can make debugging
quite hard. Still, it's worth trying to figure out how the test would fail
based on the Jenkins output and trying to fix it; we can't merge a change
that breaks tests for the build system of reference.

> Use the same num. worker threads in Scala/Python unit tests
> -----------------------------------------------------------
>
>                 Key: SPARK-9487
>                 URL: https://issues.apache.org/jira/browse/SPARK-9487
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, Spark Core, SQL, Tests
>    Affects Versions: 1.5.0
>            Reporter: Xiangrui Meng
>              Labels: starter
>         Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use
> `local[2]` and `local` for some unit tests in SQL, MLlib, and other
> components. If an operation depends on partition IDs, e.g., a random number
> generator, this will lead to different results in Python and Scala/Java. It
> would be nice to use the same number in all unit tests.
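[Editor's note] One way the alignment proposed above could look is a single shared parallelism constant used when building the test context; a sketch under hypothetical names (nothing here is Spark's actual test helper):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical shared constant so Scala/Java suites match Python's local[4].
// Keeping partition counts identical makes partition-ID-dependent results
// (e.g. seeded random number generators) agree across languages.
object TestParallelism {
  val NumWorkerThreads = 4
  def master: String = s"local[$NumWorkerThreads]"
}

val sc = new SparkContext(
  new SparkConf().setMaster(TestParallelism.master).setAppName("unit-test"))
```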
[jira] [Commented] (SPARK-18294) Implement commit protocol to support `mapred` package's committer
[ https://issues.apache.org/jira/browse/SPARK-18294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15659269#comment-15659269 ]

Apache Spark commented on SPARK-18294:
--------------------------------------

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/15861

> Implement commit protocol to support `mapred` package's committer
> -----------------------------------------------------------------
>
>                 Key: SPARK-18294
>                 URL: https://issues.apache.org/jira/browse/SPARK-18294
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>            Reporter: Jiang Xingbo
>
> The current `FileCommitProtocol` is based on the `mapreduce` package; we
> should implement a `HadoopMapRedCommitProtocol` that supports the older
> `mapred` package's committer.
[jira] [Assigned] (SPARK-18294) Implement commit protocol to support `mapred` package's committer
[ https://issues.apache.org/jira/browse/SPARK-18294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18294:
------------------------------------

    Assignee: Apache Spark

> Implement commit protocol to support `mapred` package's committer
> -----------------------------------------------------------------
>
>                 Key: SPARK-18294
>                 URL: https://issues.apache.org/jira/browse/SPARK-18294
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>            Reporter: Jiang Xingbo
>            Assignee: Apache Spark
>
> The current `FileCommitProtocol` is based on the `mapreduce` package; we
> should implement a `HadoopMapRedCommitProtocol` that supports the older
> `mapred` package's committer.
[jira] [Assigned] (SPARK-18294) Implement commit protocol to support `mapred` package's committer
[ https://issues.apache.org/jira/browse/SPARK-18294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18294:
------------------------------------

    Assignee: (was: Apache Spark)

> Implement commit protocol to support `mapred` package's committer
> -----------------------------------------------------------------
>
>                 Key: SPARK-18294
>                 URL: https://issues.apache.org/jira/browse/SPARK-18294
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>            Reporter: Jiang Xingbo
>
> The current `FileCommitProtocol` is based on the `mapreduce` package; we
> should implement a `HadoopMapRedCommitProtocol` that supports the older
> `mapred` package's committer.
[jira] [Commented] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd
[ https://issues.apache.org/jira/browse/SPARK-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15659258#comment-15659258 ]

Dongjoon Hyun commented on SPARK-18413:
---------------------------------------

Oh, then, I'll make a PR for you. You can do the review. I'm also a
contributor to Apache Spark. :)

> Add a property to control the number of partitions when save a jdbc rdd
> -----------------------------------------------------------------------
>
>                 Key: SPARK-18413
>                 URL: https://issues.apache.org/jira/browse/SPARK-18413
>             Project: Spark
>          Issue Type: Wish
>          Components: SQL
>    Affects Versions: 2.0.1
>            Reporter: lichenglin
>
> {code}
> CREATE or replace TEMPORARY VIEW resultview
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB",
>   dbtable "result",
>   user "HIVE",
>   password "HIVE"
> );
> --set spark.sql.shuffle.partitions=200
> insert overwrite table resultview select g,count(1) as count from
> tnet.DT_LIVE_INFO group by g
> {code}
> I'm trying to save a Spark SQL result to Oracle.
> I found that Spark will create a JDBC connection for each partition.
> If the SQL creates too many partitions, the database can't hold so many
> connections and throws an exception.
> In the situation above it is 200 partitions, because of the "group by" and
> "spark.sql.shuffle.partitions".
> The relevant Spark source code in JdbcUtils is:
> {code}
> def saveTable(
>     df: DataFrame,
>     url: String,
>     table: String,
>     properties: Properties) {
>   val dialect = JdbcDialects.get(url)
>   val nullTypes: Array[Int] = df.schema.fields.map { field =>
>     getJdbcType(field.dataType, dialect).jdbcNullType
>   }
>   val rddSchema = df.schema
>   val getConnection: () => Connection = createConnectionFactory(url, properties)
>   val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, "1000").toInt
>   df.foreachPartition { iterator =>
>     savePartition(getConnection, table, iterator, rddSchema, nullTypes,
>       batchSize, dialect)
>   }
> }
> {code}
> Maybe we can add a property for df.repartition(num).foreachPartition?
> In fact I got an exception: "ORA-12519, TNS:no appropriate service handler
> found".
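[Editor's note] The workaround proposed in the report can be sketched as follows, reusing the names from the quoted `saveTable` snippet (`df`, `properties`, `getConnection`, `savePartition`, etc.). The `"maxConnections"` property name is hypothetical, not an existing Spark option:

```scala
// Sketch: cap concurrent JDBC connections by reducing partitions before
// the per-partition writes. "maxConnections" is a hypothetical property.
val maxConnections = properties.getProperty("maxConnections", "50").toInt

// coalesce (rather than repartition) avoids a full shuffle when only
// reducing the partition count.
val limited =
  if (df.rdd.getNumPartitions > maxConnections) df.coalesce(maxConnections)
  else df

limited.foreachPartition { iterator =>
  savePartition(getConnection, table, iterator, rddSchema, nullTypes,
    batchSize, dialect)
}
```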