[jira] [Assigned] (SPARK-18425) Test `CompactibleFileStreamLog` directly

2016-11-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18425:


Assignee: (was: Apache Spark)

> Test `CompactibleFileStreamLog` directly
> 
>
> Key: SPARK-18425
> URL: https://issues.apache.org/jira/browse/SPARK-18425
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming, Tests
>Affects Versions: 2.0.1
>Reporter: Liwei Lin
>Priority: Minor
>
> Right now we are testing {{CompactibleFileStreamLog}} in 
> {{FileStreamSinkLogSuite}}, because {{FileStreamSinkLog}} was once the only 
> subclass of {{CompactibleFileStreamLog}}; that is no longer the case.
> Let's do some refactoring so that {{CompactibleFileStreamLog}} is tested 
> directly, making future changes to {{CompactibleFileStreamLog}} much easier to 
> test and to review.






[jira] [Commented] (SPARK-18425) Test `CompactibleFileStreamLog` directly

2016-11-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15660880#comment-15660880
 ] 

Apache Spark commented on SPARK-18425:
--

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/15870

> Test `CompactibleFileStreamLog` directly
> 
>
> Key: SPARK-18425
> URL: https://issues.apache.org/jira/browse/SPARK-18425
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming, Tests
>Affects Versions: 2.0.1
>Reporter: Liwei Lin
>Priority: Minor
>
> Right now we are testing {{CompactibleFileStreamLog}} in 
> {{FileStreamSinkLogSuite}}, because {{FileStreamSinkLog}} was once the only 
> subclass of {{CompactibleFileStreamLog}}; that is no longer the case.
> Let's do some refactoring so that {{CompactibleFileStreamLog}} is tested 
> directly, making future changes to {{CompactibleFileStreamLog}} much easier to 
> test and to review.






[jira] [Assigned] (SPARK-18425) Test `CompactibleFileStreamLog` directly

2016-11-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18425:


Assignee: Apache Spark

> Test `CompactibleFileStreamLog` directly
> 
>
> Key: SPARK-18425
> URL: https://issues.apache.org/jira/browse/SPARK-18425
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming, Tests
>Affects Versions: 2.0.1
>Reporter: Liwei Lin
>Assignee: Apache Spark
>Priority: Minor
>
> Right now we are testing {{CompactibleFileStreamLog}} in 
> {{FileStreamSinkLogSuite}}, because {{FileStreamSinkLog}} was once the only 
> subclass of {{CompactibleFileStreamLog}}; that is no longer the case.
> Let's do some refactoring so that {{CompactibleFileStreamLog}} is tested 
> directly, making future changes to {{CompactibleFileStreamLog}} much easier to 
> test and to review.






[jira] [Created] (SPARK-18425) Test `CompactibleFileStreamLog` directly

2016-11-12 Thread Liwei Lin (JIRA)
Liwei Lin created SPARK-18425:
-

 Summary: Test `CompactibleFileStreamLog` directly
 Key: SPARK-18425
 URL: https://issues.apache.org/jira/browse/SPARK-18425
 Project: Spark
  Issue Type: Test
  Components: Structured Streaming, Tests
Affects Versions: 2.0.1
Reporter: Liwei Lin
Priority: Minor


Right now we are testing {{CompactibleFileStreamLog}} in {{FileStreamSinkLogSuite}}, 
because {{FileStreamSinkLog}} was once the only subclass of 
{{CompactibleFileStreamLog}}; that is no longer the case.

Let's do some refactoring so that {{CompactibleFileStreamLog}} is tested directly, 
making future changes to {{CompactibleFileStreamLog}} much easier to test and to 
review.
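As a rough illustration of the proposed refactoring (the names below are illustrative stand-ins, not the real {{CompactibleFileStreamLog}} API): the shared base class gets its own suite and is exercised through a minimal fake subclass, instead of being covered only indirectly via {{FileStreamSinkLogSuite}}.

{code}
import org.scalatest.FunSuite

// Stand-in for the compaction behavior shared by all subclasses.
abstract class CompactibleLogBase {
  def compactInterval: Int
  def isCompactionBatch(batchId: Long): Boolean =
    (batchId + 1) % compactInterval == 0
}

// Minimal fake subclass that exists only for the dedicated suite.
class FakeCompactibleLog(override val compactInterval: Int) extends CompactibleLogBase

class CompactibleLogBaseSuite extends FunSuite {
  test("every `compactInterval`-th batch is a compaction batch") {
    val log = new FakeCompactibleLog(compactInterval = 3)
    assert(!log.isCompactionBatch(0))
    assert(!log.isCompactionBatch(1))
    assert(log.isCompactionBatch(2))
    assert(log.isCompactionBatch(5))
  }
}
{code}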






[jira] [Commented] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd

2016-11-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15660837#comment-15660837
 ] 

Apache Spark commented on SPARK-18413:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/15868

> Add a property to control the number of partitions when save a jdbc rdd
> ---
>
> Key: SPARK-18413
> URL: https://issues.apache.org/jira/browse/SPARK-18413
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: lichenglin
>
> {code}
> CREATE or replace TEMPORARY VIEW resultview
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB",
>   dbtable "result",
>   user "HIVE",
>   password "HIVE"
> );
> --set spark.sql.shuffle.partitions=200
> insert overwrite table resultview select g,count(1) as count from 
> tnet.DT_LIVE_INFO group by g
> {code}
> I'm trying to save a Spark SQL result to Oracle.
> I found that Spark will create a JDBC connection for each partition.
> If the SQL creates too many partitions, the database can't hold that many 
> connections and returns an exception.
> In the above situation it is 200 connections, because of the "group by" and 
> "spark.sql.shuffle.partitions".
> The relevant Spark source code in JdbcUtils is:
> {code}
> def saveTable(
>     df: DataFrame,
>     url: String,
>     table: String,
>     properties: Properties) {
>   val dialect = JdbcDialects.get(url)
>   val nullTypes: Array[Int] = df.schema.fields.map { field =>
>     getJdbcType(field.dataType, dialect).jdbcNullType
>   }
>   val rddSchema = df.schema
>   val getConnection: () => Connection = createConnectionFactory(url, properties)
>   val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, "1000").toInt
>   df.foreachPartition { iterator =>
>     savePartition(getConnection, table, iterator, rddSchema, nullTypes, batchSize, dialect)
>   }
> }
> {code}
> Maybe we can add a property so that this becomes df.repartition(num).foreachPartition?
> In fact I got the exception "ORA-12519, TNS:no appropriate service handler found".
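For reference, a minimal sketch (not taken from the issue) of the kind of control being asked for, done on the user side today by capping the number of partitions before the JDBC write. The URL, table, and credentials mirror the example above; the connection limit of 10 is illustrative.

{code}
import java.util.Properties

import org.apache.spark.sql.SparkSession

object JdbcWriteWithBoundedConnections {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("jdbc-write-sketch").getOrCreate()

    // Assumed limit on concurrent connections the target database can accept.
    val maxConnections = 10

    val result = spark.sql(
      "SELECT g, count(1) AS count FROM tnet.DT_LIVE_INFO GROUP BY g")

    val props = new Properties()
    props.setProperty("user", "HIVE")
    props.setProperty("password", "HIVE")

    // coalesce caps the number of partitions, and therefore the number of
    // concurrent JDBC connections opened by foreachPartition in saveTable.
    result.coalesce(maxConnections)
      .write
      .mode("append")
      .jdbc("jdbc:oracle:thin:@10.129.10.111:1521:BKDB", "result", props)

    spark.stop()
  }
}
{code}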






[jira] [Assigned] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd

2016-11-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18413:


Assignee: Apache Spark

> Add a property to control the number of partitions when save a jdbc rdd
> ---
>
> Key: SPARK-18413
> URL: https://issues.apache.org/jira/browse/SPARK-18413
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: lichenglin
>Assignee: Apache Spark
>
> {code}
> CREATE or replace TEMPORARY VIEW resultview
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB",
>   dbtable "result",
>   user "HIVE",
>   password "HIVE"
> );
> --set spark.sql.shuffle.partitions=200
> insert overwrite table resultview select g,count(1) as count from 
> tnet.DT_LIVE_INFO group by g
> {code}
> I'm trying to save a Spark SQL result to Oracle.
> I found that Spark will create a JDBC connection for each partition.
> If the SQL creates too many partitions, the database can't hold that many 
> connections and returns an exception.
> In the above situation it is 200 connections, because of the "group by" and 
> "spark.sql.shuffle.partitions".
> The relevant Spark source code in JdbcUtils is:
> {code}
> def saveTable(
>     df: DataFrame,
>     url: String,
>     table: String,
>     properties: Properties) {
>   val dialect = JdbcDialects.get(url)
>   val nullTypes: Array[Int] = df.schema.fields.map { field =>
>     getJdbcType(field.dataType, dialect).jdbcNullType
>   }
>   val rddSchema = df.schema
>   val getConnection: () => Connection = createConnectionFactory(url, properties)
>   val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, "1000").toInt
>   df.foreachPartition { iterator =>
>     savePartition(getConnection, table, iterator, rddSchema, nullTypes, batchSize, dialect)
>   }
> }
> {code}
> Maybe we can add a property so that this becomes df.repartition(num).foreachPartition?
> In fact I got the exception "ORA-12519, TNS:no appropriate service handler found".






[jira] [Assigned] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd

2016-11-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18413:


Assignee: (was: Apache Spark)

> Add a property to control the number of partitions when save a jdbc rdd
> ---
>
> Key: SPARK-18413
> URL: https://issues.apache.org/jira/browse/SPARK-18413
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: lichenglin
>
> {code}
> CREATE or replace TEMPORARY VIEW resultview
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB",
>   dbtable "result",
>   user "HIVE",
>   password "HIVE"
> );
> --set spark.sql.shuffle.partitions=200
> insert overwrite table resultview select g,count(1) as count from 
> tnet.DT_LIVE_INFO group by g
> {code}
> I'm trying to save a Spark SQL result to Oracle.
> I found that Spark will create a JDBC connection for each partition.
> If the SQL creates too many partitions, the database can't hold that many 
> connections and returns an exception.
> In the above situation it is 200 connections, because of the "group by" and 
> "spark.sql.shuffle.partitions".
> The relevant Spark source code in JdbcUtils is:
> {code}
> def saveTable(
>     df: DataFrame,
>     url: String,
>     table: String,
>     properties: Properties) {
>   val dialect = JdbcDialects.get(url)
>   val nullTypes: Array[Int] = df.schema.fields.map { field =>
>     getJdbcType(field.dataType, dialect).jdbcNullType
>   }
>   val rddSchema = df.schema
>   val getConnection: () => Connection = createConnectionFactory(url, properties)
>   val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, "1000").toInt
>   df.foreachPartition { iterator =>
>     savePartition(getConnection, table, iterator, rddSchema, nullTypes, batchSize, dialect)
>   }
> }
> {code}
> Maybe we can add a property so that this becomes df.repartition(num).foreachPartition?
> In fact I got the exception "ORA-12519, TNS:no appropriate service handler found".






[jira] [Updated] (SPARK-18419) Fix JDBCOptions and DataSource to be case-insensitive for JDBCOptions keys

2016-11-12 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-18419:
--
Summary: Fix JDBCOptions and DataSource to be case-insensitive for 
JDBCOptions keys  (was: Fix JDBCOptions.asConnectionProperties to be 
case-insensitive )

> Fix JDBCOptions and DataSource to be case-insensitive for JDBCOptions keys
> --
>
> Key: SPARK-18419
> URL: https://issues.apache.org/jira/browse/SPARK-18419
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> `JDBCOptions.asConnectionProperties` fails to filter a `CaseInsensitiveMap` 
> correctly. For the following case, it returns `Map('numpartitions' -> "10")`, 
> which is wrong.
> {code}
> val options = new JDBCOptions(new CaseInsensitiveMap(Map(
> "url" -> "jdbc:mysql://localhost:3306/temp",
> "dbtable" -> "t1",
> "numPartitions" -> "10")))
> assert(options.asConnectionProperties.isEmpty)
> {code}






[jira] [Updated] (SPARK-18419) Fix JDBCOptions and DataSource to be case-insensitive for JDBCOptions keys

2016-11-12 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-18419:
--
Description: 
This issue aims to fix the following.

**A. Fix `JDBCOptions.asConnectionProperties` to be case-insensitive.**

`JDBCOptions.asConnectionProperties` is designed to filter JDBC options out, 
but it fails to handle `CaseInsensitiveMap` correctly. For the following 
example, it returns `Map('numpartitions' -> "10")`, which is wrong, and the 
assertion fails.

{code}
val options = new JDBCOptions(new CaseInsensitiveMap(Map(
"url" -> "jdbc:mysql://localhost:3306/temp",
"dbtable" -> "t1",
"numPartitions" -> "10")))
assert(options.asConnectionProperties.isEmpty)
{code}

**B. Fix `DataSource` to use `CaseInsensitiveMap` consistently.**

`DataSource` uses `CaseInsensitiveMap` in only part of its code path. For example, 
the following fails to find `url`.

{code}
val df = spark.createDataFrame(sparkContext.parallelize(arr2x2), schema2)
df.write.format("jdbc")
.option("URL", url1)
.option("dbtable", "TEST.SAVETEST")
.options(properties.asScala)
.save()
{code}

  was:
`JDBCOptions.asConnectionProperties` fails to filter a `CaseInsensitiveMap` 
correctly. For the following case, it returns `Map('numpartitions' -> "10")`, 
which is wrong.

{code}
val options = new JDBCOptions(new CaseInsensitiveMap(Map(
"url" -> "jdbc:mysql://localhost:3306/temp",
"dbtable" -> "t1",
"numPartitions" -> "10")))
assert(options.asConnectionProperties.isEmpty)
{code}


> Fix JDBCOptions and DataSource to be case-insensitive for JDBCOptions keys
> --
>
> Key: SPARK-18419
> URL: https://issues.apache.org/jira/browse/SPARK-18419
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> This issue aims to fix the following.
> **A. Fix `JDBCOptions.asConnectionProperties` to be case-insensitive.**
> `JDBCOptions.asConnectionProperties` is designed to filter JDBC options out, 
> but it fails to handle `CaseInsensitiveMap` correctly. For the following 
> example, it returns `Map('numpartitions' -> "10")`, which is wrong, and the 
> assertion fails.
> {code}
> val options = new JDBCOptions(new CaseInsensitiveMap(Map(
> "url" -> "jdbc:mysql://localhost:3306/temp",
> "dbtable" -> "t1",
> "numPartitions" -> "10")))
> assert(options.asConnectionProperties.isEmpty)
> {code}
> **B. Fix `DataSource` to use `CaseInsensitiveMap` consistently.**
> `DataSource` uses `CaseInsensitiveMap` in only part of its code path. For 
> example, the following fails to find `url`.
> {code}
> val df = spark.createDataFrame(sparkContext.parallelize(arr2x2), schema2)
> df.write.format("jdbc")
> .option("URL", url1)
> .option("dbtable", "TEST.SAVETEST")
> .options(properties.asScala)
> .save()
> {code}
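A standalone sketch (not the actual Spark implementation) of the kind of case-insensitive key handling described above: option keys are compared in lower case, so a mixed-case `numPartitions` is still filtered out of the connection properties. The key set here is an illustrative subset.

{code}
object CaseInsensitiveOptionsSketch {
  // Options that belong to the data source itself and must not be passed to
  // the JDBC driver as connection properties (illustrative subset).
  private val nonConnectionKeys = Set("url", "dbtable", "numpartitions")

  def asConnectionProperties(options: Map[String, String]): Map[String, String] =
    options.filter { case (key, _) => !nonConnectionKeys.contains(key.toLowerCase) }

  def main(args: Array[String]): Unit = {
    val options = Map(
      "url" -> "jdbc:mysql://localhost:3306/temp",
      "dbtable" -> "t1",
      "numPartitions" -> "10")  // note the mixed case

    // With case-insensitive filtering, nothing leaks through.
    assert(asConnectionProperties(options).isEmpty)
  }
}
{code}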






[jira] [Updated] (SPARK-18418) Make release script hadoop profiles aren't correctly specified.

2016-11-12 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-18418:
---
Fix Version/s: 2.2.0

> Make release script hadoop profiles aren't correctly specified.
> ---
>
> Key: SPARK-18418
> URL: https://issues.apache.org/jira/browse/SPARK-18418
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Project Infra
>Affects Versions: 2.1.0
>Reporter: holdenk
>Assignee: holdenk
>Priority: Critical
> Fix For: 2.1.0, 2.2.0
>
>
> Split from https://github.com/apache/spark/pull/15659/files






[jira] [Resolved] (SPARK-18418) Make release script hadoop profiles aren't correctly specified.

2016-11-12 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-18418.

   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15860
[https://github.com/apache/spark/pull/15860]

> Make release script hadoop profiles aren't correctly specified.
> ---
>
> Key: SPARK-18418
> URL: https://issues.apache.org/jira/browse/SPARK-18418
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Project Infra
>Affects Versions: 2.1.0
>Reporter: holdenk
>Assignee: holdenk
>Priority: Critical
> Fix For: 2.1.0
>
>
> Split from https://github.com/apache/spark/pull/15659/files






[jira] [Updated] (SPARK-18418) Make release script hadoop profiles aren't correctly specified.

2016-11-12 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-18418:
---
 Assignee: holdenk
Affects Version/s: 2.1.0
 Target Version/s: 2.1.0, 2.2.0
 Priority: Critical  (was: Major)

> Make release script hadoop profiles aren't correctly specified.
> ---
>
> Key: SPARK-18418
> URL: https://issues.apache.org/jira/browse/SPARK-18418
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Project Infra
>Affects Versions: 2.1.0
>Reporter: holdenk
>Assignee: holdenk
>Priority: Critical
>
> Split from https://github.com/apache/spark/pull/15659/files






[jira] [Commented] (SPARK-18418) Make release script hadoop profiles aren't correctly specified.

2016-11-12 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15660416#comment-15660416
 ] 

Josh Rosen commented on SPARK-18418:


For reference: this patch fixes a bug which was introduced in SPARK-16967 and 
affects both {{master}} and {{branch-2.1}}.

> Make release script hadoop profiles aren't correctly specified.
> ---
>
> Key: SPARK-18418
> URL: https://issues.apache.org/jira/browse/SPARK-18418
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Project Infra
>Affects Versions: 2.1.0
>Reporter: holdenk
>
> Split from https://github.com/apache/spark/pull/15659/files






[jira] [Updated] (SPARK-18418) Make release script hadoop profiles aren't correctly specified.

2016-11-12 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-18418:
---
Component/s: Project Infra

> Make release script hadoop profiles aren't correctly specified.
> ---
>
> Key: SPARK-18418
> URL: https://issues.apache.org/jira/browse/SPARK-18418
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Project Infra
>Reporter: holdenk
>
> Split from https://github.com/apache/spark/pull/15659/files






[jira] [Comment Edited] (SPARK-18424) Improve Date Parsing Functionality

2016-11-12 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15660291#comment-15660291
 ] 

Bill Chambers edited comment on SPARK-18424 at 11/12/16 10:09 PM:
--

For the record I would like to work on this one.

Define Function here:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala

Register Function here:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala


Add tests:
Here: 
https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala
Here: 
https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala


was (Author: bill_chambers):
For the record I would like to work on this one.

Define Function here:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala

Register Function here:
?


Add tests:
Here: 
https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala
Here: 
https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala

> Improve Date Parsing Functionality
> --
>
> Key: SPARK-18424
> URL: https://issues.apache.org/jira/browse/SPARK-18424
> Project: Spark
>  Issue Type: Improvement
>Reporter: Bill Chambers
>Priority: Minor
>
> I've found it quite cumbersome to work with dates in Spark so far; it can be 
> hard to reason about the time format and what type you're working with. For 
> instance, say that I have a date in the format
> {code}
> 2017-20-12
> // Y-D-M
> {code}
> In order to parse that into a Date, I have to perform several conversions:
> {code}
> to_date(
>   unix_timestamp(col("date"), dateFormat)
>     .cast("timestamp"))
>   .alias("date")
> {code}
> I propose keeping the existing to_date function but adding a variant that 
> accepts a format for the date, so that you can avoid the above conversion 
> entirely. I also propose a to_timestamp function that likewise supports a 
> format.
> It's also worth mentioning that many other databases support this. For 
> instance, MySQL has the STR_TO_DATE function, and Netezza supports the 
> to_timestamp semantics.






[jira] [Comment Edited] (SPARK-18424) Improve Date Parsing Functionality

2016-11-12 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15660291#comment-15660291
 ] 

Bill Chambers edited comment on SPARK-18424 at 11/12/16 9:31 PM:
-

For the record I would like to work on this one.

Define Function here:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala

Register Function here:
?


Add tests:
Here: 
https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala
Here: 
https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala


was (Author: bill_chambers):
For the record I would like to work on this one.

It seems that I will have to add some tests:
Here: 
https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala
Here: 
https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala

> Improve Date Parsing Functionality
> --
>
> Key: SPARK-18424
> URL: https://issues.apache.org/jira/browse/SPARK-18424
> Project: Spark
>  Issue Type: Improvement
>Reporter: Bill Chambers
>Priority: Minor
>
> I've found it quite cumbersome to work with dates in Spark so far; it can be 
> hard to reason about the time format and what type you're working with. For 
> instance, say that I have a date in the format
> {code}
> 2017-20-12
> // Y-D-M
> {code}
> In order to parse that into a Date, I have to perform several conversions:
> {code}
> to_date(
>   unix_timestamp(col("date"), dateFormat)
>     .cast("timestamp"))
>   .alias("date")
> {code}
> I propose keeping the existing to_date function but adding a variant that 
> accepts a format for the date, so that you can avoid the above conversion 
> entirely. I also propose a to_timestamp function that likewise supports a 
> format.
> It's also worth mentioning that many other databases support this. For 
> instance, MySQL has the STR_TO_DATE function, and Netezza supports the 
> to_timestamp semantics.






[jira] [Comment Edited] (SPARK-18424) Improve Date Parsing Functionality

2016-11-12 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15660291#comment-15660291
 ] 

Bill Chambers edited comment on SPARK-18424 at 11/12/16 9:30 PM:
-

For the record I would like to work on this one.

It seems that I will have to add some tests:
Here: 
https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala
Here: 
https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala


was (Author: bill_chambers):
For the record I would like to work on this one.

> Improve Date Parsing Functionality
> --
>
> Key: SPARK-18424
> URL: https://issues.apache.org/jira/browse/SPARK-18424
> Project: Spark
>  Issue Type: Improvement
>Reporter: Bill Chambers
>Priority: Minor
>
> I've found it quite cumbersome to work with dates in Spark so far; it can be 
> hard to reason about the time format and what type you're working with. For 
> instance, say that I have a date in the format
> {code}
> 2017-20-12
> // Y-D-M
> {code}
> In order to parse that into a Date, I have to perform several conversions:
> {code}
> to_date(
>   unix_timestamp(col("date"), dateFormat)
>     .cast("timestamp"))
>   .alias("date")
> {code}
> I propose keeping the existing to_date function but adding a variant that 
> accepts a format for the date, so that you can avoid the above conversion 
> entirely. I also propose a to_timestamp function that likewise supports a 
> format.
> It's also worth mentioning that many other databases support this. For 
> instance, MySQL has the STR_TO_DATE function, and Netezza supports the 
> to_timestamp semantics.






[jira] [Commented] (SPARK-18424) Improve Date Parsing Functionality

2016-11-12 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15660291#comment-15660291
 ] 

Bill Chambers commented on SPARK-18424:
---

For the record I would like to work on this one.

> Improve Date Parsing Functionality
> --
>
> Key: SPARK-18424
> URL: https://issues.apache.org/jira/browse/SPARK-18424
> Project: Spark
>  Issue Type: Improvement
>Reporter: Bill Chambers
>Priority: Minor
>
> I've found it quite cumbersome to work with dates in Spark so far; it can be 
> hard to reason about the time format and what type you're working with. For 
> instance, say that I have a date in the format
> {code}
> 2017-20-12
> // Y-D-M
> {code}
> In order to parse that into a Date, I have to perform several conversions:
> {code}
> to_date(
>   unix_timestamp(col("date"), dateFormat)
>     .cast("timestamp"))
>   .alias("date")
> {code}
> I propose keeping the existing to_date function but adding a variant that 
> accepts a format for the date, so that you can avoid the above conversion 
> entirely. I also propose a to_timestamp function that likewise supports a 
> format.
> It's also worth mentioning that many other databases support this. For 
> instance, MySQL has the STR_TO_DATE function, and Netezza supports the 
> to_timestamp semantics.






[jira] [Updated] (SPARK-18424) Improve Date Parsing Functionality

2016-11-12 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers updated SPARK-18424:
--
Summary: Improve Date Parsing Functionality  (was: Cumbersome Date 
Manipulation)

> Improve Date Parsing Functionality
> --
>
> Key: SPARK-18424
> URL: https://issues.apache.org/jira/browse/SPARK-18424
> Project: Spark
>  Issue Type: Improvement
>Reporter: Bill Chambers
>Priority: Minor
>
> I've found it quite cumbersome to work with dates in Spark so far; it can be 
> hard to reason about the time format and what type you're working with. For 
> instance, say that I have a date in the format
> {code}
> 2017-20-12
> // Y-D-M
> {code}
> In order to parse that into a Date, I have to perform several conversions:
> {code}
> to_date(
>   unix_timestamp(col("date"), dateFormat)
>     .cast("timestamp"))
>   .alias("date")
> {code}
> I propose keeping the existing to_date function but adding a variant that 
> accepts a format for the date, so that you can avoid the above conversion 
> entirely.






[jira] [Updated] (SPARK-18424) Improve Date Parsing Functionality

2016-11-12 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers updated SPARK-18424:
--
Description: 
I've found it quite cumbersome to work with dates in Spark so far; it can be hard 
to reason about the time format and what type you're working with. For instance, 
say that I have a date in the format

{code}
2017-20-12
// Y-D-M
{code}

In order to parse that into a Date, I have to perform several conversions:
{code}
to_date(
  unix_timestamp(col("date"), dateFormat)
    .cast("timestamp"))
  .alias("date")
{code}

I propose keeping the existing to_date function but adding a variant that accepts 
a format for the date, so that you can avoid the above conversion entirely. I also 
propose a to_timestamp function that likewise supports a format.

It's also worth mentioning that many other databases support this. For instance, 
MySQL has the STR_TO_DATE function, and Netezza supports the to_timestamp 
semantics.

  was:
I've found it quite cumbersome to work with dates in Spark so far; it can be hard 
to reason about the time format and what type you're working with. For instance, 
say that I have a date in the format

{code}
2017-20-12
// Y-D-M
{code}

In order to parse that into a Date, I have to perform several conversions:
{code}
to_date(
  unix_timestamp(col("date"), dateFormat)
    .cast("timestamp"))
  .alias("date")
{code}

I propose keeping the existing to_date function but adding a variant that accepts 
a format for the date, so that you can avoid the above conversion entirely.


> Improve Date Parsing Functionality
> --
>
> Key: SPARK-18424
> URL: https://issues.apache.org/jira/browse/SPARK-18424
> Project: Spark
>  Issue Type: Improvement
>Reporter: Bill Chambers
>Priority: Minor
>
> I've found it quite cumbersome to work with dates in Spark so far; it can be 
> hard to reason about the time format and what type you're working with. For 
> instance, say that I have a date in the format
> {code}
> 2017-20-12
> // Y-D-M
> {code}
> In order to parse that into a Date, I have to perform several conversions:
> {code}
> to_date(
>   unix_timestamp(col("date"), dateFormat)
>     .cast("timestamp"))
>   .alias("date")
> {code}
> I propose keeping the existing to_date function but adding a variant that 
> accepts a format for the date, so that you can avoid the above conversion 
> entirely. I also propose a to_timestamp function that likewise supports a 
> format.
> It's also worth mentioning that many other databases support this. For 
> instance, MySQL has the STR_TO_DATE function, and Netezza supports the 
> to_timestamp semantics.
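A small sketch of the proposal next to today's workaround. The two-argument to_date/to_timestamp overloads shown in the comments are the requested API, not something that exists in Spark at the time of this issue.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DateParsingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]").appName("date-parsing-sketch").getOrCreate()
    import spark.implicits._

    val dateFormat = "yyyy-dd-MM"            // the Y-D-M layout from the example
    val df = Seq("2017-20-12").toDF("date")  // one sample value

    // Today's workaround: go through unix_timestamp and a cast.
    df.select(
      to_date(unix_timestamp(col("date"), dateFormat).cast("timestamp")).alias("date")
    ).show()

    // The proposal (hypothetical overloads, not available at the time of this issue):
    //   df.select(to_date(col("date"), dateFormat).alias("date"))
    //   df.select(to_timestamp(col("date"), dateFormat))

    spark.stop()
  }
}
{code}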






[jira] [Created] (SPARK-18424) Cumbersome Date Manipulation

2016-11-12 Thread Bill Chambers (JIRA)
Bill Chambers created SPARK-18424:
-

 Summary: Cumbersome Date Manipulation
 Key: SPARK-18424
 URL: https://issues.apache.org/jira/browse/SPARK-18424
 Project: Spark
  Issue Type: Improvement
Reporter: Bill Chambers
Priority: Minor


I've found it quite cumbersome to work with dates in Spark so far; it can be hard 
to reason about the time format and what type you're working with. For instance, 
say that I have a date in the format

{code}
2017-20-12
// Y-D-M
{code}

In order to parse that into a Date, I have to perform several conversions:
{code}
to_date(
  unix_timestamp(col("date"), dateFormat)
    .cast("timestamp"))
  .alias("date")
{code}

I propose keeping the existing to_date function but adding a variant that accepts 
a format for the date, so that you can avoid the above conversion entirely.






[jira] [Assigned] (SPARK-18423) ReceiverTracker should close checkpoint dir when stopped even if it was not started

2016-11-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18423:


Assignee: (was: Apache Spark)

> ReceiverTracker should close checkpoint dir when stopped even if it was not 
> started
> ---
>
> Key: SPARK-18423
> URL: https://issues.apache.org/jira/browse/SPARK-18423
> Project: Spark
>  Issue Type: Sub-task
>  Components: DStreams
>Reporter: Hyukjin Kwon
>
> {code}
> Running org.apache.spark.streaming.JavaAPISuite
> Tests run: 53, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 22.325 sec 
> <<< FAILURE! - in org.apache.spark.streaming.JavaAPISuite
> testCheckpointMasterRecovery(org.apache.spark.streaming.JavaAPISuite)  Time 
> elapsed: 3.418 sec  <<< ERROR!
> java.io.IOException: Failed to delete: 
> C:\projects\spark\streaming\target\tmp\1474255953021-0
>   at 
> org.apache.spark.streaming.JavaAPISuite.testCheckpointMasterRecovery(JavaAPISuite.java:1808)
> {code}
> {code}
>  mapWithState - basic operations with simple API (7 seconds, 203 milliseconds)
> [info] Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.streaming.MapWithStateSuite *** ABORTED *** (12 seconds, 469 
> milliseconds)
> [info]   java.io.IOException: Failed to delete: 
> C:\projects\spark\streaming\checkpoint\spark-226c0e37-8c46-4b2a-9c0f-2317cde31d40
> {code}
> These test failures seem to be caused by files that are not closed in 
> {{ReceiverTracker}}. Please refer to the discussion in 
> https://github.com/apache/spark/pull/15618#issuecomment-259660817
> The root cause is that the tracker is created and stopped without ever being 
> started; in this case, `ReceiverTracker` does not close the checkpoint dir.






[jira] [Commented] (SPARK-18423) ReceiverTracker should close checkpoint dir when stopped even if it was not started

2016-11-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15660245#comment-15660245
 ] 

Apache Spark commented on SPARK-18423:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/15867

> ReceiverTracker should close checkpoint dir when stopped even if it was not 
> started
> ---
>
> Key: SPARK-18423
> URL: https://issues.apache.org/jira/browse/SPARK-18423
> Project: Spark
>  Issue Type: Sub-task
>  Components: DStreams
>Reporter: Hyukjin Kwon
>
> {code}
> Running org.apache.spark.streaming.JavaAPISuite
> Tests run: 53, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 22.325 sec 
> <<< FAILURE! - in org.apache.spark.streaming.JavaAPISuite
> testCheckpointMasterRecovery(org.apache.spark.streaming.JavaAPISuite)  Time 
> elapsed: 3.418 sec  <<< ERROR!
> java.io.IOException: Failed to delete: 
> C:\projects\spark\streaming\target\tmp\1474255953021-0
>   at 
> org.apache.spark.streaming.JavaAPISuite.testCheckpointMasterRecovery(JavaAPISuite.java:1808)
> {code}
> {code}
>  mapWithState - basic operations with simple API (7 seconds, 203 milliseconds)
> [info] Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.streaming.MapWithStateSuite *** ABORTED *** (12 seconds, 469 
> milliseconds)
> [info]   java.io.IOException: Failed to delete: 
> C:\projects\spark\streaming\checkpoint\spark-226c0e37-8c46-4b2a-9c0f-2317cde31d40
> {code}
> These test failures seem to be caused by files that are not closed in 
> {{ReceiverTracker}}. Please refer to the discussion in 
> https://github.com/apache/spark/pull/15618#issuecomment-259660817
> The root cause is that the tracker is created and stopped without ever being 
> started; in this case, `ReceiverTracker` does not close the checkpoint dir.






[jira] [Assigned] (SPARK-18423) ReceiverTracker should close checkpoint dir when stopped even if it was not started

2016-11-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18423:


Assignee: Apache Spark

> ReceiverTracker should close checkpoint dir when stopped even if it was not 
> started
> ---
>
> Key: SPARK-18423
> URL: https://issues.apache.org/jira/browse/SPARK-18423
> Project: Spark
>  Issue Type: Sub-task
>  Components: DStreams
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>
> {code}
> Running org.apache.spark.streaming.JavaAPISuite
> Tests run: 53, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 22.325 sec 
> <<< FAILURE! - in org.apache.spark.streaming.JavaAPISuite
> testCheckpointMasterRecovery(org.apache.spark.streaming.JavaAPISuite)  Time 
> elapsed: 3.418 sec  <<< ERROR!
> java.io.IOException: Failed to delete: 
> C:\projects\spark\streaming\target\tmp\1474255953021-0
>   at 
> org.apache.spark.streaming.JavaAPISuite.testCheckpointMasterRecovery(JavaAPISuite.java:1808)
> {code}
> {code}
>  mapWithState - basic operations with simple API (7 seconds, 203 milliseconds)
> [info] Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.streaming.MapWithStateSuite *** ABORTED *** (12 seconds, 469 
> milliseconds)
> [info]   java.io.IOException: Failed to delete: 
> C:\projects\spark\streaming\checkpoint\spark-226c0e37-8c46-4b2a-9c0f-2317cde31d40
> {code}
> These test failures seem to be caused by files that are not closed in 
> {{ReceiverTracker}}. Please refer to the discussion in 
> https://github.com/apache/spark/pull/15618#issuecomment-259660817
> The root cause is that the tracker is created and stopped without ever being 
> started; in this case, `ReceiverTracker` does not close the checkpoint dir.






[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-11-12 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15660237#comment-15660237
 ] 

Saikat Kanjilal commented on SPARK-9487:


Understood. I wanted a fresh look at this from a different dev environment, so on 
my MacBook Pro I tried changing the setting to local[2] and local[4] for 
JavaAPISuite; it seems that they both fail, so yes, mimicking the real Jenkins 
failure will be hard. Should I close this pull request until this is fixed and 
resubmit a new one? I have no idea at this point how long debugging this, or even 
replicating it, will take. Thoughts on a suitable set of next steps?

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLlib, and other 
> components. If an operation depends on partition IDs, e.g., a random number 
> generator, this will lead to different results in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.
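A small illustration (not from the issue) of the mismatch being described: when a computation seeds a random number generator from the partition ID, the same code produces different numbers under local[2] and local[4], because the default parallelism, and therefore the partitioning, differs.

{code}
import scala.util.Random

import org.apache.spark.sql.SparkSession

object PartitionDependentRandom {
  // Sums per-partition random numbers; the result depends on how the input
  // range is split into partitions, i.e. on the local[N] setting.
  def sumOfRandoms(master: String): Double = {
    val spark = SparkSession.builder().master(master).appName("seed-demo").getOrCreate()
    val total = spark.sparkContext.parallelize(1 to 1000)
      .mapPartitionsWithIndex { (pid, it) =>
        val rng = new Random(pid)            // seed depends on the partition ID
        it.map(_ => rng.nextDouble())
      }
      .sum()
    spark.stop()
    total
  }

  def main(args: Array[String]): Unit = {
    println(sumOfRandoms("local[2]"))  // Scala/Java test default
    println(sumOfRandoms("local[4]"))  // Python test default; generally differs
  }
}
{code}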






[jira] [Updated] (SPARK-18423) ReceiverTracker should close checkpoint dir when stopped even if it was not started

2016-11-12 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-18423:
-
Description: 
{code}
Running org.apache.spark.streaming.JavaAPISuite
Tests run: 53, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 22.325 sec <<< 
FAILURE! - in org.apache.spark.streaming.JavaAPISuite
testCheckpointMasterRecovery(org.apache.spark.streaming.JavaAPISuite)  Time 
elapsed: 3.418 sec  <<< ERROR!
java.io.IOException: Failed to delete: 
C:\projects\spark\streaming\target\tmp\1474255953021-0
at 
org.apache.spark.streaming.JavaAPISuite.testCheckpointMasterRecovery(JavaAPISuite.java:1808)
{code}

{code}
 mapWithState - basic operations with simple API (7 seconds, 203 milliseconds)
[info] Exception encountered when attempting to run a suite with class name: 
org.apache.spark.streaming.MapWithStateSuite *** ABORTED *** (12 seconds, 469 
milliseconds)
[info]   java.io.IOException: Failed to delete: 
C:\projects\spark\streaming\checkpoint\spark-226c0e37-8c46-4b2a-9c0f-2317cde31d40
{code}

These test failures seem to be caused by files that are not closed in 
{{ReceiverTracker}}. Please refer to the discussion in 
https://github.com/apache/spark/pull/15618#issuecomment-259660817

The root cause is that the tracker is created and stopped without ever being 
started; in this case, `ReceiverTracker` does not close the checkpoint dir.



  was:

{code}
Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.062 sec <<< 
FAILURE! - in org.apache.spark.launcher.SparkLauncherSuite
testChildProcLauncher(org.apache.spark.launcher.SparkLauncherSuite)  Time 
elapsed: 0.047 sec  <<< FAILURE!
java.lang.AssertionError: expected:<0> but was:<1>
at 
org.apache.spark.launcher.SparkLauncherSuite.testChildProcLauncher(SparkLauncherSuite.java:177)
{code}

{code}
Tests run: 53, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 22.325 sec <<< 
FAILURE! - in org.apache.spark.streaming.JavaAPISuite
testCheckpointMasterRecovery(org.apache.spark.streaming.JavaAPISuite)  Time 
elapsed: 3.418 sec  <<< ERROR!
java.io.IOException: Failed to delete: 
C:\projects\spark\streaming\target\tmp\1474255953021-0
at 
org.apache.spark.streaming.JavaAPISuite.testCheckpointMasterRecovery(JavaAPISuite.java:1808)
Running org.apache.spark.streaming.JavaDurationSuite
{code}

{code}
Running org.apache.spark.streaming.JavaAPISuite
Tests run: 53, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 22.325 sec <<< 
FAILURE! - in org.apache.spark.streaming.JavaAPISuite
testCheckpointMasterRecovery(org.apache.spark.streaming.JavaAPISuite)  Time 
elapsed: 3.418 sec  <<< ERROR!
java.io.IOException: Failed to delete: 
C:\projects\spark\streaming\target\tmp\1474255953021-0
at 
org.apache.spark.streaming.JavaAPISuite.testCheckpointMasterRecovery(JavaAPISuite.java:1808)
{code}

{code}
 mapWithState - basic operations with simple API (7 seconds, 203 milliseconds)
[info] Exception encountered when attempting to run a suite with class name: 
org.apache.spark.streaming.MapWithStateSuite *** ABORTED *** (12 seconds, 469 
milliseconds)
[info]   java.io.IOException: Failed to delete: 
C:\projects\spark\streaming\checkpoint\spark-226c0e37-8c46-4b2a-9c0f-2317cde31d40
{code}

These test failures seem to be caused by files that are not closed in 
{{ReceiverTracker}}. Please refer to the discussion in 
https://github.com/apache/spark/pull/15618#issuecomment-259660817

The root cause is that the tracker is created and stopped without ever being 
started; in this case, `ReceiverTracker` does not close the checkpoint dir.




> ReceiverTracker should close checkpoint dir when stopped even if it was not 
> started
> ---
>
> Key: SPARK-18423
> URL: https://issues.apache.org/jira/browse/SPARK-18423
> Project: Spark
>  Issue Type: Sub-task
>  Components: DStreams
>Reporter: Hyukjin Kwon
>
> {code}
> Running org.apache.spark.streaming.JavaAPISuite
> Tests run: 53, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 22.325 sec 
> <<< FAILURE! - in org.apache.spark.streaming.JavaAPISuite
> testCheckpointMasterRecovery(org.apache.spark.streaming.JavaAPISuite)  Time 
> elapsed: 3.418 sec  <<< ERROR!
> java.io.IOException: Failed to delete: 
> C:\projects\spark\streaming\target\tmp\1474255953021-0
>   at 
> org.apache.spark.streaming.JavaAPISuite.testCheckpointMasterRecovery(JavaAPISuite.java:1808)
> {code}
> {code}
>  mapWithState - basic operations with simple API (7 seconds, 203 milliseconds)
> [info] Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.streaming.MapWithStateSuite *** ABORTED *** (12 seconds, 469 
> milliseconds)
> [info]   java.io.IOException: Failed to delete: 
> C:\projects\spark\streaming\checkpoint\spark-226c0e37-8c46-4b2a-9c0f-2317cde31d40
> {code}
> These test failures seem to be caused by files that are not closed in 
> {{ReceiverTracker}}. Please refer 

[jira] [Created] (SPARK-18423) ReceiverTracker should close checkpoint dir when stopped even if it was not started

2016-11-12 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-18423:


 Summary: ReceiverTracker should close checkpoint dir when stopped 
even if it was not started
 Key: SPARK-18423
 URL: https://issues.apache.org/jira/browse/SPARK-18423
 Project: Spark
  Issue Type: Sub-task
  Components: DStreams
Reporter: Hyukjin Kwon



{code}
Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.062 sec <<< 
FAILURE! - in org.apache.spark.launcher.SparkLauncherSuite
testChildProcLauncher(org.apache.spark.launcher.SparkLauncherSuite)  Time 
elapsed: 0.047 sec  <<< FAILURE!
java.lang.AssertionError: expected:<0> but was:<1>
at 
org.apache.spark.launcher.SparkLauncherSuite.testChildProcLauncher(SparkLauncherSuite.java:177)
{code}

{code}
Tests run: 53, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 22.325 sec <<< 
FAILURE! - in org.apache.spark.streaming.JavaAPISuite
testCheckpointMasterRecovery(org.apache.spark.streaming.JavaAPISuite)  Time 
elapsed: 3.418 sec  <<< ERROR!
java.io.IOException: Failed to delete: 
C:\projects\spark\streaming\target\tmp\1474255953021-0
at 
org.apache.spark.streaming.JavaAPISuite.testCheckpointMasterRecovery(JavaAPISuite.java:1808)
Running org.apache.spark.streaming.JavaDurationSuite
{code}

{code}
Running org.apache.spark.streaming.JavaAPISuite
Tests run: 53, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 22.325 sec <<< 
FAILURE! - in org.apache.spark.streaming.JavaAPISuite
testCheckpointMasterRecovery(org.apache.spark.streaming.JavaAPISuite)  Time 
elapsed: 3.418 sec  <<< ERROR!
java.io.IOException: Failed to delete: 
C:\projects\spark\streaming\target\tmp\1474255953021-0
at 
org.apache.spark.streaming.JavaAPISuite.testCheckpointMasterRecovery(JavaAPISuite.java:1808)
{code}

{code}
 mapWithState - basic operations with simple API (7 seconds, 203 milliseconds)
[info] Exception encountered when attempting to run a suite with class name: 
org.apache.spark.streaming.MapWithStateSuite *** ABORTED *** (12 seconds, 469 
milliseconds)
[info]   java.io.IOException: Failed to delete: 
C:\projects\spark\streaming\checkpoint\spark-226c0e37-8c46-4b2a-9c0f-2317cde31d40
{code}

These test failures seem to be caused by files that are not closed in 
{{ReceiverTracker}}. Please refer to the discussion in 
https://github.com/apache/spark/pull/15618#issuecomment-259660817

The root cause is that the tracker is created and stopped without ever being 
started; in this case, `ReceiverTracker` does not close the checkpoint dir.
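A generic sketch (not the actual ReceiverTracker code) of the pattern the fix needs: stop() must release resources acquired at construction time even when start() was never called. The class and member names are illustrative.

{code}
import java.io.Closeable

final class CheckpointingComponent(checkpointDir: Closeable) {
  @volatile private var started = false

  def start(): Unit = {
    started = true
    // launch endpoints, receivers, etc.
  }

  def stop(): Unit = {
    try {
      if (started) {
        // shut down things that only exist after start()
      }
    } finally {
      // Always close what the constructor opened, even if start() never ran;
      // otherwise the checkpoint dir stays open and cannot be deleted on Windows.
      checkpointDir.close()
    }
  }
}
{code}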








[jira] [Assigned] (SPARK-18422) Fix wholeTextFiles test to pass on Windows in JavaAPISuite

2016-11-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18422:


Assignee: Apache Spark

> Fix wholeTextFiles test to pass on Windows in JavaAPISuite
> --
>
> Key: SPARK-18422
> URL: https://issues.apache.org/jira/browse/SPARK-18422
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> {code}
> Tests run: 90, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 17.53 sec 
> <<< FAILURE! - in org.apache.spark.JavaAPISuite
> wholeTextFiles(org.apache.spark.JavaAPISuite)  Time elapsed: 0.313 sec  <<< 
> FAILURE!
> java.lang.AssertionError: 
> expected: > but was:
>   at org.apache.spark.JavaAPISuite.wholeTextFiles(JavaAPISuite.java:1089)
> {code}
> The test failure in {{JavaAPISuite}} was due to the different path format on 
> Windows.






[jira] [Commented] (SPARK-18422) Fix wholeTextFiles test to pass on Windows in JavaAPISuite

2016-11-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15660163#comment-15660163
 ] 

Apache Spark commented on SPARK-18422:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/15866

> Fix wholeTextFiles test to pass on Windows in JavaAPISuite
> --
>
> Key: SPARK-18422
> URL: https://issues.apache.org/jira/browse/SPARK-18422
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> {code}
> Tests run: 90, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 17.53 sec 
> <<< FAILURE! - in org.apache.spark.JavaAPISuite
> wholeTextFiles(org.apache.spark.JavaAPISuite)  Time elapsed: 0.313 sec  <<< 
> FAILURE!
> java.lang.AssertionError: 
> expected: > but was:
>   at org.apache.spark.JavaAPISuite.wholeTextFiles(JavaAPISuite.java:1089)
> {code}
> The test failure in {{JavaAPISuite}} was due to the different path format on 
> Windows.






[jira] [Assigned] (SPARK-18422) Fix wholeTextFiles test to pass on Windows in JavaAPISuite

2016-11-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18422:


Assignee: (was: Apache Spark)

> Fix wholeTextFiles test to pass on Windows in JavaAPISuite
> --
>
> Key: SPARK-18422
> URL: https://issues.apache.org/jira/browse/SPARK-18422
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> {code}
> Tests run: 90, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 17.53 sec 
> <<< FAILURE! - in org.apache.spark.JavaAPISuite
> wholeTextFiles(org.apache.spark.JavaAPISuite)  Time elapsed: 0.313 sec  <<< 
> FAILURE!
> java.lang.AssertionError: 
> expected: > but was:
>   at org.apache.spark.JavaAPISuite.wholeTextFiles(JavaAPISuite.java:1089)
> {code}
> The test failure in {{JavaAPISuite}} was due to the different path format on 
> Windows.






[jira] [Updated] (SPARK-18422) Fix wholeTextFiles test to pass on Windows in JavaAPISuite

2016-11-12 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-18422:
-
Component/s: Spark Core

> Fix wholeTextFiles test to pass on Windows in JavaAPISuite
> --
>
> Key: SPARK-18422
> URL: https://issues.apache.org/jira/browse/SPARK-18422
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> {code}
> Tests run: 90, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 17.53 sec 
> <<< FAILURE! - in org.apache.spark.JavaAPISuite
> wholeTextFiles(org.apache.spark.JavaAPISuite)  Time elapsed: 0.313 sec  <<< 
> FAILURE!
> java.lang.AssertionError: 
> expected: > but was:
>   at org.apache.spark.JavaAPISuite.wholeTextFiles(JavaAPISuite.java:1089)
> {code}
> The test failure in {{JavaAPISuite}} was due to the different path format on 
> Windows.






[jira] [Created] (SPARK-18422) Fix wholeTextFiles test to pass on Windows in JavaAPISuite

2016-11-12 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-18422:


 Summary: Fix wholeTextFiles test to pass on Windows in JavaAPISuite
 Key: SPARK-18422
 URL: https://issues.apache.org/jira/browse/SPARK-18422
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Reporter: Hyukjin Kwon
Priority: Minor


{code}
Tests run: 90, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 17.53 sec <<< 
FAILURE! - in org.apache.spark.JavaAPISuite
wholeTextFiles(org.apache.spark.JavaAPISuite)  Time elapsed: 0.313 sec  <<< 
FAILURE!
java.lang.AssertionError: 
expected: but was:
at org.apache.spark.JavaAPISuite.wholeTextFiles(JavaAPISuite.java:1089)
{code}

The test failure in {{JavaAPISuite}} was due to the different path format on 
Windows.
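A minimal sketch (not the actual JavaAPISuite change) of one way to make such an assertion robust: compare paths after normalizing them to file: URIs rather than comparing raw strings.

{code}
import java.io.File

object PathComparisonSketch {
  // java.io.File/URI turns platform-specific separators into a canonical form.
  def normalize(path: String): String =
    new File(path).toURI.normalize().toString

  def main(args: Array[String]): Unit = {
    // On Windows both of these normalize to the same file:/C:/... URI, which is
    // what a path assertion in the suite could compare.
    println(normalize("C:\\projects\\spark\\tmp\\wholeTextFiles"))
    println(normalize("C:/projects/spark/tmp/wholeTextFiles"))
  }
}
{code}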








[jira] [Updated] (SPARK-18365) Improve Documentation for Sample Methods

2016-11-12 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers updated SPARK-18365:
--
Description: The documentation for sample is a little unintuitive. It was 
difficult to understand why I wasn't getting exactly the specified fraction of 
my total DataFrame rows. The PR clarifies the documentation for Scala, Python, 
and R to explain that this is expected behavior.  (was: The parameter 
documentation is switched.

PR coming shortly.)

> Improve Documentation for Sample Methods
> 
>
> Key: SPARK-18365
> URL: https://issues.apache.org/jira/browse/SPARK-18365
> Project: Spark
>  Issue Type: Bug
>Reporter: Bill Chambers
>
> The documentation for sample is a little unintuitive. It was difficult to 
> understand why I wasn't getting exactly the specified fraction of my total 
> DataFrame rows. The PR clarifies the documentation for Scala, Python, and R 
> to explain that this is expected behavior.
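A quick illustration of the documented behavior (a minimal sketch, assuming a 
SparkSession named {{spark}}): the fraction is a per-row inclusion probability, 
so the sampled count only approximates fraction * rows.

{code}
// minimal sketch, assuming a SparkSession named `spark` is already available
val df = spark.range(1000).toDF("id")

// `fraction` is a per-row inclusion probability, not an exact row count,
// so the result size fluctuates around 100 from run to run
val sampled = df.sample(withReplacement = false, fraction = 0.1, seed = 42)
println(sampled.count())
{code}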



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18365) Improve Documentation for Sample Methods

2016-11-12 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers updated SPARK-18365:
--
Summary: Improve Documentation for Sample Methods  (was: Improve 
Documentation for Sample Method)

> Improve Documentation for Sample Methods
> 
>
> Key: SPARK-18365
> URL: https://issues.apache.org/jira/browse/SPARK-18365
> Project: Spark
>  Issue Type: Bug
>Reporter: Bill Chambers
>
> The parameter documentation is switched.
> PR coming shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18207) class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB

2016-11-12 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15659978#comment-15659978
 ] 

Don Drake commented on SPARK-18207:
---

Hi, I was able to download a nightly SNAPSHOT release and verify that this 
resolves the issue for my project.  Thanks to everyone who contributed to this 
fix and getting it merged in a timely manner.

> class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
> 
>
> Key: SPARK-18207
> URL: https://issues.apache.org/jira/browse/SPARK-18207
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Don Drake
>Assignee: Kazuaki Ishizaki
> Fix For: 2.1.0
>
> Attachments: spark-18207.txt
>
>
> I have 2 wide dataframes that contain nested data structures. When I explode 
> one of the dataframes, it doesn't include records with an empty nested 
> structure (outer explode is not supported), so I create a similar dataframe 
> with null values and union them together. See SPARK-13721 for more details 
> as to why I have to do this.
> I was hoping that SPARK-16845 was going to address my issue, but it does not. 
> I was asked by [~lwlin] to open this JIRA.
> I will attach a code snippet that can be pasted into spark-shell that 
> duplicates my code and the exception. This worked just fine in Spark 1.6.x.
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 35 in 
> stage 5.0 failed 4 times, most recent failure: Lost task 35.3 in stage 5.0 
> (TID 812, somehost.mydomain.com, executor 8): 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
> {code}
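A hedged sketch of the union-with-nulls workaround mentioned in the description 
above; the DataFrame {{df}}, the array column {{items}}, and its element type 
{{itemType}} are illustrative assumptions, not the reporter's actual schema:

{code}
import org.apache.spark.sql.functions.{explode, lit, size}

// explode() drops rows whose nested array is empty, so union those rows back
// in with a null in place of the exploded element
val exploded = df.withColumn("item", explode(df("items")))
val emptyRows = df
  .filter(size(df("items")) === 0)
  .withColumn("item", lit(null).cast(itemType))
val combined = exploded.union(emptyRows)
{code}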



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18421) Dynamic disk allocation

2016-11-12 Thread Aniket Bhatnagar (JIRA)
Aniket Bhatnagar created SPARK-18421:


 Summary: Dynamic disk allocation
 Key: SPARK-18421
 URL: https://issues.apache.org/jira/browse/SPARK-18421
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.0.1
Reporter: Aniket Bhatnagar
Priority: Minor


The dynamic allocation feature lets you add executors and scale computation 
power. This is great; however, I feel we also need a way to dynamically scale 
storage. Currently, if the disk cannot hold the spilled/shuffle data, the job 
is aborted (in YARN, the node manager kills the container), causing frustration 
and loss of time. In deployments like AWS EMR, it is possible to run an agent 
that adds disks on the fly when it sees that the disks are running out of 
space, and it would be great if Spark could immediately start using the added 
disks, just as it does when new executors are added.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16827) Stop reporting spill metrics as shuffle metrics

2016-11-12 Thread Gaoxiang Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15659972#comment-15659972
 ] 

Gaoxiang Liu commented on SPARK-16827:
--

ping ping..

> Stop reporting spill metrics as shuffle metrics
> ---
>
> Key: SPARK-16827
> URL: https://issues.apache.org/jira/browse/SPARK-16827
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>Assignee: Brian Cho
>  Labels: performance
>
> One of our hive job which looks like this -
> {code}
>  SELECT  userid
>  FROM  table1 a
>  JOIN table2 b
>   ONa.ds = '2016-07-15'
>   AND  b.ds = '2016-07-15'
>   AND  a.source_id = b.id
> {code}
> After upgrading to Spark 2.0 the job is significantly slower. Digging a 
> little into it, we found that one of the stages produces an excessive amount 
> of shuffle data. Please note that this is a regression from Spark 1.6: stage 
> 2 of the job, which used to produce 32KB of shuffle data with 1.6, now 
> produces more than 400GB with Spark 2.0. We also tried turning off 
> whole-stage code generation but that did not help.
> PS - Even if the intermediate shuffle data size is huge, the job still 
> produces accurate output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18420) Fix the compile errors caused by checkstyle

2016-11-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18420:


Assignee: (was: Apache Spark)

> Fix the compile errors caused by checkstyle
> ---
>
> Key: SPARK-18420
> URL: https://issues.apache.org/jira/browse/SPARK-18420
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.1
>Reporter: coneyliu
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18420) Fix the compile errors caused by checkstyle

2016-11-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15659794#comment-15659794
 ] 

Apache Spark commented on SPARK-18420:
--

User 'ConeyLiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/15865

> Fix the compile errors caused by checkstyle
> ---
>
> Key: SPARK-18420
> URL: https://issues.apache.org/jira/browse/SPARK-18420
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.1
>Reporter: coneyliu
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18420) Fix the compile errors caused by checkstyle

2016-11-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15659792#comment-15659792
 ] 

Apache Spark commented on SPARK-18420:
--

User 'ConeyLiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/15864

> Fix the compile errors caused by checkstyle
> ---
>
> Key: SPARK-18420
> URL: https://issues.apache.org/jira/browse/SPARK-18420
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.1
>Reporter: coneyliu
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18420) Fix the compile errors caused by checkstyle

2016-11-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18420:


Assignee: Apache Spark

> Fix the compile errors caused by checkstyle
> ---
>
> Key: SPARK-18420
> URL: https://issues.apache.org/jira/browse/SPARK-18420
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.1
>Reporter: coneyliu
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18420) Fix the compile errors caused by checkstyle

2016-11-12 Thread coneyliu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15659788#comment-15659788
 ] 

coneyliu commented on SPARK-18420:
--

Fix the compile errors caused by checkstyle

> Fix the compile errors caused by checkstyle
> ---
>
> Key: SPARK-18420
> URL: https://issues.apache.org/jira/browse/SPARK-18420
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.1
>Reporter: coneyliu
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18420) Fix the compile errors caused by checkstyle

2016-11-12 Thread coneyliu (JIRA)
coneyliu created SPARK-18420:


 Summary: Fix the compile errors caused by checkstyle
 Key: SPARK-18420
 URL: https://issues.apache.org/jira/browse/SPARK-18420
 Project: Spark
  Issue Type: Improvement
Affects Versions: 2.0.1
Reporter: coneyliu
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18419) Fix JDBCOptions.asConnectionProperties to be case-insensitive

2016-11-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15659688#comment-15659688
 ] 

Apache Spark commented on SPARK-18419:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/15863

> Fix JDBCOptions.asConnectionProperties to be case-insensitive 
> --
>
> Key: SPARK-18419
> URL: https://issues.apache.org/jira/browse/SPARK-18419
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> `JDBCOptions.asConnectionProperties` fails to filter `CaseInsensitiveMap` 
> correctly. In the following case, it incorrectly returns 
> `Map('numpartitions' -> "10")`.
> {code}
> val options = new JDBCOptions(new CaseInsensitiveMap(Map(
> "url" -> "jdbc:mysql://localhost:3306/temp",
> "dbtable" -> "t1",
> "numPartitions" -> "10")))
> assert(options.asConnectionProperties.isEmpty)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18419) Fix JDBCOptions.asConnectionProperties to be case-insensitive

2016-11-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18419:


Assignee: (was: Apache Spark)

> Fix JDBCOptions.asConnectionProperties to be case-insensitive 
> --
>
> Key: SPARK-18419
> URL: https://issues.apache.org/jira/browse/SPARK-18419
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> `JDBCOptions.asConnectionProperties` fails to filter `CaseInsensitiveMap` 
> correctly. In the following case, it incorrectly returns 
> `Map('numpartitions' -> "10")`.
> {code}
> val options = new JDBCOptions(new CaseInsensitiveMap(Map(
> "url" -> "jdbc:mysql://localhost:3306/temp",
> "dbtable" -> "t1",
> "numPartitions" -> "10")))
> assert(options.asConnectionProperties.isEmpty)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18419) Fix JDBCOptions.asConnectionProperties to be case-insensitive

2016-11-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18419:


Assignee: Apache Spark

> Fix JDBCOptions.asConnectionProperties to be case-insensitive 
> --
>
> Key: SPARK-18419
> URL: https://issues.apache.org/jira/browse/SPARK-18419
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>
> `JDBCOptions.asConnectionProperties` fails to filter `CaseInsensitiveMap` 
> correctly. In the following case, it incorrectly returns 
> `Map('numpartitions' -> "10")`.
> {code}
> val options = new JDBCOptions(new CaseInsensitiveMap(Map(
> "url" -> "jdbc:mysql://localhost:3306/temp",
> "dbtable" -> "t1",
> "numPartitions" -> "10")))
> assert(options.asConnectionProperties.isEmpty)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18363) Connected component for large graph result is wrong

2016-11-12 Thread Philip Adetiloye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15659686#comment-15659686
 ] 

Philip Adetiloye edited comment on SPARK-18363 at 11/12/16 1:47 PM:


Duplicate graph vertex IDs cause this issue.


was (Author: pkadetiloye):
duplicated graph vertice ID causes this issue

> Connected component for large graph result is wrong
> ---
>
> Key: SPARK-18363
> URL: https://issues.apache.org/jira/browse/SPARK-18363
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.0.1
>Reporter: Philip Adetiloye
>
> The clustering done by the GraphX connected components algorithm doesn't seem 
> to work correctly with a large number of nodes.
> It only works correctly on a small graph.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-18363) Connected component for large graph result is wrong

2016-11-12 Thread Philip Adetiloye (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip Adetiloye closed SPARK-18363.

Resolution: Resolved

Duplicated graph vertex IDs cause this issue.

> Connected component for large graph result is wrong
> ---
>
> Key: SPARK-18363
> URL: https://issues.apache.org/jira/browse/SPARK-18363
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.0.1
>Reporter: Philip Adetiloye
>
> The clustering done by the GraphX connected components algorithm doesn't seem 
> to work correctly with a large number of nodes.
> It only works correctly on a small graph.
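A hedged illustration of the resolution: {{connectedComponents}} assumes vertex 
IDs are unique, so deduplicate the vertex RDD before building the graph. 
{{rawVertices}} and {{edges}} are illustrative names, not the reporter's code:

{code}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// keep exactly one attribute per vertex ID before constructing the graph
val dedupedVertices: RDD[(VertexId, String)] = rawVertices.reduceByKey((a, _) => a)
val graph = Graph(dedupedVertices, edges)
val components = graph.connectedComponents().vertices
{code}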



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18419) Fix JDBCOptions.asConnectionProperties to be case-insensitive

2016-11-12 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-18419:
-

 Summary: Fix JDBCOptions.asConnectionProperties to be 
case-insensitive 
 Key: SPARK-18419
 URL: https://issues.apache.org/jira/browse/SPARK-18419
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Dongjoon Hyun
Priority: Minor


`JDBCOptions.asConnectionProperties` fails to filter `CaseInsensitiveMap` 
correctly. In the following case, it incorrectly returns 
`Map('numpartitions' -> "10")`.

{code}
val options = new JDBCOptions(new CaseInsensitiveMap(Map(
"url" -> "jdbc:mysql://localhost:3306/temp",
"dbtable" -> "t1",
"numPartitions" -> "10")))
assert(options.asConnectionProperties.isEmpty)
{code}
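A hedged sketch of the mismatch: `CaseInsensitiveMap` stores its keys 
lower-cased, so an exact comparison against camelCase option names such as 
"numPartitions" never matches. One case-insensitive way to filter (the 
`jdbcOptionNames` set and the standalone function are illustrative, not the 
actual `JDBCOptions` internals):

{code}
import java.util.Properties

// reserved JDBC options that should not be forwarded to the driver
val jdbcOptionNames = Set("url", "dbtable", "numPartitions", "fetchsize")

def asConnectionProperties(parameters: Map[String, String]): Properties = {
  val properties = new Properties()
  parameters
    // compare case-insensitively so lower-cased keys still match "numPartitions"
    .filterKeys(key => !jdbcOptionNames.exists(_.equalsIgnoreCase(key)))
    .foreach { case (k, v) => properties.setProperty(k, v) }
  properties
}
{code}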



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17116) Allow params to be a {string, value} dict at fit time

2016-11-12 Thread Aditya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15659590#comment-15659590
 ] 

Aditya commented on SPARK-17116:


I don't get any error when I try to use a string as the key. Here is my code:

lr = LogisticRegression(maxIter=10)
model = lr.fit(final, {"maxIter": 5})

Is the issue solved?

> Allow params to be a {string, value} dict at fit time
> -
>
> Key: SPARK-17116
> URL: https://issues.apache.org/jira/browse/SPARK-17116
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Manoj Kumar
>Priority: Minor
>
> Currently, it is possible to override the default params set at constructor 
> time by supplying a ParamMap which is essentially a (Param: value) dict.
> Looking at the codebase, it should be trivial to extend this to a (string, 
> value) representation.
> {code}
> # This hints that the maxiter param of the lr instance is modified in-place
> lr = LogisticRegression(maxIter=10, regParam=0.01)
> lr.fit(dataset, {lr.maxIter: 20})
> # This seems more natural.
> lr.fit(dataset, {"maxIter": 20})
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18400) NPE when resharding Kinesis Stream

2016-11-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15659421#comment-15659421
 ] 

Sean Owen commented on SPARK-18400:
---

OK, open a pull request with that change?

> NPE when resharding Kinesis Stream
> --
>
> Key: SPARK-18400
> URL: https://issues.apache.org/jira/browse/SPARK-18400
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.6.2
> Environment: Spark 1.6 streaming from AWS Kinesis
>Reporter: Brian ONeill
>Priority: Minor
>
> Occasionally, we see an NPE when we reshard our streams:
> {code}
> java.lang.NullPointerException
>   at 
> java.util.concurrent.ConcurrentHashMap.replaceNode(ConcurrentHashMap.java:1106)
>  ~[?:1.8.0_60]
>   at 
> java.util.concurrent.ConcurrentHashMap.remove(ConcurrentHashMap.java:1097) 
> ~[?:1.8.0_60]
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.removeCheckpointer(KinesisCheckpointer.scala:66)
>  ~[spark-streaming-kinesis-asl_2.11-1.6.4-SNAPSHOT.jar:1.6.4-SNAPSHOT]
>   at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.removeCheckpointer(KinesisReceiver.scala:245)
>  ~[spark-streaming-kinesis-asl_2.11-1.6.4-SNAPSHOT.jar:1.6.4-SNAPSHOT]
>   at 
> org.apache.spark.streaming.kinesis.KinesisRecordProcessor.shutdown(KinesisRecordProcessor.scala:124)
>  ~[spark-streaming-kinesis-asl_2.11-1.6.4-SNAPSHOT.jar:1.6.4-SNAPSHOT]
>   at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.V1ToV2RecordProcessorAdapter.shutdown(V1ToV2RecordProcessorAdapter.java:48)
>  ~[amazon-kinesis-client-1.6.2.jar:?]
>   at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask.call(ShutdownTask.java:100)
>  [amazon-kinesis-client-1.6.2.jar:?]
>   at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49)
>  [amazon-kinesis-client-1.6.2.jar:?]
>   at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:24)
>  [amazon-kinesis-client-1.6.2.jar:?]
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_60]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  [?:1.8.0_60]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  [?:1.8.0_60]
>   at java.lang.Thread.run(Thread.java:745) [?:1.8.0_60]
> {code}
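Not the actual fix, just a hedged illustration of the defensive check the trace 
points at: {{ConcurrentHashMap.remove}} throws {{NullPointerException}} for a 
null key or value, so the removal can be guarded (class, field, and method 
names below are assumptions for illustration, not the KinesisCheckpointer 
internals):

{code}
import java.util.concurrent.ConcurrentHashMap

class CheckpointerRegistry[T <: AnyRef] {
  private val checkpointers = new ConcurrentHashMap[String, T]()

  def removeCheckpointer(shardId: String, checkpointer: T): Unit = {
    // ConcurrentHashMap.remove(key, value) throws NPE for a null key or value,
    // so skip the call when either is missing
    if (shardId != null && checkpointer != null) {
      checkpointers.remove(shardId, checkpointer)
    }
  }
}
{code}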



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18213) Syntactic sugar over Pipeline API

2016-11-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18213.
---
Resolution: Won't Fix

> Syntactic sugar over Pipeline API
> -
>
> Key: SPARK-18213
> URL: https://issues.apache.org/jira/browse/SPARK-18213
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.1
>Reporter: Wojciech Szymanski
>Priority: Minor
>
> Currently, creating an ML Pipeline is based on the rather verbose setStages 
> method, as below:
> {code}
> val tokenizer = new RegexTokenizer()
> val stopWordsRemover = new StopWordsRemover()
> val countVectorizer = new CountVectorizer()
> val pipeline = new Pipeline().setStages(Array(tokenizer, 
> stopWordsRemover, countVectorizer))
> {code}
> What about a bit of syntactic sugar over Pipeline API?
> {code}
> val tokenizer = new RegexTokenizer()
> val stopWordsRemover = new StopWordsRemover()
> val countVectorizer = new CountVectorizer()
> val pipeline = tokenizer + stopWordsRemover + countVectorizer
> {code}
> Production code changes in 
> mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala:
> https://github.com/apache/spark/commit/181df64bf50081f3af5a84b567b677178c88524f#diff-5226e84dea43423760dc6300ddafb01b
> Scala example:
> https://github.com/apache/spark/commit/181df64bf50081f3af5a84b567b677178c88524f#diff-798e85dd9107565fabab1126f57e3d6e
> Java example:
> https://github.com/apache/spark/commit/181df64bf50081f3af5a84b567b677178c88524f#diff-69ac857220f21b5e168d80d6dffe
> Thanks in advance for your feedback.
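A hedged sketch of one way such sugar could be written as an enrichment 
(illustrative only; not necessarily the implementation in the commits above):

{code}
import org.apache.spark.ml.{Pipeline, PipelineStage}

object PipelineSyntax {
  implicit class PipelineStageOps(val stage: PipelineStage) extends AnyVal {
    // `a + b` builds a Pipeline; chaining flattens an existing Pipeline's stages
    def +(next: PipelineStage): Pipeline = {
      val left = stage match {
        case p: Pipeline => p.getStages
        case s           => Array(s)
      }
      new Pipeline().setStages(left :+ next)
    }
  }
}
{code}

With this in scope, {{tokenizer + stopWordsRemover + countVectorizer}} builds 
the same pipeline as the {{setStages}} call.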



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18402) spark: SAXParseException while writing from json to parquet on s3

2016-11-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18402.
---
Resolution: Not A Problem

OK, closing it on this end until there's a Spark-side action to take.

> spark: SAXParseException while writing from json to parquet on s3
> -
>
> Key: SPARK-18402
> URL: https://issues.apache.org/jira/browse/SPARK-18402
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Submit
>Affects Versions: 1.6.2, 2.0.1
> Environment: spark 2.0.1 hadoop 2.7.1
> hadoop aws 2.7.1
> ubuntu 14.04.5 on aws
> mesos 1.0.1
> Java 1.7.0_111, openjdk
>Reporter: Luke Miner
>
> I'm trying to read in some json, infer a schema, and write it out again as 
> parquet to s3 (s3a). For some reason, about a third of the way through the 
> writing portion of the run, spark always errors out with the error included 
> below. 
> I can't find any obvious reasons for the issue:
> - it isn't out of memory and I have tried increasing the overhead memory
> - there are no long GC pauses.
> - There don't seem to be any additional error messages in the logs of the 
> individual executors.
> - This does not appear to be a problem with badly formed json or corrupted 
> files. I have unzipped and read in each file individually with no error.
> The script runs fine on another set of data that I have, which is of a very 
> similar structure, but several orders of magnitude smaller.
> I am using the FileOutputCommitter. The algorithm version doesn't seem to 
> matter.
> Here's a simplified version of the script:
> {code}
> object Foo {
>   def parseJson(json: String): Option[Map[String, Any]] = {
> if (json == null)
>   Some(Map())
> else
>   parseOpt(json).map((j: JValue) => j.values.asInstanceOf[Map[String, 
> Any]])
>   }
>   }
> }
> // read in as text and parse json using json4s
> val jsonRDD: RDD[String] = sc.textFile(inputPath)
> .map(row -> Foo.parseJson(row))
> // infer a schema that will encapsulate the most rows in a sample of size 
> sampleRowNum
> val schema: StructType = Infer.getMostCommonSchema(sc, jsonRDD, 
> sampleRowNum)
> // get documents compatibility with schema
> val jsonWithCompatibilityRDD: RDD[(String, Boolean)] = jsonRDD
>   .map(js => (js, Infer.getSchemaCompatibility(schema, 
> Infer.inferSchema(js)).toBoolean))
>   .repartition(partitions)
> val jsonCompatibleRDD: RDD[String] = jsonWithCompatibilityRDD
>   .filter { case (js: String, compatible: Boolean) => compatible }
>   .map { case (js: String, _: Boolean) => js }
> // create a dataframe from documents with compatible schema
> val dataFrame: DataFrame = 
> spark.read.schema(schema).json(jsonCompatibleRDD)
> dataFrame.write.parquet("s3a://foo/foo")
> {code}
> It completes the earlier schema inferring steps successfully. The error 
> itself occurs on the last line, but I suppose that could encompass at least 
> the immediately preceding statement, if not earlier:
> {code}
> org.apache.spark.SparkException: Task failed while writing rows
> at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: Failed to commit task
> at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.org$apache$spark$sql$execution$datasources$DefaultWriterContainer$$commitTask$1(WriterContainer.scala:275)
> at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:257)
> at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
> at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
> at 
> 

[jira] [Resolved] (SPARK-18354) Memory Leak in SQLListener and JobProgressListener

2016-11-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18354.
---
Resolution: Not A Problem

> Memory Leak in SQLListener and JobProgressListener
> --
>
> Key: SPARK-18354
> URL: https://issues.apache.org/jira/browse/SPARK-18354
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Cong Tam
> Attachments: Leak_Suspects.zip, screenshot-1.png
>
>
> There might be a memory leak in the SQLListener and JobProgressListener 
> classes while running Spark SQL.
> Please find the leak suspect report attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18356) Issue + Resolution: Kmeans Spark Performances (ML package)

2016-11-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15659416#comment-15659416
 ] 

Sean Owen commented on SPARK-18356:
---

CC [~josephkb] as this was a follow up to your comment at 
http://apache-spark-developers-list.1001551.n3.nabble.com/Issue-Resolution-Kmeans-Spark-Performances-ML-package-td19775.html

[~zahili] are you interested in investigating quieting the warning in the case 
you describe?

> Issue + Resolution: Kmeans Spark Performances (ML package)
> --
>
> Key: SPARK-18356
> URL: https://issues.apache.org/jira/browse/SPARK-18356
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.0, 2.0.1
>Reporter: zakaria hili
>Priority: Minor
>  Labels: easyfix
>
> Hello,
> I'm a newbie in Spark, but I think I found a small problem that can affect 
> Spark KMeans performance.
> Before starting to explain the problem, I want to explain the warning that I 
> faced.
> I tried to use Spark KMeans with DataFrames to cluster my data:
> df_Part = assembler.transform(df_Part)
> df_Part.cache()
> while (k <= max_cluster) and (wssse > seuilStop):
> kmeans = KMeans().setK(k)
> model = kmeans.fit(df_Part)
> wssse = model.computeCost(df_Part)
> k = k + 1
> but when I run the code I receive the warning:
> WARN KMeans: The input data is not directly cached, which may hurt 
> performance if its parent RDDs are also uncached.
> I searched the Spark source code to find the source of this problem, and 
> realized there are two classes responsible for this warning:
> (mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala)
> (mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala)
> When my dataframe is cached, the fit method transforms my dataframe into an 
> internal RDD which is not cached.
> Dataframe -> rdd -> run Training Kmeans Algo(rdd)
> -> The first class (ml package) is responsible for converting the dataframe 
> into an RDD and then calling the KMeans algorithm.
> -> The second class (mllib package) implements the KMeans algorithm, and here 
> Spark verifies whether the RDD is cached; if not, a warning is generated.
> So the solution to this problem is to cache the RDD before running the KMeans 
> algorithm:
> https://github.com/ZakariaHili/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
> All we need is to add two lines:
> cache the RDD just after the dataframe transformation, then uncache it after 
> the training algorithm.
> I hope that I was clear.
> If you think that I was wrong, please let me know.
> Sincerely,
> Zakaria HILI
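A hedged sketch of the two-line idea above, written as a standalone helper 
instead of a patch to {{ml.KMeans.fit}}; the helper name is illustrative and 
{{Dataset.storageLevel}} is assumed to be available (newer Spark versions):

{code}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, KMeansModel}
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.storage.StorageLevel

def fitWithCaching(dataset: DataFrame, k: Int): KMeansModel = {
  // the intermediate RDD that ml.KMeans derives from the DataFrame
  val instances = dataset.select("features").rdd.map {
    case Row(v: Vector) => OldVectors.fromML(v)
  }
  // cache the derived RDD only when the caller has not cached the input,
  // and release it once training finishes
  val handlePersistence = dataset.storageLevel == StorageLevel.NONE
  if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
  val model = new MLlibKMeans().setK(k).run(instances)
  if (handlePersistence) instances.unpersist(blocking = false)
  model
}
{code}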



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18382) "run at null:-1" in UI when no file/line info in call site info

2016-11-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18382:


Assignee: Apache Spark  (was: Sean Owen)

> "run at null:-1" in UI when no file/line info in call site info
> ---
>
> Key: SPARK-18382
> URL: https://issues.apache.org/jira/browse/SPARK-18382
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
> Environment: Windows 10, Scala Eclipse Luna, Intel i3, 6gb RAM
>Reporter: Emiliano Amendola
>Assignee: Apache Spark
>Priority: Trivial
>
> On my Apache Spark Web UI dashboard I've seen a lot of these "run at null:-1" 
> jobs, several actually, in my particular project, which basically consists 
> of: connecting to a JDBC PostgreSQL server, fetching some tables, creating 
> some temp tables, and doing some aggregations with the 
> org.apache.spark.sql.Cube() method.
> Link to image: http://i.stack.imgur.com/UEfgM.png



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18382) "run at null:-1" in UI when no file/line info in call site info

2016-11-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18382:


Assignee: Sean Owen  (was: Apache Spark)

> "run at null:-1" in UI when no file/line info in call site info
> ---
>
> Key: SPARK-18382
> URL: https://issues.apache.org/jira/browse/SPARK-18382
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
> Environment: Windows 10, Scala Eclipse Luna, Intel i3, 6gb RAM
>Reporter: Emiliano Amendola
>Assignee: Sean Owen
>Priority: Trivial
>
> On my Apache Spark Web UI dashboard I've seen a lot of these "run at null:-1" 
> jobs, several actually, in my particular project, which basically consists 
> of: connecting to a JDBC PostgreSQL server, fetching some tables, creating 
> some temp tables, and doing some aggregations with the 
> org.apache.spark.sql.Cube() method.
> Link to image: http://i.stack.imgur.com/UEfgM.png



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18382) "run at null:-1" in UI when no file/line info in call site info

2016-11-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15659413#comment-15659413
 ] 

Apache Spark commented on SPARK-18382:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/15862

> "run at null:-1" in UI when no file/line info in call site info
> ---
>
> Key: SPARK-18382
> URL: https://issues.apache.org/jira/browse/SPARK-18382
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
> Environment: Windows 10, Scala Eclipse Luna, Intel i3, 6gb RAM
>Reporter: Emiliano Amendola
>Assignee: Sean Owen
>Priority: Trivial
>
> On my Apache Spark Web UI dashboard I've seen a lot of these "run at null:-1" 
> jobs, several actually, in my particular project, which basically consists 
> of: connecting to a JDBC PostgreSQL server, fetching some tables, creating 
> some temp tables, and doing some aggregations with the 
> org.apache.spark.sql.Cube() method.
> Link to image: http://i.stack.imgur.com/UEfgM.png



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18382) "run at null:-1" in UI when no file/line info in call site info

2016-11-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18382:
--
Assignee: Sean Owen
Priority: Trivial  (was: Minor)
 Summary: "run at null:-1" in UI when no file/line info in call site info  
(was: What does “run at null:-1” mean in Apache Spark WEB UI?)

This is easy to touch up cosmetically so it shows what it's "supposed" to, the 
default of ":0" instead of "null:-1". 

It looks like it happens when there are no debug symbols. Do you build Spark 
yourself and maybe strip these with flags like '-optimize'?

> "run at null:-1" in UI when no file/line info in call site info
> ---
>
> Key: SPARK-18382
> URL: https://issues.apache.org/jira/browse/SPARK-18382
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
> Environment: Windows 10, Scala Eclipse Luna, Intel i3, 6gb RAM
>Reporter: Emiliano Amendola
>Assignee: Sean Owen
>Priority: Trivial
>
> On my Apache Spark Web UI dashboard I've seen a lot of these "run at null:-1" 
> jobs, several actually, in my particular project, which basically consists 
> of: connecting to a JDBC PostgreSQL server, fetching some tables, creating 
> some temp tables, and doing some aggregations with the 
> org.apache.spark.sql.Cube() method.
> Link to image: http://i.stack.imgur.com/UEfgM.png



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18375) Upgrade netty to 4.0.42.Final

2016-11-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18375:
--
Assignee: Guoqiang Li

> Upgrade netty to 4.0.42.Final 
> --
>
> Key: SPARK-18375
> URL: https://issues.apache.org/jira/browse/SPARK-18375
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 1.6.2, 2.0.1
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>Priority: Minor
> Fix For: 2.1.0
>
>
> One of the important changes in 4.0.42.Final is "Support any FileRegion 
> implementation when using epoll transport" 
> ([#5825|https://github.com/netty/netty/pull/5825]).
> In 4.0.42.Final, 
> [MessageWithHeader|https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageWithHeader.java]
>  can work properly when {{spark.(shuffle, rpc).io.mode}} is set to epoll.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18383) Utils.isBindCollision does not properly handle all possible address-port collisions when binding

2016-11-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18383:
--
Assignee: Guoqiang Li

> Utils.isBindCollision does not properly handle all possible address-port 
> collisions when binding
> 
>
> Key: SPARK-18383
> URL: https://issues.apache.org/jira/browse/SPARK-18383
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.1
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>Priority: Minor
> Fix For: 2.1.0
>
>
> When the IO mode is set to epoll, Netty uses the 
> {{io.netty.channel.unix.Socket}} class, and {{Socket.bind}} throws an 
> exception that is an {{io.netty.channel.unix.Errors.NativeIoException}} 
> instead of a {{java.net.BindException}} instance.
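A hedged sketch of the kind of matching a fix needs (not the actual patch): 
recognize Netty's native exception by class name in addition to 
{{java.net.BindException}}, avoiding a compile-time dependency on the epoll 
classes:

{code}
import java.net.BindException

def isBindCollision(exception: Throwable): Boolean = exception match {
  case e: BindException =>
    if (e.getMessage != null) true else isBindCollision(e.getCause)
  // Netty's epoll transport throws io.netty.channel.unix.Errors$NativeIoException
  // instead of java.net.BindException; match it without a direct class reference
  case e: Exception
      if e.getClass.getName.startsWith("io.netty.channel.unix.Errors") &&
         e.getMessage != null && e.getMessage.contains("bind") => true
  case e: Exception => isBindCollision(e.getCause)
  case _ => false
}
{code}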



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18383) Utils.isBindCollision does not properly handle all possible address-port collisions when binding

2016-11-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18383:
--
Priority: Minor  (was: Major)

> Utils.isBindCollision does not properly handle all possible address-port 
> collisions when binding
> 
>
> Key: SPARK-18383
> URL: https://issues.apache.org/jira/browse/SPARK-18383
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.1
>Reporter: Guoqiang Li
>Priority: Minor
> Fix For: 2.1.0
>
>
> When the IO mode is set to epoll, Netty uses the 
> {{io.netty.channel.unix.Socket}} class, and {{Socket.bind}} throws an 
> exception that is an {{io.netty.channel.unix.Errors.NativeIoException}} 
> instead of a {{java.net.BindException}} instance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18375) Upgrade netty to 4.0.42.Final

2016-11-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18375.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15830
[https://github.com/apache/spark/pull/15830]

> Upgrade netty to 4.0.42.Final 
> --
>
> Key: SPARK-18375
> URL: https://issues.apache.org/jira/browse/SPARK-18375
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 1.6.2, 2.0.1
>Reporter: Guoqiang Li
>Priority: Minor
> Fix For: 2.1.0
>
>
> One of the important changes in 4.0.42.Final is "Support any FileRegion 
> implementation when using epoll transport" 
> ([#5825|https://github.com/netty/netty/pull/5825]).
> In 4.0.42.Final, 
> [MessageWithHeader|https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageWithHeader.java]
>  can work properly when {{spark.(shuffle, rpc).io.mode}} is set to epoll.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18383) Utils.isBindCollision does not properly handle all possible address-port collisions when binding

2016-11-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18383.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15830
[https://github.com/apache/spark/pull/15830]

> Utils.isBindCollision does not properly handle all possible address-port 
> collisions when binding
> 
>
> Key: SPARK-18383
> URL: https://issues.apache.org/jira/browse/SPARK-18383
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.1
>Reporter: Guoqiang Li
> Fix For: 2.1.0
>
>
> When the IO mode is set to epoll, Netty uses the 
> {{io.netty.channel.unix.Socket}} class, and {{Socket.bind}} throws an 
> exception that is an {{io.netty.channel.unix.Errors.NativeIoException}} 
> instead of a {{java.net.BindException}} instance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-11-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15659318#comment-15659318
 ] 

Sean Owen commented on SPARK-9487:
--

No, it's almost certain that your changes introduced the test failure. It keeps 
failing. JavaAPISuite does not fail on Jenkins in master.
The problem is that it's not 100% certain that a (real) failure in Jenkins is 
reproducible in your different, local environment. This can make debugging 
quite hard. Still it's worth trying to figure out how the test would fail based 
on Jenkins output and try to fix it; we can't merge a change that breaks tests 
for the build system of reference.

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., a random number 
> generator, this will lead to different results in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18294) Implement commit protocol to support `mapred` package's committer

2016-11-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15659269#comment-15659269
 ] 

Apache Spark commented on SPARK-18294:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/15861

> Implement commit protocol to support `mapred` package's committer
> -
>
> Key: SPARK-18294
> URL: https://issues.apache.org/jira/browse/SPARK-18294
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Jiang Xingbo
>
> The current `FileCommitProtocol` is based on the `mapreduce` package; we 
> should implement a `HadoopMapRedCommitProtocol` that supports the older 
> `mapred` package's committer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18294) Implement commit protocol to support `mapred` package's committer

2016-11-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18294:


Assignee: Apache Spark

> Implement commit protocol to support `mapred` package's committer
> -
>
> Key: SPARK-18294
> URL: https://issues.apache.org/jira/browse/SPARK-18294
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Jiang Xingbo
>Assignee: Apache Spark
>
> The current `FileCommitProtocol` is based on the `mapreduce` package; we 
> should implement a `HadoopMapRedCommitProtocol` that supports the older 
> `mapred` package's committer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18294) Implement commit protocol to support `mapred` package's committer

2016-11-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18294:


Assignee: (was: Apache Spark)

> Implement commit protocol to support `mapred` package's committer
> -
>
> Key: SPARK-18294
> URL: https://issues.apache.org/jira/browse/SPARK-18294
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Jiang Xingbo
>
> The current `FileCommitProtocol` is based on the `mapreduce` package; we 
> should implement a `HadoopMapRedCommitProtocol` that supports the older 
> `mapred` package's committer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd

2016-11-12 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15659258#comment-15659258
 ] 

Dongjoon Hyun commented on SPARK-18413:
---

Oh, then I'll make a PR for you. You can do the review.
I'm also a contributor to Apache Spark. :)

> Add a property to control the number of partitions when save a jdbc rdd
> ---
>
> Key: SPARK-18413
> URL: https://issues.apache.org/jira/browse/SPARK-18413
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: lichenglin
>
> {code}
> CREATE or replace TEMPORARY VIEW resultview
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB",
>   dbtable "result",
>   user "HIVE",
>   password "HIVE"
> );
> --set spark.sql.shuffle.partitions=200
> insert overwrite table resultview select g,count(1) as count from 
> tnet.DT_LIVE_INFO group by g
> {code}
> I'm trying to save a Spark SQL result to Oracle.
> I found that Spark will create a JDBC connection for each partition.
> If the SQL creates too many partitions, the database can't hold that many 
> connections and returns an exception.
> In the above situation it is 200 because of the "group by" and 
> "spark.sql.shuffle.partitions".
> The Spark source code of JdbcUtils is:
> {code}
> def saveTable(
>   df: DataFrame,
>   url: String,
>   table: String,
>   properties: Properties) {
> val dialect = JdbcDialects.get(url)
> val nullTypes: Array[Int] = df.schema.fields.map { field =>
>   getJdbcType(field.dataType, dialect).jdbcNullType
> }
> val rddSchema = df.schema
> val getConnection: () => Connection = createConnectionFactory(url, 
> properties)
> val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, 
> "1000").toInt
> df.foreachPartition { iterator =>
>   savePartition(getConnection, table, iterator, rddSchema, nullTypes, 
> batchSize, dialect)
> }
>   }
> {code}
> Maybe we can add a property for df.repartition(num).foreachPartition?
> In fact I got the exception "ORA-12519, TNS:no appropriate service handler 
> found".
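Until such a property exists, a hedged sketch of a workaround on the DataFrame 
side ({{result}} and the limit of 10 are illustrative): coalesce before writing 
so only a bounded number of JDBC connections is opened.

{code}
import java.util.Properties

val connectionProperties = new Properties()
connectionProperties.setProperty("user", "HIVE")
connectionProperties.setProperty("password", "HIVE")

// at most 10 partitions means at most 10 concurrent Oracle connections
result.coalesce(10)
  .write
  .mode("overwrite")
  .jdbc("jdbc:oracle:thin:@10.129.10.111:1521:BKDB", "result", connectionProperties)
{code}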



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org