[jira] [Commented] (SPARK-22883) ML test for StructuredStreaming: spark.ml.feature, A-M

2018-03-01 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383242#comment-16383242
 ] 

Joseph K. Bradley commented on SPARK-22883:
---

Merged part 1 of 2 to master and branch-2.3

> ML test for StructuredStreaming: spark.ml.feature, A-M
> --
>
> Key: SPARK-22883
> URL: https://issues.apache.org/jira/browse/SPARK-22883
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
>
> *For featurizers with names from A - M*
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843
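To make the intent concrete, here is a minimal, self-contained sketch of the kind of
check such suites perform (it is not the exact helper added in the linked PR): run a
configured Transformer over a MemoryStream and compare the sink contents with the
batch output. Tokenizer stands in for any of the A-M featurizers.

{code}
// Sketch only: exercise an ML Transformer on a streaming DataFrame and verify that
// the output matches the batch result.
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.streaming.MemoryStream

object StreamingTransformerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("ml-streaming-sketch").getOrCreate()
    import spark.implicits._
    implicit val sqlCtx = spark.sqlContext

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")

    // Batch reference result.
    val batch = Seq("structured streaming test", "spark ml feature").toDF("text")
    val expected = tokenizer.transform(batch).select("words").as[Seq[String]].collect().toSet

    // Same rows through a streaming source and a memory sink.
    val input = MemoryStream[String]
    val query = tokenizer.transform(input.toDF().toDF("text"))
      .select("words")
      .writeStream.format("memory").queryName("tokenized").outputMode("append").start()
    input.addData("structured streaming test", "spark ml feature")
    query.processAllAvailable()

    val streamed = spark.table("tokenized").as[Seq[String]].collect().toSet
    assert(streamed == expected, "streaming output should match the batch output")
    query.stop()
    spark.stop()
  }
}
{code}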






[jira] [Commented] (SPARK-21209) Implement Incremental PCA algorithm for ML

2018-03-01 Thread Sandeep Kumar Choudhary (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383228#comment-16383228
 ] 

Sandeep Kumar Choudhary commented on SPARK-21209:
-

Hi Ben St. Clair. I have implemented an Incremental PCA model based on:
`D. Ross, J. Lim, R. Lin, M. Yang, Incremental Learning for Robust Visual
Tracking, International Journal of Computer Vision, Volume 77, Issue 1-3,
pp. 125-141, May 2008.`
 
See http://www.cs.toronto.edu/~dross/ivt/RossLimLinYang_ijcv.pdf

I would like to discuss this issue.

> Implement Incremental PCA algorithm for ML
> --
>
> Key: SPARK-21209
> URL: https://issues.apache.org/jira/browse/SPARK-21209
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Ben St. Clair
>Priority: Major
>  Labels: features
>
> Incremental Principal Component Analysis is a method for calculating PCAs in 
> an incremental fashion, allowing one to update an existing PCA model as new 
> evidence arrives. Furthermore, an alpha parameter can be used to enable 
> task-specific weighting of new and old evidence.
> This algorithm would be useful for streaming applications, where a fast and 
> adaptive feature subspace calculation could be applied. Furthermore, it can 
> be applied to combine PCAs from subcomponents of large datasets.
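As a rough illustration of the incremental idea (a simplified sketch, not the SVD-update
algorithm of the Ross et al. paper; it assumes Breeze, which Spark already depends on,
and uses a fixed alpha to blend new and old evidence):

{code}
// Simplified sketch: keep a decaying estimate of the mean and covariance and re-derive
// the principal components after each batch. The real incremental PCA updates an SVD
// directly instead of materializing the d x d covariance.
import breeze.linalg.{DenseMatrix, DenseVector, eigSym}

class IncrementalPcaSketch(dim: Int, alpha: Double) {
  private var mean = DenseVector.zeros[Double](dim)
  private var cov  = DenseMatrix.zeros[Double](dim, dim)
  private var initialized = false

  /** Fold a new batch (rows = observations, columns = features) into the estimate. */
  def update(batch: DenseMatrix[Double]): Unit = {
    val n = batch.rows.toDouble
    val ones = DenseVector.ones[Double](batch.rows)
    val batchMean = (batch.t * ones) * (1.0 / n)
    val centered  = batch - (ones * batchMean.t)        // subtract the mean from every row
    val batchCov  = (centered.t * centered) * (1.0 / n)
    if (!initialized) {
      mean = batchMean; cov = batchCov; initialized = true
    } else {
      mean = mean * (1.0 - alpha) + batchMean * alpha   // alpha weights new vs. old evidence
      cov  = cov  * (1.0 - alpha) + batchCov  * alpha
    }
  }

  /** Top-k principal components (as columns), from the eigendecomposition of the covariance. */
  def components(k: Int): DenseMatrix[Double] = {
    val es = eigSym(cov)                                // eigenvalues in ascending order
    es.eigenvectors(::, (dim - k) until dim)
  }
}
{code}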






[jira] [Assigned] (SPARK-22883) ML test for StructuredStreaming: spark.ml.feature, A-M

2018-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22883:


Assignee: Joseph K. Bradley  (was: Apache Spark)

> ML test for StructuredStreaming: spark.ml.feature, A-M
> --
>
> Key: SPARK-22883
> URL: https://issues.apache.org/jira/browse/SPARK-22883
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
>
> *For featurizers with names from A - M*
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843






[jira] [Assigned] (SPARK-22883) ML test for StructuredStreaming: spark.ml.feature, A-M

2018-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22883:


Assignee: Apache Spark  (was: Joseph K. Bradley)

> ML test for StructuredStreaming: spark.ml.feature, A-M
> --
>
> Key: SPARK-22883
> URL: https://issues.apache.org/jira/browse/SPARK-22883
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Major
>
> *For featurizers with names from A - M*
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843






[jira] [Reopened] (SPARK-22883) ML test for StructuredStreaming: spark.ml.feature, A-M

2018-03-01 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reopened SPARK-22883:
---

> ML test for StructuredStreaming: spark.ml.feature, A-M
> --
>
> Key: SPARK-22883
> URL: https://issues.apache.org/jira/browse/SPARK-22883
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
>
> *For featurizers with names from A - M*
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843






[jira] [Updated] (SPARK-22883) ML test for StructuredStreaming: spark.ml.feature, A-M

2018-03-01 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22883:
--
Fix Version/s: (was: 2.4.0)

> ML test for StructuredStreaming: spark.ml.feature, A-M
> --
>
> Key: SPARK-22883
> URL: https://issues.apache.org/jira/browse/SPARK-22883
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
>
> *For featurizers with names from A - M*
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843






[jira] [Assigned] (SPARK-22883) ML test for StructuredStreaming: spark.ml.feature, A-M

2018-03-01 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-22883:
-

Assignee: Joseph K. Bradley

> ML test for StructuredStreaming: spark.ml.feature, A-M
> --
>
> Key: SPARK-22883
> URL: https://issues.apache.org/jira/browse/SPARK-22883
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
> Fix For: 2.4.0
>
>
> *For featurizers with names from A - M*
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843






[jira] [Resolved] (SPARK-22883) ML test for StructuredStreaming: spark.ml.feature, A-M

2018-03-01 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-22883.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20111
[https://github.com/apache/spark/pull/20111]

> ML test for StructuredStreaming: spark.ml.feature, A-M
> --
>
> Key: SPARK-22883
> URL: https://issues.apache.org/jira/browse/SPARK-22883
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
> Fix For: 2.4.0
>
>
> *For featurizers with names from A - M*
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843






[jira] [Commented] (SPARK-23434) Spark should not warn `metadata directory` for a HDFS file path

2018-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383157#comment-16383157
 ] 

Apache Spark commented on SPARK-23434:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/20715

> Spark should not warn `metadata directory` for a HDFS file path
> ---
>
> Key: SPARK-23434
> URL: https://issues.apache.org/jira/browse/SPARK-23434
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.4.0
>
>
> In a kerberized cluster, when Spark reads a file path (e.g. `people.json`), 
> it logs a misleading warning while looking up 
> `people.json/_spark_metadata`. The root cause of this situation is the 
> difference between `LocalFileSystem` and `DistributedFileSystem`: 
> `LocalFileSystem.exists()` returns `false`, but 
> `DistributedFileSystem.exists` raises an exception.
> {code}
> scala> spark.version
> res0: String = 2.4.0-SNAPSHOT
> scala> 
> spark.read.json("file:///usr/hdp/current/spark-client/examples/src/main/resources/people.json").show
> ++---+
> | age|   name|
> ++---+
> |null|Michael|
> |  30|   Andy|
> |  19| Justin|
> ++---+
> scala> spark.read.json("hdfs:///tmp/people.json")
> 18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
> metadata directory.
> 18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
> metadata directory.
> res6: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
> {code}
> {code}
> scala> spark.version
> res0: String = 2.2.1
> scala> spark.read.json("hdfs:///tmp/people.json").show
> 18/02/15 05:28:02 WARN FileStreamSink: Error while looking for metadata 
> directory.
> 18/02/15 05:28:02 WARN FileStreamSink: Error while looking for metadata 
> directory.
> {code}
> {code}
> scala> spark.version
> res0: String = 2.1.2
> scala> spark.read.json("hdfs:///tmp/people.json").show
> 18/02/15 05:29:53 WARN DataSource: Error while looking for metadata directory.
> ++---+
> | age|   name|
> ++---+
> |null|Michael|
> |  30|   Andy|
> |  19| Justin|
> ++---+
> {code}
> {code}
> scala> spark.version
> res0: String = 2.0.2
> scala> spark.read.json("hdfs:///tmp/people.json").show
> 18/02/15 05:25:24 WARN DataSource: Error while looking for metadata directory.
> {code}
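A small sketch of the defensive check the description implies (illustrative only, not
the actual change in the linked pull requests): treat an exception from `exists()` the
same as a missing metadata directory instead of surfacing it as a warning.

{code}
// Probe for `_spark_metadata` in a way that tolerates both filesystem behaviors:
// LocalFileSystem returns false for a missing path, while DistributedFileSystem
// (especially on a kerberized cluster) may throw instead.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

def hasSparkMetadata(dir: String, hadoopConf: Configuration): Boolean = {
  val metadataPath = new Path(dir, "_spark_metadata")
  try {
    metadataPath.getFileSystem(hadoopConf).exists(metadataPath)
  } catch {
    case _: Exception => false  // no metadata directory; do not warn for a plain file path
  }
}
{code}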






[jira] [Commented] (SPARK-23457) Register task completion listeners first for ParquetFileFormat

2018-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383156#comment-16383156
 ] 

Apache Spark commented on SPARK-23457:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/20714

> Register task completion listeners first for ParquetFileFormat
> --
>
> Key: SPARK-23457
> URL: https://issues.apache.org/jira/browse/SPARK-23457
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.4.0
>
>
> ParquetFileFormat leaks open files in some cases. This issue aims to register 
> the task completion listener first.
> {code}
>   test("SPARK-23390 Register task completion listeners first in 
> ParquetFileFormat") {
> withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_BATCH_SIZE.key -> 
> s"${Int.MaxValue}") {
>   withTempDir { dir =>
> val basePath = dir.getCanonicalPath
> Seq(0).toDF("a").write.format("parquet").save(new Path(basePath, 
> "first").toString)
> Seq(1).toDF("a").write.format("parquet").save(new Path(basePath, 
> "second").toString)
> val df = spark.read.parquet(
>   new Path(basePath, "first").toString,
>   new Path(basePath, "second").toString)
> val e = intercept[SparkException] {
>   df.collect()
> }
> assert(e.getCause.isInstanceOf[OutOfMemoryError])
>   }
> }
>   }
> {code}
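The direction of the fix can be sketched as follows (illustrative, not the actual
ParquetFileFormat patch): register the cleanup listener before any initialization that
can throw, so the reader is closed even when the task fails part-way.

{code}
// Sketch: attach the task completion listener first, then run the failure-prone setup.
import org.apache.spark.TaskContext
import org.apache.spark.util.TaskCompletionListener

def withTaskCleanup[T <: AutoCloseable](resource: T)(init: T => Unit): T = {
  // Registered before `init`, so the resource is released even if `init` throws.
  Option(TaskContext.get()).foreach { ctx =>
    ctx.addTaskCompletionListener(new TaskCompletionListener {
      override def onTaskCompletion(context: TaskContext): Unit = resource.close()
    })
  }
  init(resource)  // e.g. initializing the reader or enabling the vectorized batch
  resource
}
{code}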






[jira] [Commented] (SPARK-23434) Spark should not warn `metadata directory` for a HDFS file path

2018-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383153#comment-16383153
 ] 

Apache Spark commented on SPARK-23434:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/20713

> Spark should not warn `metadata directory` for a HDFS file path
> ---
>
> Key: SPARK-23434
> URL: https://issues.apache.org/jira/browse/SPARK-23434
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.4.0
>
>
> In a kerberized cluster, when Spark reads a file path (e.g. `people.json`), 
> it logs a misleading warning while looking up 
> `people.json/_spark_metadata`. The root cause of this situation is the 
> difference between `LocalFileSystem` and `DistributedFileSystem`: 
> `LocalFileSystem.exists()` returns `false`, but 
> `DistributedFileSystem.exists` raises an exception.
> {code}
> scala> spark.version
> res0: String = 2.4.0-SNAPSHOT
> scala> 
> spark.read.json("file:///usr/hdp/current/spark-client/examples/src/main/resources/people.json").show
> ++---+
> | age|   name|
> ++---+
> |null|Michael|
> |  30|   Andy|
> |  19| Justin|
> ++---+
> scala> spark.read.json("hdfs:///tmp/people.json")
> 18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
> metadata directory.
> 18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for 
> metadata directory.
> res6: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
> {code}
> {code}
> scala> spark.version
> res0: String = 2.2.1
> scala> spark.read.json("hdfs:///tmp/people.json").show
> 18/02/15 05:28:02 WARN FileStreamSink: Error while looking for metadata 
> directory.
> 18/02/15 05:28:02 WARN FileStreamSink: Error while looking for metadata 
> directory.
> {code}
> {code}
> scala> spark.version
> res0: String = 2.1.2
> scala> spark.read.json("hdfs:///tmp/people.json").show
> 18/02/15 05:29:53 WARN DataSource: Error while looking for metadata directory.
> ++---+
> | age|   name|
> ++---+
> |null|Michael|
> |  30|   Andy|
> |  19| Justin|
> ++---+
> {code}
> {code}
> scala> spark.version
> res0: String = 2.0.2
> scala> spark.read.json("hdfs:///tmp/people.json").show
> 18/02/15 05:25:24 WARN DataSource: Error while looking for metadata directory.
> {code}






[jira] [Commented] (SPARK-23554) Hive's textinputformat.record.delimiter equivalent in Spark

2018-03-01 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383146#comment-16383146
 ] 

Hyukjin Kwon commented on SPARK-23554:
--

I think it's a duplicate of SPARK-21289?

> Hive's textinputformat.record.delimiter equivalent in Spark
> ---
>
> Key: SPARK-23554
> URL: https://issues.apache.org/jira/browse/SPARK-23554
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: csv, csvparser
>
> It would be great if Spark supported an option similar to Hive's 
> {{textinputformat.record.delimiter}} in the spark-csv reader.
> We currently have to create Hive tables to work around this functionality 
> missing natively in Spark.
> {{textinputformat.record.delimiter}} was introduced back in 2011, in the 
> map-reduce era - see MAPREDUCE-2254.
> As an example, one of the most common use cases for us involving 
> {{textinputformat.record.delimiter}} is reading multiple lines of text that 
> make up a "record". The number of actual lines per "record" varies, so 
> {{textinputformat.record.delimiter}} is a great solution for us to process 
> these files natively in Hadoop/Spark (a custom .map() function then does the 
> actual processing of those records), and we convert the result to a DataFrame.
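For reference, the workaround most of us use today looks roughly like this (a
spark-shell style sketch; the path, the "||" record delimiter, and the column name are
illustrative):

{code}
// Read custom-delimited "records" via the Hadoop API, then build a DataFrame.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val hadoopConf = new Configuration(spark.sparkContext.hadoopConfiguration)
hadoopConf.set("textinputformat.record.delimiter", "||\n")  // multi-line records end with "||"

val records = spark.sparkContext
  .newAPIHadoopFile("/data/multiline_records.txt",
    classOf[TextInputFormat], classOf[LongWritable], classOf[Text], hadoopConf)
  .map { case (_, text) => text.toString }                  // copy out of the reused Text object

import spark.implicits._
val df = records.map(_.trim).toDF("record")                 // a custom .map() would parse fields here
{code}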






[jira] [Assigned] (SPARK-23563) make the size of cache in CodeGenerator configurable

2018-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23563:


Assignee: (was: Apache Spark)

> make the size of cache in CodeGenerator configurable
> --
>
> Key: SPARK-23563
> URL: https://issues.apache.org/jira/browse/SPARK-23563
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: kejiqing
>Priority: Minor
>
> The cache in class 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator has a 
> hard-coded maximumSize of 100; the current code is:
>  
> {code:java}
> // scala
> private val cache = CacheBuilder.newBuilder()
>   .maximumSize(100)
>   .build(
> new CacheLoader[CodeAndComment, (GeneratedClass, Int)]() {
>   override def load(code: CodeAndComment): (GeneratedClass, Int) = {
> val startTime = System.nanoTime()
> val result = doCompile(code)
> val endTime = System.nanoTime()
> def timeMs: Double = (endTime - startTime).toDouble / 100
> CodegenMetrics.METRIC_SOURCE_CODE_SIZE.update(code.body.length)
> CodegenMetrics.METRIC_COMPILATION_TIME.update(timeMs.toLong)
> logInfo(s"Code generated in $timeMs ms")
> result
>   }
> })
> {code}
>  In some specific situations, for example a long-running application whose 
> Spark tasks are unchanged, making the cache's maximumSize configurable is a 
> better idea.
>  
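A minimal sketch of the proposed direction (the configuration key below is illustrative,
not a setting that exists at the time of this issue):

{code}
// Take the cache size from configuration instead of the hard-coded 100.
import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache}
import org.apache.spark.SparkConf

def buildCompileCache(conf: SparkConf): LoadingCache[String, String] = {
  val maxEntries = conf.getInt("spark.sql.codegen.cache.maxEntries", 100)  // default keeps today's behavior
  CacheBuilder.newBuilder()
    .maximumSize(maxEntries)                    // was: .maximumSize(100)
    .build(new CacheLoader[String, String]() {  // String stands in for CodeAndComment / GeneratedClass
      override def load(source: String): String = s"compiled:$source"      // placeholder for doCompile
    })
}
{code}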






[jira] [Assigned] (SPARK-23563) make the size of cache in CodeGenerator configurable

2018-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23563:


Assignee: Apache Spark

> make the size of cache in CodeGenerator configurable
> --
>
> Key: SPARK-23563
> URL: https://issues.apache.org/jira/browse/SPARK-23563
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: kejiqing
>Assignee: Apache Spark
>Priority: Minor
>
> The cache in class 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator has a 
> hard-coded maximumSize of 100; the current code is:
>  
> {code:java}
> // scala
> private val cache = CacheBuilder.newBuilder()
>   .maximumSize(100)
>   .build(
> new CacheLoader[CodeAndComment, (GeneratedClass, Int)]() {
>   override def load(code: CodeAndComment): (GeneratedClass, Int) = {
> val startTime = System.nanoTime()
> val result = doCompile(code)
> val endTime = System.nanoTime()
> def timeMs: Double = (endTime - startTime).toDouble / 100
> CodegenMetrics.METRIC_SOURCE_CODE_SIZE.update(code.body.length)
> CodegenMetrics.METRIC_COMPILATION_TIME.update(timeMs.toLong)
> logInfo(s"Code generated in $timeMs ms")
> result
>   }
> })
> {code}
>  In some specific situations, for example a long-running application whose 
> Spark tasks are unchanged, making the cache's maximumSize configurable is a 
> better idea.
>  






[jira] [Commented] (SPARK-23563) make the size of cache in CodeGenerator configurable

2018-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383129#comment-16383129
 ] 

Apache Spark commented on SPARK-23563:
--

User 'passionke' has created a pull request for this issue:
https://github.com/apache/spark/pull/20712

> make the size of cache in CodeGenerator configurable
> --
>
> Key: SPARK-23563
> URL: https://issues.apache.org/jira/browse/SPARK-23563
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: kejiqing
>Priority: Minor
>
> The cache in class 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator has a 
> hard-coded maximumSize of 100; the current code is:
>  
> {code:java}
> // scala
> private val cache = CacheBuilder.newBuilder()
>   .maximumSize(100)
>   .build(
> new CacheLoader[CodeAndComment, (GeneratedClass, Int)]() {
>   override def load(code: CodeAndComment): (GeneratedClass, Int) = {
> val startTime = System.nanoTime()
> val result = doCompile(code)
> val endTime = System.nanoTime()
> def timeMs: Double = (endTime - startTime).toDouble / 100
> CodegenMetrics.METRIC_SOURCE_CODE_SIZE.update(code.body.length)
> CodegenMetrics.METRIC_COMPILATION_TIME.update(timeMs.toLong)
> logInfo(s"Code generated in $timeMs ms")
> result
>   }
> })
> {code}
>  In some specific situations, for example a long-running application whose 
> Spark tasks are unchanged, making the cache's maximumSize configurable is a 
> better idea.
>  






[jira] [Updated] (SPARK-23542) The exists action should be further optimized in logical plan

2018-03-01 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-23542:
--
Description: 
The optimized logical plan of query '*select * from tt1 where exists (select *  
from tt2  where tt1.i = tt2.i)*' is :
{code:java}
== Optimized Logical Plan ==
Join LeftSemi, (i#14 = i#16)
:- HiveTableRelation `default`.`tt1`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15]
+- Project [i#16]
+- HiveTableRelation `default`.`tt2`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code}
 

The `exists` action will be rewritten as a left semi join. But in the query 
`*select * from tt1 left semi join tt2 on tt2.i = tt1.i*`, the optimized 
logical plan is:
{noformat}
== Optimized Logical Plan ==
Join LeftSemi, (i#22 = i#20)
:- Filter isnotnull(i#20)
: +- HiveTableRelation `default`.`tt1`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21]
+- Project [i#22]
+- HiveTableRelation `default`.`tt2`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]{noformat}
 

 So I think the optimized logical plan of '*select * from tt1 where exists 
(select * from tt2 where tt1.i = tt2.i)*' should be further optimized to:
{code:java}
== Optimized Logical Plan ==
Join LeftSemi, (i#14 = i#16)
:- Filter isnotnull(i#20)
: +- HiveTableRelation `default`.`tt1`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15]
+- Project [i#16]
+- HiveTableRelation `default`.`tt2`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code}
 

  was:
The optimized logical plan of query '*select * from tt1 where exists (select *  
from tt2  where tt1.i = tt2.i)*' is :
{code:java}
== Optimized Logical Plan ==
Join LeftSemi, (i#14 = i#16)
:- HiveTableRelation `default`.`tt1`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15]
+- Project [i#16]
+- HiveTableRelation `default`.`tt2`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code}
 

But the optimized logical plan of the query `*select * from tt1 left semi join tt2 on tt2.i = tt1.i*` is:
{noformat}
== Optimized Logical Plan ==
Join LeftSemi, (i#22 = i#20)
:- Filter isnotnull(i#20)
: +- HiveTableRelation `default`.`tt1`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21]
+- Project [i#22]
+- HiveTableRelation `default`.`tt2`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]{noformat}
 

 So I think the optimized logical plan of '*select * from tt1 where exists 
(select * from tt2 where tt1.i = tt2.i)*' should be further optimized to:
{code:java}
== Optimized Logical Plan ==
Join LeftSemi, (i#14 = i#16)
:- Filter isnotnull(i#20)
: +- HiveTableRelation `default`.`tt1`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15]
+- Project [i#16]
+- HiveTableRelation `default`.`tt2`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code}
 


> The exists action should be further optimized in logical plan
> -
>
> Key: SPARK-23542
> URL: https://issues.apache.org/jira/browse/SPARK-23542
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: KaiXinXIaoLei
>Priority: Major
>
> The optimized logical plan of query '*select * from tt1 where exists (select 
> *  from tt2  where tt1.i = tt2.i)*' is :
> {code:java}
> == Optimized Logical Plan ==
> Join LeftSemi, (i#14 = i#16)
> :- HiveTableRelation `default`.`tt1`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15]
> +- Project [i#16]
> +- HiveTableRelation `default`.`tt2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code}
>  
> The `exists` action will be rewritten as a left semi join. But in the query 
> `*select * from tt1 left semi join tt2 on tt2.i = tt1.i*`, the optimized 
> logical plan is:
> {noformat}
> == Optimized Logical Plan ==
> Join LeftSemi, (i#22 = i#20)
> :- Filter isnotnull(i#20)
> : +- HiveTableRelation `default`.`tt1`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21]
> +- Project [i#22]
> +- HiveTableRelation `default`.`tt2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]{noformat}
>  
>  So I think the optimized logical plan of '*select * from tt1 where exists 
> (select * from tt2 where tt1.i = tt2.i)*' should be further optimized to:
> {code:java}
> == Optimized Logical Plan ==
> Join LeftSemi, (i#14 = i#16)
> :- Filter isnotnull(i#20)
> : +- HiveTableRelation `default`.`tt1`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15]
> +- Project [i#16]
> +- HiveTableRelation `default`.`tt2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code}
>  
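To reproduce the plans above, a small sketch (Hive support enabled; column types
inferred from the plans):

{code}
spark.sql("CREATE TABLE IF NOT EXISTS tt1 (i INT, s STRING)")
spark.sql("CREATE TABLE IF NOT EXISTS tt2 (i INT, s STRING)")
spark.sql("SELECT * FROM tt1 WHERE EXISTS (SELECT * FROM tt2 WHERE tt1.i = tt2.i)").explain(true)
spark.sql("SELECT * FROM tt1 LEFT SEMI JOIN tt2 ON tt2.i = tt1.i").explain(true)
{code}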





[jira] [Updated] (SPARK-23542) The exists action should be further optimized in logical plan

2018-03-01 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-23542:
--
Summary: The exists action should be further optimized in logical plan  
(was: The `where exists' action in optimized logical plan should be optimized )

> The exists action should be further optimized in logical plan
> -
>
> Key: SPARK-23542
> URL: https://issues.apache.org/jira/browse/SPARK-23542
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: KaiXinXIaoLei
>Priority: Major
>
> The optimized logical plan of query '*select * from tt1 where exists (select 
> *  from tt2  where tt1.i = tt2.i)*' is :
> {code:java}
> == Optimized Logical Plan ==
> Join LeftSemi, (i#14 = i#16)
> :- HiveTableRelation `default`.`tt1`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15]
> +- Project [i#16]
> +- HiveTableRelation `default`.`tt2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code}
>  
> But the optimized logical plan of the query `*select * from tt1 left semi 
> join tt2 on tt2.i = tt1.i*` is:
> {noformat}
> == Optimized Logical Plan ==
> Join LeftSemi, (i#22 = i#20)
> :- Filter isnotnull(i#20)
> : +- HiveTableRelation `default`.`tt1`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21]
> +- Project [i#22]
> +- HiveTableRelation `default`.`tt2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]{noformat}
>  
>  So I think the optimized logical plan of '*select * from tt1 where exists 
> (select * from tt2 where tt1.i = tt2.i)*' should be further optimized to:
> {code:java}
> == Optimized Logical Plan ==
> Join LeftSemi, (i#14 = i#16)
> :- Filter isnotnull(i#20)
> : +- HiveTableRelation `default`.`tt1`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15]
> +- Project [i#16]
> +- HiveTableRelation `default`.`tt2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code}
>  






[jira] [Commented] (SPARK-23563) make the size of cache in CodeGenerator configurable

2018-03-01 Thread kejiqing (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383005#comment-16383005
 ] 

kejiqing commented on SPARK-23563:
--

For a long-running Spark SQL task, the metaspace in the driver will grow without 
bound and lead to frequent full GCs.

A suitable cache size can slow down the rate at which new classes are generated 
in the JVM and raise the cache hit ratio.

> make the size of cache in CodeGenerator configurable
> --
>
> Key: SPARK-23563
> URL: https://issues.apache.org/jira/browse/SPARK-23563
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: kejiqing
>Priority: Minor
>
> The cache in class 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator has a 
> hard-coded maximumSize of 100; the current code is:
>  
> {code:java}
> // scala
> private val cache = CacheBuilder.newBuilder()
>   .maximumSize(100)
>   .build(
> new CacheLoader[CodeAndComment, (GeneratedClass, Int)]() {
>   override def load(code: CodeAndComment): (GeneratedClass, Int) = {
> val startTime = System.nanoTime()
> val result = doCompile(code)
> val endTime = System.nanoTime()
> def timeMs: Double = (endTime - startTime).toDouble / 100
> CodegenMetrics.METRIC_SOURCE_CODE_SIZE.update(code.body.length)
> CodegenMetrics.METRIC_COMPILATION_TIME.update(timeMs.toLong)
> logInfo(s"Code generated in $timeMs ms")
> result
>   }
> })
> {code}
>  In some specific situations, for example a long-running application whose 
> Spark tasks are unchanged, making the cache's maximumSize configurable is a 
> better idea.
>  






[jira] [Commented] (SPARK-23552) Dataset.withColumn does not allow overriding of a struct field

2018-03-01 Thread David Capwell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383002#comment-16383002
 ] 

David Capwell commented on SPARK-23552:
---

This also appears to be an issue with drop.

> Dataset.withColumn does not allow overriding of a struct field
> --
>
> Key: SPARK-23552
> URL: https://issues.apache.org/jira/browse/SPARK-23552
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: David Capwell
>Priority: Major
>
> We have a Dataset with a schema such as the following: 
>  
>  
> {code:java}
> struct<foo: struct<bar: ...>>
> {code}
>  
>  
> If we do the following, one would expect to override the type of bar, but 
> instead a new column gets added:
>  
>  
> {code:java}
> ds.withColumn("foo.bar", ...){code}
>  
> This produces the following schema
>  
> {code:java}
> struct<foo: struct<bar: ...>, foo.bar: ...>{code}
>  
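A common workaround (a sketch using the `ds` from the report; the field names `bar`/`baz`
and the replacement value are illustrative) is to rebuild the whole struct rather than
addressing the nested field by its dotted name; newer Spark releases also offer
`Column.withField` for this.

{code}
// Overwrite a nested struct field by reconstructing the struct.
import org.apache.spark.sql.functions.{col, lit, struct}

val patched = ds.withColumn(
  "foo",
  struct(
    lit(42).as("bar"),         // the field we meant to override
    col("foo.baz").as("baz")   // every other field has to be carried over explicitly
  )
)
{code}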






[jira] [Updated] (SPARK-23563) make the size of cache in CodeGenerator configurable

2018-03-01 Thread kejiqing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kejiqing updated SPARK-23563:
-
Summary: make the size of cache in CodeGenerator configurable  (was: make 
size of cache in CodeGenerator configurable)

> make the size of cache in CodeGenerator configurable
> --
>
> Key: SPARK-23563
> URL: https://issues.apache.org/jira/browse/SPARK-23563
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: kejiqing
>Priority: Minor
>
> The cache in class 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator has a 
> hard-coded maximumSize of 100; the current code is:
>  
> {code:java}
> // scala
> private val cache = CacheBuilder.newBuilder()
>   .maximumSize(100)
>   .build(
> new CacheLoader[CodeAndComment, (GeneratedClass, Int)]() {
>   override def load(code: CodeAndComment): (GeneratedClass, Int) = {
> val startTime = System.nanoTime()
> val result = doCompile(code)
> val endTime = System.nanoTime()
> def timeMs: Double = (endTime - startTime).toDouble / 100
> CodegenMetrics.METRIC_SOURCE_CODE_SIZE.update(code.body.length)
> CodegenMetrics.METRIC_COMPILATION_TIME.update(timeMs.toLong)
> logInfo(s"Code generated in $timeMs ms")
> result
>   }
> })
> {code}
>  In some specific situations, for example a long-running application whose 
> Spark tasks are unchanged, making the cache's maximumSize configurable is a 
> better idea.
>  






[jira] [Created] (SPARK-23563) make size of cache in CodeGenerator configurable

2018-03-01 Thread kejiqing (JIRA)
kejiqing created SPARK-23563:


 Summary: make size of cache in CodeGenerator configurable
 Key: SPARK-23563
 URL: https://issues.apache.org/jira/browse/SPARK-23563
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: kejiqing


The cache in class 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator has a 
hard-coded maximumSize of 100; the current code is:

 
{code:java}
// scala
private val cache = CacheBuilder.newBuilder()
  .maximumSize(100)
  .build(
new CacheLoader[CodeAndComment, (GeneratedClass, Int)]() {
  override def load(code: CodeAndComment): (GeneratedClass, Int) = {
val startTime = System.nanoTime()
val result = doCompile(code)
val endTime = System.nanoTime()
def timeMs: Double = (endTime - startTime).toDouble / 100
CodegenMetrics.METRIC_SOURCE_CODE_SIZE.update(code.body.length)
CodegenMetrics.METRIC_COMPILATION_TIME.update(timeMs.toLong)
logInfo(s"Code generated in $timeMs ms")
result
  }
})
{code}
 In some specific situations, for example a long-running application whose 
Spark tasks are unchanged, making the cache's maximumSize configurable is a 
better idea.

 






[jira] [Resolved] (SPARK-23551) Exclude `hadoop-mapreduce-client-core` dependency from `orc-mapreduce`

2018-03-01 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-23551.

   Resolution: Fixed
Fix Version/s: 2.3.1
   2.4.0

Issue resolved by pull request 20704
[https://github.com/apache/spark/pull/20704]

> Exclude `hadoop-mapreduce-client-core` dependency from `orc-mapreduce`
> --
>
> Key: SPARK-23551
> URL: https://issues.apache.org/jira/browse/SPARK-23551
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.4.0, 2.3.1
>
>
> This issue aims to prevent `orc-mapreduce` dependency from making IDEs and 
> maven confused.
> *BEFORE*
> Please note the 2.6.4 version under Spark Project SQL below.
> {code}
> $ mvn dependency:tree -Phadoop-2.7 
> -Dincludes=org.apache.hadoop:hadoop-mapreduce-client-core
> ...
> [INFO] 
> 
> [INFO] Building Spark Project Catalyst 2.4.0-SNAPSHOT
> [INFO] 
> 
> [INFO]
> [INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ 
> spark-catalyst_2.11 ---
> [INFO] org.apache.spark:spark-catalyst_2.11:jar:2.4.0-SNAPSHOT
> [INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
> [INFO]\- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
> [INFO]   \- 
> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
> [INFO]
> [INFO] 
> 
> [INFO] Building Spark Project SQL 2.4.0-SNAPSHOT
> [INFO] 
> 
> [INFO]
> [INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ spark-sql_2.11 
> ---
> [INFO] org.apache.spark:spark-sql_2.11:jar:2.4.0-SNAPSHOT
> [INFO] \- org.apache.orc:orc-mapreduce:jar:nohive:1.4.3:compile
> [INFO]\- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.6.4:compile
> {code}
> *AFTER*
> {code}
> $ mvn dependency:tree -Phadoop-2.7 
> -Dincludes=org.apache.hadoop:hadoop-mapreduce-client-core
> ...
> [INFO] 
> 
> [INFO] Building Spark Project Catalyst 2.4.0-SNAPSHOT
> [INFO] 
> 
> [INFO]
> [INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ 
> spark-catalyst_2.11 ---
> [INFO] org.apache.spark:spark-catalyst_2.11:jar:2.4.0-SNAPSHOT
> [INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
> [INFO]\- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
> [INFO]   \- 
> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
> [INFO]
> [INFO] 
> 
> [INFO] Building Spark Project SQL 2.4.0-SNAPSHOT
> [INFO] 
> 
> [INFO]
> [INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ spark-sql_2.11 
> ---
> [INFO] org.apache.spark:spark-sql_2.11:jar:2.4.0-SNAPSHOT
> [INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
> [INFO]\- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
> [INFO]   \- 
> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
> {code}






[jira] [Created] (SPARK-23562) RFormula handleInvalid should handle invalid values in non-string columns.

2018-03-01 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-23562:
---

 Summary: RFormula handleInvalid should handle invalid values in 
non-string columns.
 Key: SPARK-23562
 URL: https://issues.apache.org/jira/browse/SPARK-23562
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.3.0
Reporter: Bago Amirbekian


Currently when handleInvalid is set to 'keep' or 'skip' this only applies to 
String fields. Numeric fields that are null will either cause the transformer 
to fail or might be null in the resulting label column.

I'm not sure what the semantics of keep might be for numeric columns with null 
values, but we should be able to at least support skip for these types.
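A small sketch of the gap being described (spark-shell style; column names and values
are illustrative, and the outcome for the null numeric feature is as reported above):

{code}
// handleInvalid currently guards the string column, not the null numeric feature.
import org.apache.spark.ml.feature.RFormula
import spark.implicits._

val df = Seq(
  (Some(1.0), "a", 1.0),
  (None,      "b", 0.0)        // null numeric feature
).toDF("x", "cat", "label")

val formula = new RFormula()
  .setFormula("label ~ x + cat")
  .setHandleInvalid("skip")    // today this only applies to the string column "cat"

formula.fit(df).transform(df).show()
{code}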






[jira] [Assigned] (SPARK-23551) Exclude `hadoop-mapreduce-client-core` dependency from `orc-mapreduce`

2018-03-01 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-23551:
--

Assignee: Dongjoon Hyun

> Exclude `hadoop-mapreduce-client-core` dependency from `orc-mapreduce`
> --
>
> Key: SPARK-23551
> URL: https://issues.apache.org/jira/browse/SPARK-23551
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.3.1, 2.4.0
>
>
> This issue aims to prevent `orc-mapreduce` dependency from making IDEs and 
> maven confused.
> *BEFORE*
> Please note the 2.6.4 version under Spark Project SQL below.
> {code}
> $ mvn dependency:tree -Phadoop-2.7 
> -Dincludes=org.apache.hadoop:hadoop-mapreduce-client-core
> ...
> [INFO] 
> 
> [INFO] Building Spark Project Catalyst 2.4.0-SNAPSHOT
> [INFO] 
> 
> [INFO]
> [INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ 
> spark-catalyst_2.11 ---
> [INFO] org.apache.spark:spark-catalyst_2.11:jar:2.4.0-SNAPSHOT
> [INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
> [INFO]\- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
> [INFO]   \- 
> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
> [INFO]
> [INFO] 
> 
> [INFO] Building Spark Project SQL 2.4.0-SNAPSHOT
> [INFO] 
> 
> [INFO]
> [INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ spark-sql_2.11 
> ---
> [INFO] org.apache.spark:spark-sql_2.11:jar:2.4.0-SNAPSHOT
> [INFO] \- org.apache.orc:orc-mapreduce:jar:nohive:1.4.3:compile
> [INFO]\- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.6.4:compile
> {code}
> *AFTER*
> {code}
> $ mvn dependency:tree -Phadoop-2.7 
> -Dincludes=org.apache.hadoop:hadoop-mapreduce-client-core
> ...
> [INFO] 
> 
> [INFO] Building Spark Project Catalyst 2.4.0-SNAPSHOT
> [INFO] 
> 
> [INFO]
> [INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ 
> spark-catalyst_2.11 ---
> [INFO] org.apache.spark:spark-catalyst_2.11:jar:2.4.0-SNAPSHOT
> [INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
> [INFO]\- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
> [INFO]   \- 
> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
> [INFO]
> [INFO] 
> 
> [INFO] Building Spark Project SQL 2.4.0-SNAPSHOT
> [INFO] 
> 
> [INFO]
> [INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ spark-sql_2.11 
> ---
> [INFO] org.apache.spark:spark-sql_2.11:jar:2.4.0-SNAPSHOT
> [INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
> [INFO]\- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
> [INFO]   \- 
> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
> {code}






[jira] [Commented] (SPARK-19181) SparkListenerSuite.local metrics fails when average executorDeserializeTime is too short.

2018-03-01 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382963#comment-16382963
 ] 

Marcelo Vanzin commented on SPARK-19181:


Another failure (after quite some time):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87855/testReport/junit/org.apache.spark.scheduler/SparkListenerSuite/local_metrics/

> SparkListenerSuite.local metrics fails when average executorDeserializeTime 
> is too short.
> -
>
> Key: SPARK-19181
> URL: https://issues.apache.org/jira/browse/SPARK-19181
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Jose Soltren
>Priority: Minor
>
> https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala#L249
> The "local metrics" test asserts that tasks should take more than 1ms on 
> average to complete, even though a code comment notes that this is a small 
> test and tasks may finish faster. I've been seeing some "failures" here on 
> fast systems that finish these tasks quite quickly.
> There are a few ways forward here:
> 1. Disable this test.
> 2. Relax this check.
> 3. Implement sub-millisecond granularity for task times throughout Spark.
> 4. (Imran Rashid's suggestion) Add buffer time by, say, having the task 
> reference a partition that implements a custom Externalizable.readExternal, 
> which always waits 1ms before returning.
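Option 4 could look roughly like this (a sketch only):

{code}
// A partition whose deserialization always takes at least ~1 ms, so the averaged
// executorDeserializeTime in the "local metrics" test cannot round down to zero.
import java.io.{Externalizable, ObjectInput, ObjectOutput}
import org.apache.spark.Partition

class SlowDeserializingPartition(private var i: Int) extends Partition with Externalizable {
  def this() = this(0)                                  // no-arg constructor required by Externalizable
  override def index: Int = i
  override def writeExternal(out: ObjectOutput): Unit = out.writeInt(i)
  override def readExternal(in: ObjectInput): Unit = {
    Thread.sleep(1)                                     // guarantee >= 1 ms of deserialize time
    i = in.readInt()
  }
}
{code}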






[jira] [Commented] (SPARK-23534) Spark run on Hadoop 3.0.0

2018-03-01 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382932#comment-16382932
 ] 

Saisai Shao commented on SPARK-23534:
-

One issue is that Hive 1.2.1.spark2 rejects Hadoop 3 (SPARK-18673).

> Spark run on Hadoop 3.0.0
> -
>
> Key: SPARK-23534
> URL: https://issues.apache.org/jira/browse/SPARK-23534
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Major
>
> Major Hadoop vendors have already stepped, or will soon step, into Hadoop 3.0, 
> so we should also make sure Spark can run with Hadoop 3.0. This JIRA tracks 
> the work to make Spark run on Hadoop 3.0.
> The work includes:
>  # Add a Hadoop 3.0.0 new profile to make Spark build-able with Hadoop 3.0.
>  # Test to see if there's dependency issues with Hadoop 3.0.
>  # Investigating the feasibility to use shaded client jars (HADOOP-11804).






[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version

2018-03-01 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382929#comment-16382929
 ] 

Saisai Shao commented on SPARK-18673:
-

Spark itself uses its own Hive (1.2.1.spark2); I think we need to port 
HIVE-15016 to this Hive (1.2.1.spark2) to make it work with Hadoop 3.

[~aihuaxu], is HIVE-15016 the only fix needed to support Hadoop 3.x, or are 
there follow-up fixes?

> Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
> --
>
> Key: SPARK-18673
> URL: https://issues.apache.org/jira/browse/SPARK-18673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT 
>Reporter: Steve Loughran
>Priority: Major
>
> Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader 
> considers 3.x to be an unknown Hadoop version.
> Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it 
> will need to be updated to match.






[jira] [Updated] (SPARK-23498) Accuracy problem in comparison with string and integer

2018-03-01 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23498:
--
Target Version/s:   (was: 2.3.1)

> Accuracy problem in comparison with string and integer
> --
>
> Key: SPARK-23498
> URL: https://issues.apache.org/jira/browse/SPARK-23498
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
>Reporter: Kevin Zhang
>Priority: Major
>
> While comparing a string column with an integer value, Spark SQL will 
> automatically cast the string operand to int, so the following SQL will 
> return true in Hive but false in Spark:
>  
> {code:java}
> select '1000.1'>1000
> {code}
>  
>  From the physical plan we can see the string operand was cast to int, which 
> caused the accuracy loss:
> {code:java}
> *Project [false AS (CAST(1000.1 AS INT) > 1000)#4]
> +- Scan OneRowRelation[]
> {code}
> To solve it, casting both operands of a binary operator to a wider common 
> type like double may be safe.
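For illustration, the behavior and the explicit-cast workaround in a spark-shell
session (sketch):

{code}
// The implicit cast narrows the string to int; widening explicitly avoids the precision loss.
spark.sql("SELECT '1000.1' > 1000").show()                  // false in Spark: '1000.1' is cast to int
spark.sql("SELECT CAST('1000.1' AS DOUBLE) > 1000").show()  // true: compare on a wider common type
{code}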






[jira] [Comment Edited] (SPARK-23498) Accuracy problem in comparison with string and integer

2018-03-01 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382912#comment-16382912
 ] 

Dongjoon Hyun edited comment on SPARK-23498 at 3/2/18 12:12 AM:


[~KevinZwx]. Did you see HIVE-17186? Hive doesn't give you the correct result. 
For the following TPC-H query, it will exclude '0.07'.
{code}
select l_discount from lineitem where l_discount between 0.06 - 0.01 and 0.06 + 
0.01
{code}
I'm not disagreeing with this issue. I want to give you a well-known example 
where you cannot trust the Hive result.


was (Author: dongjoon):
[~KevinZwx]. Did you see HIVE-17186? Hive doesn't give you the correct result. 
For the following TPC-H query, it will exclude '0.07'.
{code}
select l_discount from lineitem where l_discount between 0.06 - 0.01 and 0.06 + 
0.01
{code}

> Accuracy problem in comparison with string and integer
> --
>
> Key: SPARK-23498
> URL: https://issues.apache.org/jira/browse/SPARK-23498
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
>Reporter: Kevin Zhang
>Priority: Major
>
> While comparing a string column with an integer value, Spark SQL will 
> automatically cast the string operand to int, so the following SQL will 
> return true in Hive but false in Spark:
>  
> {code:java}
> select '1000.1'>1000
> {code}
>  
>  From the physical plan we can see the string operand was cast to int, which 
> caused the accuracy loss:
> {code:java}
> *Project [false AS (CAST(1000.1 AS INT) > 1000)#4]
> +- Scan OneRowRelation[]
> {code}
> To solve it, casting both operands of a binary operator to a wider common 
> type like double may be safe.






[jira] [Assigned] (SPARK-23559) add epoch ID to data writer factory

2018-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23559:


Assignee: (was: Apache Spark)

> add epoch ID to data writer factory
> ---
>
> Key: SPARK-23559
> URL: https://issues.apache.org/jira/browse/SPARK-23559
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> To support the StreamWriter lifecycle described in SPARK-22910, epoch ID has 
> to be specifiable at DataWriter creation.






[jira] [Assigned] (SPARK-23559) add epoch ID to data writer factory

2018-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23559:


Assignee: Apache Spark

> add epoch ID to data writer factory
> ---
>
> Key: SPARK-23559
> URL: https://issues.apache.org/jira/browse/SPARK-23559
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Apache Spark
>Priority: Major
>
> To support the StreamWriter lifecycle described in SPARK-22910, epoch ID has 
> to be specifiable at DataWriter creation.






[jira] [Commented] (SPARK-23498) Accuracy problem in comparison with string and integer

2018-03-01 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382912#comment-16382912
 ] 

Dongjoon Hyun commented on SPARK-23498:
---

[~KevinZwx]. Did you see HIVE-17186? Hive doesn't give you the correct result. 
For the following TPC-H query, it will exclude '0.07'.
{code}
select l_discount from lineitem where l_discount between 0.06 - 0.01 and 0.06 + 
0.01
{code}

> Accuracy problem in comparison with string and integer
> --
>
> Key: SPARK-23498
> URL: https://issues.apache.org/jira/browse/SPARK-23498
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
>Reporter: Kevin Zhang
>Priority: Major
>
> While comparing a string column with an integer value, Spark SQL will 
> automatically cast the string operand to int, so the following SQL will 
> return true in Hive but false in Spark:
>  
> {code:java}
> select '1000.1'>1000
> {code}
>  
>  From the physical plan we can see the string operand was cast to int, which 
> caused the accuracy loss:
> {code:java}
> *Project [false AS (CAST(1000.1 AS INT) > 1000)#4]
> +- Scan OneRowRelation[]
> {code}
> To solve it, casting both operands of a binary operator to a wider common 
> type like double may be safe.






[jira] [Commented] (SPARK-23559) add epoch ID to data writer factory

2018-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382911#comment-16382911
 ] 

Apache Spark commented on SPARK-23559:
--

User 'jose-torres' has created a pull request for this issue:
https://github.com/apache/spark/pull/20710

> add epoch ID to data writer factory
> ---
>
> Key: SPARK-23559
> URL: https://issues.apache.org/jira/browse/SPARK-23559
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> To support the StreamWriter lifecycle described in SPARK-22910, epoch ID has 
> to be specifiable at DataWriter creation.






[jira] [Created] (SPARK-23561) make StreamWriter not a DataSourceWriter subclass

2018-03-01 Thread Jose Torres (JIRA)
Jose Torres created SPARK-23561:
---

 Summary: make StreamWriter not a DataSourceWriter subclass
 Key: SPARK-23561
 URL: https://issues.apache.org/jira/browse/SPARK-23561
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Jose Torres


The inheritance makes little sense now; they've almost entirely diverged.






[jira] [Created] (SPARK-23560) A joinWith followed by groupBy requires extra shuffle

2018-03-01 Thread Bruce Robbins (JIRA)
Bruce Robbins created SPARK-23560:
-

 Summary: A joinWith followed by groupBy requires extra shuffle
 Key: SPARK-23560
 URL: https://issues.apache.org/jira/browse/SPARK-23560
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
 Environment: Debian 8.9, macOS High Sierra
Reporter: Bruce Robbins


Depending on the size of the input, a joinWith followed by a groupBy requires 
more shuffles than a join followed by a groupBy.

For example, here's a joinWith on two CSV files, followed by a groupBy:
{noformat}
import org.apache.spark.sql.types._
val schema = StructType(StructField("id1", LongType) :: StructField("id2", LongType) :: Nil)

val df1 = spark.read.schema(schema).csv("ds1.csv")
val df2 = spark.read.schema(schema).csv("ds2.csv")

val result1 = df1.joinWith(df2, df1.col("id1") === df2.col("id2")).groupBy("_1.id1").count

result1.explain
== Physical Plan ==
*(6) HashAggregate(keys=[_1#8.id1#19L], functions=[count(1)])
+- Exchange hashpartitioning(_1#8.id1#19L, 200)
   +- *(5) HashAggregate(keys=[_1#8.id1 AS _1#8.id1#19L], functions=[partial_count(1)])
      +- *(5) Project [_1#8]
         +- *(5) SortMergeJoin [_1#8.id1], [_2#9.id2], Inner
            :- *(2) Sort [_1#8.id1 ASC NULLS FIRST], false, 0
            :  +- Exchange hashpartitioning(_1#8.id1, 200)
            :     +- *(1) Project [named_struct(id1, id1#0L, id2, id2#1L) AS _1#8]
            :        +- *(1) FileScan csv [id1#0L,id2#1L] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:.../ds1.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct
            +- *(4) Sort [_2#9.id2 ASC NULLS FIRST], false, 0
               +- Exchange hashpartitioning(_2#9.id2, 200)
                  +- *(3) Project [named_struct(id1, id1#4L, id2, id2#5L) AS _2#9]
                     +- *(3) FileScan csv [id1#4L,id2#5L] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:...ds2.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct
{noformat}
Using join, there is one less shuffle:
{noformat}
val result2 = df1.join(df2, df1.col("id1") === df2.col("id2")).groupBy(df1("id1")).count

result2.explain
== Physical Plan ==
*(5) HashAggregate(keys=[id1#0L], functions=[count(1)])
+- *(5) HashAggregate(keys=[id1#0L], functions=[partial_count(1)])
   +- *(5) Project [id1#0L]
      +- *(5) SortMergeJoin [id1#0L], [id2#5L], Inner
         :- *(2) Sort [id1#0L ASC NULLS FIRST], false, 0
         :  +- Exchange hashpartitioning(id1#0L, 200)
         :     +- *(1) Project [id1#0L]
         :        +- *(1) Filter isnotnull(id1#0L)
         :           +- *(1) FileScan csv [id1#0L] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:.../ds1.csv], PartitionFilters: [], PushedFilters: [IsNotNull(id1)], ReadSchema: struct
         +- *(4) Sort [id2#5L ASC NULLS FIRST], false, 0
            +- Exchange hashpartitioning(id2#5L, 200)
               +- *(3) Project [id2#5L]
                  +- *(3) Filter isnotnull(id2#5L)
                     +- *(3) FileScan csv [id2#5L] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:...ds2.csv], PartitionFilters: [], PushedFilters: [IsNotNull(id2)], ReadSchema: struct
{noformat}
The extra exchange is reflected in the run time of the query.

My tests were on inputs with more than 2 million records.
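For anyone reproducing this, a sketch of generating comparable inputs (the original ds1.csv/ds2.csv are not attached; sizes and values below are illustrative):
{code:scala}
// Two ~2M-row CSV inputs matching the (id1, id2) schema used above.
spark.range(0, 2000000).selectExpr("id as id1", "id as id2")
  .write.mode("overwrite").csv("ds1.csv")
spark.range(0, 2000000).selectExpr("id + 1 as id1", "id as id2")
  .write.mode("overwrite").csv("ds2.csv")
{code}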






[jira] [Commented] (SPARK-18630) PySpark ML memory leak

2018-03-01 Thread yogesh garg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382886#comment-16382886
 ] 

yogesh garg commented on SPARK-18630:
-

After some discussion, I think it makes sense to move just the __del__ method 
to JavaWrapper and keep the copy method in JavaParams. The code also needs some 
testing.

> PySpark ML memory leak
> --
>
> Key: SPARK-18630
> URL: https://issues.apache.org/jira/browse/SPARK-18630
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Minor
>
> After SPARK-18274 is fixed by https://github.com/apache/spark/pull/15843, it 
> would be good to follow up and address the potential memory leak for all 
> items handled by the `JavaWrapper`, not just `JavaParams`.






[jira] [Created] (SPARK-23559) add epoch ID to data writer factory

2018-03-01 Thread Jose Torres (JIRA)
Jose Torres created SPARK-23559:
---

 Summary: add epoch ID to data writer factory
 Key: SPARK-23559
 URL: https://issues.apache.org/jira/browse/SPARK-23559
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Jose Torres


To support the StreamWriter lifecycle described in SPARK-22910, epoch ID has to 
be specifiable at DataWriter creation.
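A sketch of why the epoch has to be known at creation time (an illustrative file sink, not the real DataSourceV2 API; names and path layout are assumptions): making the output path deterministic per (epoch, partition) lets a re-executed epoch overwrite its output instead of duplicating it.
{code:scala}
// Hypothetical factory; the interface and path layout are assumptions.
class EpochFileWriterFactory(dir: String) extends Serializable {
  def createDataWriter(partitionId: Int, attemptNumber: Int, epochId: Long): java.io.PrintWriter = {
    val epochDir = new java.io.File(s"$dir/epoch=$epochId")
    epochDir.mkdirs()
    // One deterministic file per (epoch, partition): replays overwrite rather than append.
    new java.io.PrintWriter(new java.io.File(epochDir, s"part-$partitionId"))
  }
}
{code}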






[jira] [Created] (SPARK-23558) clean up StreamWriter factory lifecycle

2018-03-01 Thread Jose Torres (JIRA)
Jose Torres created SPARK-23558:
---

 Summary: clean up StreamWriter factory lifecycle
 Key: SPARK-23558
 URL: https://issues.apache.org/jira/browse/SPARK-23558
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Jose Torres


Right now, StreamWriter and children have different lifecycles in continuous 
processing and microbatch mode. Both execution modes impose significant 
constraints on what that lifecycle must be, so the achievable consistent 
semantics are:
 * StreamWriter lasts for the duration of the query execution
 * DataWriterFactory lasts for the duration of the query execution
 * DataWriter (the task-level writer) has a lifecycle tied to each individual 
epoch

This also allows us to restore the implicit semantic that 
DataWriter.commit()/abort() terminates the lifecycle.
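A minimal sketch of the intended shape (illustrative traits only, not Spark's exact DataSourceV2 signatures):
{code:scala}
// The factory lives for the whole query; each DataWriter it creates is scoped
// to a single epoch and is finished by commit() or abort().
trait EpochDataWriterFactory[T] extends Serializable {
  def createDataWriter(partitionId: Int, attemptNumber: Int, epochId: Long): EpochDataWriter[T]
}

trait EpochDataWriter[T] {
  def write(record: T): Unit
  def commit(): Unit   // ends this epoch's writer lifecycle
  def abort(): Unit    // ends it on failure
}
{code}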






[jira] [Created] (SPARK-23557) design doc for read side

2018-03-01 Thread Jose Torres (JIRA)
Jose Torres created SPARK-23557:
---

 Summary: design doc for read side
 Key: SPARK-23557
 URL: https://issues.apache.org/jira/browse/SPARK-23557
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Jose Torres









[jira] [Created] (SPARK-23556) design doc for write side

2018-03-01 Thread Jose Torres (JIRA)
Jose Torres created SPARK-23556:
---

 Summary: design doc for write side
 Key: SPARK-23556
 URL: https://issues.apache.org/jira/browse/SPARK-23556
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Jose Torres









[jira] [Commented] (SPARK-18844) Add more binary classification metrics to BinaryClassificationMetrics

2018-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382719#comment-16382719
 ] 

Apache Spark commented on SPARK-18844:
--

User 'sandecho' has created a pull request for this issue:
https://github.com/apache/spark/pull/20709

> Add more binary classification metrics to BinaryClassificationMetrics
> -
>
> Key: SPARK-18844
> URL: https://issues.apache.org/jira/browse/SPARK-18844
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.0.2
>Reporter: Zak Patterson
>Priority: Minor
>  Labels: evaluation
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> BinaryClassificationMetrics only implements Precision (positive predictive 
> value) and recall (true positive rate). It should implement more 
> comprehensive metrics.
> Moreover, the instance variables storing computed counts are marked private, 
> and there are no accessors for them. So if one desired to add this 
> functionality, one would have to duplicate this calculation, which is not 
> trivial:
> https://github.com/apache/spark/blob/v2.0.2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L144
> Currently Implemented Metrics
> ---
> * Precision (PPV): `precisionByThreshold`
> * Recall (Sensitivity, true positive rate): `recallByThreshold`
> Desired additional metrics
> ---
> * False omission rate: `forByThreshold`
> * False discovery rate: `fdrByThreshold`
> * Negative predictive value: `npvByThreshold`
> * False negative rate: `fnrByThreshold`
> * True negative rate (Specificity): `specificityByThreshold`
> * False positive rate: `fprByThreshold`
> Alternatives
> ---
> The `createCurve` method is marked private. If it were marked public, and the 
> trait BinaryClassificationMetricComputer were also marked public, then it 
> would be easy to define new computers to get whatever the user wanted.
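For reference, a self-contained sketch of the requested rates in terms of raw confusion-matrix counts (names are illustrative, not the proposed API):
{code:scala}
case class ConfusionCounts(tp: Long, fp: Long, tn: Long, fn: Long) {
  def specificity: Double = tn.toDouble / (tn + fp) // true negative rate
  def fpr: Double         = fp.toDouble / (fp + tn) // false positive rate
  def fnr: Double         = fn.toDouble / (fn + tp) // false negative rate
  def npv: Double         = tn.toDouble / (tn + fn) // negative predictive value
  def fdr: Double         = fp.toDouble / (fp + tp) // false discovery rate
  def forRate: Double     = fn.toDouble / (fn + tn) // false omission rate
}
{code}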






[jira] [Commented] (SPARK-23555) Add BinaryType support for Arrow in PySpark

2018-03-01 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382715#comment-16382715
 ] 

Bryan Cutler commented on SPARK-23555:
--

I'm working on it

> Add BinaryType support for Arrow in PySpark
> ---
>
> Key: SPARK-23555
> URL: https://issues.apache.org/jira/browse/SPARK-23555
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Major
>
> BinaryType is supported with Arrow in Scala, but not yet in Python.  It 
> should be added for all codepaths using Arrow.






[jira] [Created] (SPARK-23555) Add BinaryType support for Arrow in PySpark

2018-03-01 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-23555:


 Summary: Add BinaryType support for Arrow in PySpark
 Key: SPARK-23555
 URL: https://issues.apache.org/jira/browse/SPARK-23555
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 2.3.0
Reporter: Bryan Cutler


BinaryType is supported with Arrow in Scala, but not yet in Python.  It should 
be added for all codepaths using Arrow.






[jira] [Comment Edited] (SPARK-23520) Add support for MapType fields in JSON schema inference

2018-03-01 Thread David Courtinot (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382649#comment-16382649
 ] 

David Courtinot edited comment on SPARK-23520 at 3/1/18 9:26 PM:
-

Good catch. I think the issue is very similar indeed. However, it seems like 
the issue I created better outlines the pros of taking care of this problem. 
For what it's worth, I should mention that I'm currently experimenting with my 
own Java (had to rewrite it for my team's codebase unfortunately) version which 
supports this feature. I hope to be able to propose a pull request in Scala 
soon-ish.

My current approach is the following:
 * since there's no way of distinguishing a map from a struct in JSON, I allow 
the user to pass a set of optional fields (nested fields are supported) that 
they want to be inferred as maps
 * I modified _inferField_ to track the current field path in order to compare 
it with the user-provided paths
 * when I encounter an object, I infer all the _StructField_(s) as usual, but I 
also check whether the field path is included in the user-provided set. If it 
is, I reduce the _StructField_(s) to a single _DataType_ by calling 
_compatibleRootType_ on their _dataType()_
 * I always infer the keys as strings because nothing else would make sense in 
JSON
 * I added a clause in _compatibleType_ in order to merge the value types of 
two _MapType_. I check that both maps have _StringType_ as their key type.

I should probably throw an exception when a field in the set is anything other 
than a JSON object, now that I think of it.
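For context, a small sketch of what callers have to do today (field names are illustrative): the only way to get a _MapType_ for a high-cardinality field is to declare the schema by hand, which is what the proposed inference hook would automate.
{code:scala}
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("counters", MapType(StringType, LongType)))) // forced as a map, not a struct

val events = spark.read.schema(schema).json("events.json")
{code}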


was (Author: dicee):
Good catch. I think the issue is very similar indeed. However, it seems like 
the issue I created better outlines the pros of taking care of this problem. 
For what it's worth, I should mention that I'm currently experimenting with my 
own Java (had to rewrite it for my team's codebase unfortunately) version which 
supports this feature. I hope to be able to propose a pull request in Scala 
soon-ish.

My current approach is the following:
 * since there's no way of distinguishing a map from a struct in JSON, I allow 
the user to pass a set of optional fields (nested fields are supported) that 
they want to be inferred as maps
 * I modified _inferField_ to track the current field path in order to compare 
it with the user-provided paths
 * when I encounter an object, I infer all the _StructField_(s) as usual, but I 
also check whether the field path is included in the user-provided set. If it 
is, I reduce the _StructField_(s) to a single _DataType_ by calling 
_compatibleRootType_ on their _dataType()_
 * I always infer the keys as maps because nothing else would make sense in JSON
 * I added a clause in _compatibleType_ in order to merge the value types of 
two _MapType_. I check that both maps have _StringType_ as their key type.

I should probably throw an exception when a field in the set is anything else 
than a JSON object now that I think of it.

> Add support for MapType fields in JSON schema inference
> ---
>
> Key: SPARK-23520
> URL: https://issues.apache.org/jira/browse/SPARK-23520
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: David Courtinot
>Priority: Major
>
> _InferSchema_ currently does not support inferring _MapType_ fields from JSON 
> data, and for a good reason: they are indistinguishable from structs in JSON 
> format. In issue 
> [SPARK-23494|https://issues.apache.org/jira/browse/SPARK-23494], I proposed 
> to expose some methods of _InferSchema_ to users so that they can build on 
> top of the inference primitives defined by this class. In this issue, I'm 
> proposing to add more control to the user by letting them specify a set of 
> fields that should be forced as _MapType._
> *Use-case*
> Some JSON datasets contain high-cardinality fields, namely fields whose key 
> space is very large. These fields shouldn't be interpreted as _StructType_ 
> for the following reasons:
>  * it's not really what they are. The key space as well as the value space 
> may both be infinite, so what best defines the schema of this data is the 
> type of the keys and the type of the values, not a struct containing all 
> possible key-value pairs.
>  * interpreting high-cardinality fields as structs can lead to enormous 
> schemata that don't even fit into memory.
> *Proposition*
> We would add a public overloaded signature for _InferSchema.inferField_ which 
> allows passing a set of field accessors (a class that supports representing 
> the access to any JSON field, including nested ones) for which we do not want 
> to recurse and instead force a schema. That would allow us, in particular, to 
> ask that a few fields be inferred as maps 

[jira] [Comment Edited] (SPARK-23520) Add support for MapType fields in JSON schema inference

2018-03-01 Thread David Courtinot (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382649#comment-16382649
 ] 

David Courtinot edited comment on SPARK-23520 at 3/1/18 9:24 PM:
-

Good catch. I think the issue is very similar indeed. However, it seems like 
the issue I created better outlines the pros of taking care of this problem. 
For what it's worth, I should mention that I'm currently experimenting with my 
own Java (had to rewrite it for my team's codebase unfortunately) version which 
supports this feature. I hope to be able to propose a pull request in Scala 
soon-ish.

My current approach is the following:
 * since there's no way of distinguishing a map from a struct in JSON, I allow 
the user to pass a set of optional fields (nested fields are supported) that 
they want to be inferred as maps
 * I modified _inferField_ to track the current field path in order to compare 
it with the user-provided paths
 * when I encounter an object, I infer all the _StructField_(s) as usual, but I 
also check whether the field path is included in the user-provided set. If it 
is, I reduce the _StructField_(s) to a single _DataType_ by calling 
_compatibleRootType_ on their _dataType()_
 * I always infer the keys as maps because nothing else would make sense in JSON
 * I added a clause in _compatibleType_ in order to merge the value types of 
two _MapType_. I check that both maps have _StringType_ as their key type.

I should probably throw an exception when a field in the set is anything else 
than a JSON object now that I think of it.


was (Author: dicee):
Good catch. I think the issue is very similar indeed. However, it seems like 
the issue I created better outlines the pros of taking care of this problem. 
For what it's worth, I should mention that I'm currently experimenting with my 
own Java (had to rewrite it for my team's codebase unfortunately) version which 
supports this feature. I hope to be able to propose a pull request in Scala 
soon-ish.

My current approach is the following:
 * since there's no way of distinguishing a map from a struct in JSON, I allow 
the user to pass a set of optional fields (nested fields are supported) that 
they want to be inferred as maps
 * I modified _inferField_ to track the current field path in order to compare 
it with the user-provided paths
 * when I encounter an object, I infer all the _StructField_(s) as usual, but I 
also check whether the field path is included in the user-provided set. If it 
is, I reduce the _StructField_(s) to a single _DataType_ by calling 
_compatibleRootType_ on their _dataType()_
 * I always infer the keys as maps because nothing else would make sense in JSON
 * I added a clause in _compatibleType_ in order to merge the value types of 
two _MapType_. I check that both maps have _StringType_ as their key type.

> Add support for MapType fields in JSON schema inference
> ---
>
> Key: SPARK-23520
> URL: https://issues.apache.org/jira/browse/SPARK-23520
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: David Courtinot
>Priority: Major
>
> _InferSchema_ currently does not support inferring _MapType_ fields from JSON 
> data, and for a good reason: they are indistinguishable from structs in JSON 
> format. In issue 
> [SPARK-23494|https://issues.apache.org/jira/browse/SPARK-23494], I proposed 
> to expose some methods of _InferSchema_ to users so that they can build on 
> top of the inference primitives defined by this class. In this issue, I'm 
> proposing to add more control to the user by letting them specify a set of 
> fields that should be forced as _MapType._
> *Use-case*
> Some JSON datasets contain high-cardinality fields, namely fields whose key 
> space is very large. These fields shouldn't be interpreted as _StructType_ 
> for the following reasons:
>  * it's not really what they are. The key space as well as the value space 
> may both be infinite, so what best defines the schema of this data is the 
> type of the keys and the type of the values, not a struct containing all 
> possible key-value pairs.
>  * interpreting high-cardinality fields as structs can lead to enormous 
> schemata that don't even fit into memory.
> *Proposition*
> We would add a public overloaded signature for _InferSchema.inferField_ which 
> allows passing a set of field accessors (a class that supports representing 
> the access to any JSON field, including nested ones) for which we do not want 
> to recurse and instead force a schema. That would allow us, in particular, to 
> ask that a few fields be inferred as maps rather than structs.
> I am very open to discussing this with people who are more well-versed in the 
> Spark codebase than me, 

[jira] [Comment Edited] (SPARK-23520) Add support for MapType fields in JSON schema inference

2018-03-01 Thread David Courtinot (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382649#comment-16382649
 ] 

David Courtinot edited comment on SPARK-23520 at 3/1/18 9:22 PM:
-

Good catch. I think the issue is very similar indeed. However, it seems like 
the issue I created better outlines the pros of taking care of this problem. 
For what it's worth, I should mention that I'm currently experimenting with my 
own Java (had to rewrite it for my team's codebase unfortunately) version which 
supports this feature. I hope to be able to propose a pull request in Scala 
soon-ish.

My current approach is the following:
 * since there's no way of distinguishing a map from a struct in JSON, I allow 
the user to pass a set of optional fields (nested fields are supported) that 
they want to be inferred as maps
 * I modified _inferField_ to track the current field path in order to compare 
it with the user-provided paths
 * when I encounter an object, I infer all the _StructField_(s) as usual, but I 
also check whether the field path is included in the user-provided set. If it 
is, I reduce the _StructField_(s) to a single _DataType_ by calling 
_compatibleRootType_ on their _dataType()_
 * I always infer the keys as maps because nothing else would make sense in JSON
 * I added a clause in _compatibleType_ in order to merge the value types of 
two _MapType_. I check that both maps have _StringType_ as their key type.


was (Author: dicee):
Good catch. I think the issue is very similar indeed. However, it seems like 
the issue I created better outlines the pros of taking care of this problem. 
For what it's worth, I should mention that I'm currently experimenting with my 
own Java (had to rewrite it for my team's codebase unfortunately) version which 
supports this feature. I hope to be able to propose a pull request in Scala 
soon-ish.

My current approach is the following:
 * since there's no way to distinguish a map from a struct in JSON, I allow 
the user to pass a set of optional fields (nested fields are supported) that 
they want to be inferred as maps
 * I modified _inferField_ to track the current field path in order to compare 
it with the user-provided paths
 * when I encounter an object, I infer all the _StructField_(s) as usual, but I 
also check whether the field path is included in the user-provided set. If it 
is, I reduce the _StructField_(s) to a single _DataType_ by calling 
_compatibleRootType_ on their _dataType()_
 * I always infer the keys as maps because nothing else would make sense in JSON
 * I added a clause in _compatibleType_ in order to merge the value types of 
two _MapType_. I check that both maps have _StringType_ as their key type.

> Add support for MapType fields in JSON schema inference
> ---
>
> Key: SPARK-23520
> URL: https://issues.apache.org/jira/browse/SPARK-23520
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: David Courtinot
>Priority: Major
>
> _InferSchema_ currently does not support inferring _MapType_ fields from JSON 
> data, and for a good reason: they are indistinguishable from structs in JSON 
> format. In issue 
> [SPARK-23494|https://issues.apache.org/jira/browse/SPARK-23494], I proposed 
> to expose some methods of _InferSchema_ to users so that they can build on 
> top of the inference primitives defined by this class. In this issue, I'm 
> proposing to add more control to the user by letting them specify a set of 
> fields that should be forced as _MapType._
> *Use-case*
> Some JSON datasets contain high-cardinality fields, namely fields whose key 
> space is very large. These fields shouldn't be interpreted as _StructType_ 
> for the following reasons:
>  * it's not really what they are. The key space as well as the value space 
> may both be infinite, so what best defines the schema of this data is the 
> type of the keys and the type of the values, not a struct containing all 
> possible key-value pairs.
>  * interpreting high-cardinality fields as structs can lead to enormous 
> schemata that don't even fit into memory.
> *Proposition*
> We would add a public overloaded signature for _InferSchema.inferField_ which 
> allows passing a set of field accessors (a class that supports representing 
> the access to any JSON field, including nested ones) for which we do not want 
> to recurse and instead force a schema. That would allow us, in particular, to 
> ask that a few fields be inferred as maps rather than structs.
> I am very open to discussing this with people who are more well-versed in the 
> Spark codebase than me, because I realize my proposition can feel somewhat 
> patchy. I'll be more than happy to provide some development effort if we 

[jira] [Comment Edited] (SPARK-23520) Add support for MapType fields in JSON schema inference

2018-03-01 Thread David Courtinot (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382649#comment-16382649
 ] 

David Courtinot edited comment on SPARK-23520 at 3/1/18 9:22 PM:
-

Good catch. I think the issue is very similar indeed. However, it seems like 
the issue I created better outlines the pros of taking care of this problem. 
For what it's worth, I should mention that I'm currently experimenting with my 
own Java (had to rewrite it for my team's codebase unfortunately) version which 
supports this feature. I hope to be able to propose a pull request in Scala 
soon-ish.

My current approach is the following:
 * since there's no way to distinguish a map from a struct in JSON, I allow 
the user to pass a set of optional fields (nested fields are supported) that 
they want to be inferred as maps
 * I modified _inferField_ to track the current field path in order to compare 
it with the user-provided paths
 * when I encounter an object, I infer all the _StructField_(s) as usual, but I 
also check whether the field path is included in the user-provided set. If it 
is, I reduce the _StructField_(s) to a single _DataType_ by calling 
_compatibleRootType_ on their _dataType()_
 * I always infer the keys as maps because nothing else would make sense in JSON
 * I added a clause in _compatibleType_ in order to merge the value types of 
two _MapType_. I check that both maps have _StringType_ as their key type.


was (Author: dicee):
Good catch. I think the issue is very similar indeed. However, it seems like 
the issue I created better outlines the pros of taking care of this problem. 
For what it's worth, I should mention that I'm currently experimenting with my 
own Java (had to rewrite it for my team's codebase unfortunately) version which 
supports this feature. I hope to be able to propose a pull request in Scala 
soon-ish.

> Add support for MapType fields in JSON schema inference
> ---
>
> Key: SPARK-23520
> URL: https://issues.apache.org/jira/browse/SPARK-23520
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: David Courtinot
>Priority: Major
>
> _InferSchema_ currently does not support inferring _MapType_ fields from JSON 
> data, and for a good reason: they are indistinguishable from structs in JSON 
> format. In issue 
> [SPARK-23494|https://issues.apache.org/jira/browse/SPARK-23494], I proposed 
> to expose some methods of _InferSchema_ to users so that they can build on 
> top of the inference primitives defined by this class. In this issue, I'm 
> proposing to add more control to the user by letting them specify a set of 
> fields that should be forced as _MapType._
> *Use-case*
> Some JSON datasets contain high-cardinality fields, namely fields whose key 
> space is very large. These fields shouldn't be interpreted as _StructType_ 
> for the following reasons:
>  * it's not really what they are. The key space as well as the value space 
> may both be infinite, so what best defines the schema of this data is the 
> type of the keys and the type of the values, not a struct containing all 
> possible key-value pairs.
>  * interpreting high-cardinality fields as structs can lead to enormous 
> schemata that don't even fit into memory.
> *Proposition*
> We would add a public overloaded signature for _InferSchema.inferField_ which 
> allows passing a set of field accessors (a class that supports representing 
> the access to any JSON field, including nested ones) for which we do not want 
> to recurse and instead force a schema. That would allow us, in particular, to 
> ask that a few fields be inferred as maps rather than structs.
> I am very open to discussing this with people who are more well-versed in the 
> Spark codebase than me, because I realize my proposition can feel somewhat 
> patchy. I'll be more than happy to provide some development effort if we 
> manage to sketch a reasonably easy solution.






[jira] [Commented] (SPARK-23520) Add support for MapType fields in JSON schema inference

2018-03-01 Thread David Courtinot (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382649#comment-16382649
 ] 

David Courtinot commented on SPARK-23520:
-

Good catch. I think the issue is very similar indeed. However, it seems like 
the issue I created better outlines the pros of taking care of this problem. 
For what it's worth, I should mention that I'm currently experimenting with my 
own Java (had to rewrite it for my team's codebase unfortunately) version which 
supports this feature. I hope to be able to propose a pull request in Scala 
soon-ish.

> Add support for MapType fields in JSON schema inference
> ---
>
> Key: SPARK-23520
> URL: https://issues.apache.org/jira/browse/SPARK-23520
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: David Courtinot
>Priority: Major
>
> _InferSchema_ currently does not support inferring _MapType_ fields from JSON 
> data, and for a good reason: they are indistinguishable from structs in JSON 
> format. In issue 
> [SPARK-23494|https://issues.apache.org/jira/browse/SPARK-23494], I proposed 
> to expose some methods of _InferSchema_ to users so that they can build on 
> top of the inference primitives defined by this class. In this issue, I'm 
> proposing to add more control to the user by letting them specify a set of 
> fields that should be forced as _MapType._
> *Use-case*
> Some JSON datasets contain high-cardinality fields, namely fields whose key 
> space is very large. These fields shouldn't be interpreted as _StructType_ 
> for the following reasons:
>  * it's not really what they are. The key space as well as the value space 
> may both be infinite, so what best defines the schema of this data is the 
> type of the keys and the type of the values, not a struct containing all 
> possible key-value pairs.
>  * interpreting high-cardinality fields as structs can lead to enormous 
> schemata that don't even fit into memory.
> *Proposition*
> We would add a public overloaded signature for _InferSchema.inferField_ which 
> allows passing a set of field accessors (a class that supports representing 
> the access to any JSON field, including nested ones) for which we do not want 
> to recurse and instead force a schema. That would allow us, in particular, to 
> ask that a few fields be inferred as maps rather than structs.
> I am very open to discussing this with people who are more well-versed in the 
> Spark codebase than me, because I realize my proposition can feel somewhat 
> patchy. I'll be more than happy to provide some development effort if we 
> manage to sketch a reasonably easy solution.






[jira] [Created] (SPARK-23554) Hive's textinputformat.record.delimiter equivalent in Spark

2018-03-01 Thread Ruslan Dautkhanov (JIRA)
Ruslan Dautkhanov created SPARK-23554:
-

 Summary: Hive's textinputformat.record.delimiter equivalent in 
Spark
 Key: SPARK-23554
 URL: https://issues.apache.org/jira/browse/SPARK-23554
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.3.0, 2.2.1
Reporter: Ruslan Dautkhanov


It would be great if Spark supported an option similar to Hive's 
{{textinputformat.record.delimiter}} in the spark-csv reader.

We currently have to create Hive tables to work around this functionality 
missing natively in Spark.

{{textinputformat.record.delimiter}} was introduced back in 2011, in the 
map-reduce era; see MAPREDUCE-2254.

As an example, one of the most common use cases for us involving 
{{textinputformat.record.delimiter}} is reading multiple lines of text that 
make up a "record". The number of lines per "record" varies, so 
{{textinputformat.record.delimiter}} is a great solution for us to process 
these files natively in Hadoop/Spark (a custom .map() function then does the 
actual processing of those records), and we then convert the result to a 
DataFrame.
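A workaround sketch that is possible today (the path and delimiter are illustrative): set the Hadoop property directly, read through newAPIHadoopFile, and convert to a DataFrame.
{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import spark.implicits._

val conf = new Configuration(spark.sparkContext.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "\n\n")   // e.g. blank-line separated records

val records = spark.sparkContext
  .newAPIHadoopFile("multiline_records.txt", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], conf)
  .map { case (_, text) => text.toString }              // copy out of the reused Writable

val df = records.toDF("record")
{code}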






[jira] [Commented] (SPARK-21209) Implement Incremental PCA algorithm for ML

2018-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382537#comment-16382537
 ] 

Apache Spark commented on SPARK-21209:
--

User 'sandecho' has created a pull request for this issue:
https://github.com/apache/spark/pull/20708

> Implement Incremental PCA algorithm for ML
> --
>
> Key: SPARK-21209
> URL: https://issues.apache.org/jira/browse/SPARK-21209
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Ben St. Clair
>Priority: Major
>  Labels: features
>
> Incremental Principal Component Analysis is a method for calculating PCAs in 
> an incremental fashion, allowing one to update an existing PCA model as new 
> evidence arrives. Furthermore, an alpha parameter can be used to enable 
> task-specific weighting of new and old evidence.
> This algorithm would be useful for streaming applications, where a fast and 
> adaptive feature subspace calculation could be applied. Furthermore, it can 
> be applied to combine PCAs from subcomponents of large datasets.
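A minimal sketch of the alpha-weighted update at the core of the idea (plain Scala, not the proposed spark.ml API; this is one common formulation): the running mean and covariance are blended with each new observation, and the principal components would then come from an eigendecomposition of {{cov}} (omitted here).
{code:scala}
// Exponentially weighted update of mean and covariance for one observation x.
def update(mean: Array[Double], cov: Array[Array[Double]],
           x: Array[Double], alpha: Double): Unit = {
  val d = mean.length
  var i = 0
  while (i < d) { mean(i) = (1 - alpha) * mean(i) + alpha * x(i); i += 1 }
  i = 0
  while (i < d) {
    var j = 0
    while (j < d) {
      cov(i)(j) = (1 - alpha) * cov(i)(j) + alpha * (x(i) - mean(i)) * (x(j) - mean(j))
      j += 1
    }
    i += 1
  }
}
{code}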






[jira] [Updated] (SPARK-23551) Exclude `hadoop-mapreduce-client-core` dependency from `orc-mapreduce`

2018-03-01 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23551:
--
Priority: Minor  (was: Major)

> Exclude `hadoop-mapreduce-client-core` dependency from `orc-mapreduce`
> --
>
> Key: SPARK-23551
> URL: https://issues.apache.org/jira/browse/SPARK-23551
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> This issue aims to prevent the `orc-mapreduce` dependency from confusing IDEs 
> and Maven.
> *BEFORE*
> Please note the 2.6.4 version under Spark Project SQL.
> {code}
> $ mvn dependency:tree -Phadoop-2.7 
> -Dincludes=org.apache.hadoop:hadoop-mapreduce-client-core
> ...
> [INFO] 
> 
> [INFO] Building Spark Project Catalyst 2.4.0-SNAPSHOT
> [INFO] 
> 
> [INFO]
> [INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ 
> spark-catalyst_2.11 ---
> [INFO] org.apache.spark:spark-catalyst_2.11:jar:2.4.0-SNAPSHOT
> [INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
> [INFO]\- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
> [INFO]   \- 
> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
> [INFO]
> [INFO] 
> 
> [INFO] Building Spark Project SQL 2.4.0-SNAPSHOT
> [INFO] 
> 
> [INFO]
> [INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ spark-sql_2.11 
> ---
> [INFO] org.apache.spark:spark-sql_2.11:jar:2.4.0-SNAPSHOT
> [INFO] \- org.apache.orc:orc-mapreduce:jar:nohive:1.4.3:compile
> [INFO]\- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.6.4:compile
> {code}
> *AFTER*
> {code}
> $ mvn dependency:tree -Phadoop-2.7 
> -Dincludes=org.apache.hadoop:hadoop-mapreduce-client-core
> ...
> [INFO] 
> 
> [INFO] Building Spark Project Catalyst 2.4.0-SNAPSHOT
> [INFO] 
> 
> [INFO]
> [INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ 
> spark-catalyst_2.11 ---
> [INFO] org.apache.spark:spark-catalyst_2.11:jar:2.4.0-SNAPSHOT
> [INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
> [INFO]\- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
> [INFO]   \- 
> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
> [INFO]
> [INFO] 
> 
> [INFO] Building Spark Project SQL 2.4.0-SNAPSHOT
> [INFO] 
> 
> [INFO]
> [INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ spark-sql_2.11 
> ---
> [INFO] org.apache.spark:spark-sql_2.11:jar:2.4.0-SNAPSHOT
> [INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
> [INFO]\- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
> [INFO]   \- 
> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
> {code}
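One possible shape of the fix (a sketch only, in Maven terms; the actual change is whatever the PR settles on): excluding the transitive MapReduce client from the `orc-mapreduce` dependency so only the hadoop-client version remains on the classpath.
{code:xml}
<dependency>
  <groupId>org.apache.orc</groupId>
  <artifactId>orc-mapreduce</artifactId>
  <classifier>nohive</classifier>
  <exclusions>
    <exclusion>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-core</artifactId>
    </exclusion>
  </exclusions>
</dependency>
{code}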






[jira] [Assigned] (SPARK-21209) Implement Incremental PCA algorithm for ML

2018-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21209:


Assignee: (was: Apache Spark)

> Implement Incremental PCA algorithm for ML
> --
>
> Key: SPARK-21209
> URL: https://issues.apache.org/jira/browse/SPARK-21209
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Ben St. Clair
>Priority: Major
>  Labels: features
>
> Incremental Principal Component Analysis is a method for calculating PCAs in 
> an incremental fashion, allowing one to update an existing PCA model as new 
> evidence arrives. Furthermore, an alpha parameter can be used to enable 
> task-specific weighting of new and old evidence.
> This algorithm would be useful for streaming applications, where a fast and 
> adaptive feature subspace calculation could be applied. Furthermore, it can 
> be applied to combine PCAs from subcomponents of large datasets.






[jira] [Commented] (SPARK-21209) Implement Incremental PCA algorithm for ML

2018-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382501#comment-16382501
 ] 

Apache Spark commented on SPARK-21209:
--

User 'sandecho' has created a pull request for this issue:
https://github.com/apache/spark/pull/20707

> Implement Incremental PCA algorithm for ML
> --
>
> Key: SPARK-21209
> URL: https://issues.apache.org/jira/browse/SPARK-21209
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Ben St. Clair
>Priority: Major
>  Labels: features
>
> Incremental Principal Component Analysis is a method for calculating PCAs in 
> an incremental fashion, allowing one to update an existing PCA model as new 
> evidence arrives. Furthermore, an alpha parameter can be used to enable 
> task-specific weighting of new and old evidence.
> This algorithm would be useful for streaming applications, where a fast and 
> adaptive feature subspace calculation could be applied. Furthermore, it can 
> be applied to combine PCAs from subcomponents of large datasets.






[jira] [Assigned] (SPARK-21209) Implement Incremental PCA algorithm for ML

2018-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21209:


Assignee: Apache Spark

> Implement Incremental PCA algorithm for ML
> --
>
> Key: SPARK-21209
> URL: https://issues.apache.org/jira/browse/SPARK-21209
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Ben St. Clair
>Assignee: Apache Spark
>Priority: Major
>  Labels: features
>
> Incremental Principal Component Analysis is a method for calculating PCAs in 
> an incremental fashion, allowing one to update an existing PCA model as new 
> evidence arrives. Furthermore, an alpha parameter can be used to enable 
> task-specific weighting of new and old evidence.
> This algorithm would be useful for streaming applications, where a fast and 
> adaptive feature subspace calculation could be applied. Furthermore, it can 
> be applied to combine PCAs from subcomponents of large datasets.






[jira] [Assigned] (SPARK-23550) Cleanup unused / redundant methods in Utils object

2018-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23550:


Assignee: Apache Spark

> Cleanup unused / redundant methods in Utils object
> --
>
> Key: SPARK-23550
> URL: https://issues.apache.org/jira/browse/SPARK-23550
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Trivial
>
> While looking at some code in {{Utils}} for a different purpose, I noticed a 
> bunch of code there that can be removed or otherwise cleaned up.
> I'll send a PR after I run unit tests.






[jira] [Commented] (SPARK-23550) Cleanup unused / redundant methods in Utils object

2018-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382478#comment-16382478
 ] 

Apache Spark commented on SPARK-23550:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/20706

> Cleanup unused / redundant methods in Utils object
> --
>
> Key: SPARK-23550
> URL: https://issues.apache.org/jira/browse/SPARK-23550
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Trivial
>
> While looking at some code in {{Utils}} for a different purpose, I noticed a 
> bunch of code there that can be removed or otherwise cleaned up.
> I'll send a PR after I run unit tests.






[jira] [Assigned] (SPARK-23550) Cleanup unused / redundant methods in Utils object

2018-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23550:


Assignee: (was: Apache Spark)

> Cleanup unused / redundant methods in Utils object
> --
>
> Key: SPARK-23550
> URL: https://issues.apache.org/jira/browse/SPARK-23550
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Trivial
>
> While looking at some code in {{Utils}} for a different purpose, I noticed a 
> bunch of code there that can be removed or otherwise cleaned up.
> I'll send a PR after I run unit tests.






[jira] [Assigned] (SPARK-23553) Tests should not assume the default value of `spark.sql.sources.default`

2018-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23553:


Assignee: (was: Apache Spark)

> Tests should not assume the default value of `spark.sql.sources.default`
> 
>
> Key: SPARK-23553
> URL: https://issues.apache.org/jira/browse/SPARK-23553
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Currently, some tests have an assumption that 
> `spark.sql.sources.default=parquet`. In fact, that is a correct assumption, 
> but that assumption makes it difficult to test new data source formats. This 
> issue makes test suites more robust and makes it easy to test new data 
> sources. As an example, the PR will use `spark.sql.sources.default=orc` 
> during reviews.






[jira] [Commented] (SPARK-23553) Tests should not assume the default value of `spark.sql.sources.default`

2018-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382448#comment-16382448
 ] 

Apache Spark commented on SPARK-23553:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/20705

> Tests should not assume the default value of `spark.sql.sources.default`
> 
>
> Key: SPARK-23553
> URL: https://issues.apache.org/jira/browse/SPARK-23553
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Currently, some tests have an assumption that 
> `spark.sql.sources.default=parquet`. In fact, that is a correct assumption, 
> but that assumption makes it difficult to test new data source formats. This 
> issue makes test suites more robust and makes it easy to test new data 
> sources. As an example, the PR will use `spark.sql.sources.default=orc` 
> during reviews.






[jira] [Assigned] (SPARK-23553) Tests should not assume the default value of `spark.sql.sources.default`

2018-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23553:


Assignee: Apache Spark

> Tests should not assume the default value of `spark.sql.sources.default`
> 
>
> Key: SPARK-23553
> URL: https://issues.apache.org/jira/browse/SPARK-23553
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> Currently, some tests have an assumption that 
> `spark.sql.sources.default=parquet`. In fact, that is a correct assumption, 
> but that assumption makes it difficult to test new data source formats. This 
> issue makes test suites more robust and makes it easy to test new data 
> sources. As an example, the PR will use `spark.sql.sources.default=orc` 
> during reviews.






[jira] [Updated] (SPARK-23553) Tests should not assume the default value of `spark.sql.sources.default`

2018-03-01 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23553:
--
Description: Currently, some tests have an assumption that 
`spark.sql.sources.default=parquet`. In fact, that is a correct assumption, but 
that assumption makes it difficult to test new data source format. This issue 
improves test suites more robust and makes it easy to test new data sources. As 
an example, the PR will use `spark.sql.sources.default=orc` during reviews.  
(was: Currently, some tests have an assumption that 
`spark.sql.sources.default=parquet`. In fact, that is a correct assumption, but 
that assumption makes it difficult to test new data source format. This issue 
improves test suites more robust and makes it easy to test new data sources. As 
an example, I will change )

> Tests should not assume the default value of `spark.sql.sources.default`
> 
>
> Key: SPARK-23553
> URL: https://issues.apache.org/jira/browse/SPARK-23553
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Currently, some tests have an assumption that 
> `spark.sql.sources.default=parquet`. In fact, that is a correct assumption, 
> but that assumption makes it difficult to test new data source formats. This 
> issue makes test suites more robust and makes it easy to test new data 
> sources. As an example, the PR will use `spark.sql.sources.default=orc` 
> during reviews.






[jira] [Updated] (SPARK-23553) Tests should not assume the default value of `spark.sql.sources.default`

2018-03-01 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23553:
--
Description: Currently, some tests have an assumption that 
`spark.sql.sources.default=parquet`. In fact, that is a correct assumption, but 
that assumption makes it difficult to test new data source format. This issue 
improves test suites more robust and makes it easy to test new data sources. As 
an example, I will change   (was: Currently, some tests have an assumption that 
`spark.sql.sources.default=parquet`. In fact, that is a correct assumption, but 
that assumption makes it difficult to test new data source format. This issue 
improves test suites more robust and makes it easy to test new data sources.)

> Tests should not assume the default value of `spark.sql.sources.default`
> 
>
> Key: SPARK-23553
> URL: https://issues.apache.org/jira/browse/SPARK-23553
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Currently, some tests have an assumption that 
> `spark.sql.sources.default=parquet`. In fact, that is a correct assumption, 
> but that assumption makes it difficult to test new data source formats. This 
> issue makes test suites more robust and makes it easy to test new data 
> sources. As an example, I will change 






[jira] [Created] (SPARK-23553) Tests should not assume the default value of `spark.sql.sources.default`

2018-03-01 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-23553:
-

 Summary: Tests should not assume the default value of 
`spark.sql.sources.default`
 Key: SPARK-23553
 URL: https://issues.apache.org/jira/browse/SPARK-23553
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 2.3.0
Reporter: Dongjoon Hyun


Currently, some tests have an assumption that 
`spark.sql.sources.default=parquet`. In fact, that is a correct assumption, but 
that assumption makes it difficult to test new data source formats. This issue 
makes test suites more robust and makes it easy to test new data sources.
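A minimal illustration of the kind of test-side change implied (not the actual PR; the table name is illustrative): pin the data source explicitly so the test keeps passing even if the default changes.
{code:scala}
// Pin the format for the duration of the test and restore it afterwards.
val previous = spark.conf.get("spark.sql.sources.default")
spark.conf.set("spark.sql.sources.default", "orc")
try {
  spark.range(10).write.mode("overwrite").saveAsTable("t")   // written as ORC now
  assert(spark.table("t").count() == 10)
} finally {
  spark.conf.set("spark.sql.sources.default", previous)
}
{code}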






[jira] [Updated] (SPARK-23551) Exclude `hadoop-mapreduce-client-core` dependency from `orc-mapreduce`

2018-03-01 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23551:
--
Description: 
This issue aims to prevent the `orc-mapreduce` dependency from confusing IDEs 
and Maven.

*BEFORE*
Please note the 2.6.4 version under Spark Project SQL.
{code}
$ mvn dependency:tree -Phadoop-2.7 
-Dincludes=org.apache.hadoop:hadoop-mapreduce-client-core
...
[INFO] 
[INFO] Building Spark Project Catalyst 2.4.0-SNAPSHOT
[INFO] 
[INFO]
[INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ 
spark-catalyst_2.11 ---
[INFO] org.apache.spark:spark-catalyst_2.11:jar:2.4.0-SNAPSHOT
[INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
[INFO]\- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
[INFO]   \- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
[INFO]
[INFO] 
[INFO] Building Spark Project SQL 2.4.0-SNAPSHOT
[INFO] 
[INFO]
[INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ spark-sql_2.11 ---
[INFO] org.apache.spark:spark-sql_2.11:jar:2.4.0-SNAPSHOT
[INFO] \- org.apache.orc:orc-mapreduce:jar:nohive:1.4.3:compile
[INFO]\- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.6.4:compile
{code}

*AFTER*
{code}
$ mvn dependency:tree -Phadoop-2.7 
-Dincludes=org.apache.hadoop:hadoop-mapreduce-client-core
...
[INFO] 
[INFO] Building Spark Project Catalyst 2.4.0-SNAPSHOT
[INFO] 
[INFO]
[INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ 
spark-catalyst_2.11 ---
[INFO] org.apache.spark:spark-catalyst_2.11:jar:2.4.0-SNAPSHOT
[INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
[INFO]\- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
[INFO]   \- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
[INFO]
[INFO] 
[INFO] Building Spark Project SQL 2.4.0-SNAPSHOT
[INFO] 
[INFO]
[INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ spark-sql_2.11 ---
[INFO] org.apache.spark:spark-sql_2.11:jar:2.4.0-SNAPSHOT
[INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
[INFO]\- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
[INFO]   \- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
{code}

  was:
This issue aims to prevent `orc-mapreduce` dependency makes IDEs and maven 
confused.

*BEFORE*
Please note that 2.6.4 at Spark Project SQL.
{code}
$ mvn dependency:tree -Phadoop-2.7 
-Dincludes=org.apache.hadoop:hadoop-mapreduce-client-core
...
[INFO] 
[INFO] Building Spark Project Catalyst 2.4.0-SNAPSHOT
[INFO] 
[INFO]
[INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ 
spark-catalyst_2.11 ---
[INFO] org.apache.spark:spark-catalyst_2.11:jar:2.4.0-SNAPSHOT
[INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
[INFO]\- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
[INFO]   \- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
[INFO]
[INFO] 
[INFO] Building Spark Project SQL 2.4.0-SNAPSHOT
[INFO] 
[INFO]
[INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ spark-sql_2.11 ---
[INFO] org.apache.spark:spark-sql_2.11:jar:2.4.0-SNAPSHOT
[INFO] \- org.apache.orc:orc-mapreduce:jar:nohive:1.4.3:compile
[INFO]\- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.6.4:compile
{code}

*AFTER*
{code}
$ mvn dependency:tree -Phadoop-2.7 
-Dincludes=org.apache.hadoop:hadoop-mapreduce-client-core
...
[INFO] 
[INFO] Building Spark Project Catalyst 2.4.0-SNAPSHOT
[INFO] 
[INFO]
[INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ 
spark-catalyst_2.11 ---
[INFO] org.apache.spark:spark-catalyst_2.11:jar:2.4.0-SNAPSHOT
[INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
[INFO]\- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
[INFO]   \- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
[INFO]
[INFO] 
[INFO] Building 

[jira] [Created] (SPARK-23552) Dataset.withColumn does not allow overriding of a struct field

2018-03-01 Thread David Capwell (JIRA)
David Capwell created SPARK-23552:
-

 Summary: Dataset.withColumn does not allow overriding of a struct 
field
 Key: SPARK-23552
 URL: https://issues.apache.org/jira/browse/SPARK-23552
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: David Capwell


We have a Dataset with a schema such as the following:

{code:java}
struct<foo: struct<bar: ...>>
{code}

If we do the following, one would expect to override the type of bar, but 
instead a new column gets added:

{code:java}
ds.withColumn("foo.bar", ...){code}

This produces the following schema:

{code:java}
struct<foo: struct<bar: ...>, foo.bar: ...>{code}
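
A possible workaround, sketched below, is to rebuild the outer struct rather 
than rely on a dotted column name. The replacement value and the commented-out 
extra field are hypothetical; only foo and bar come from the example above.

{code:scala}
// Workaround sketch: replace the whole "foo" struct, re-listing its fields,
// because withColumn("foo.bar", ...) creates a new top-level column named "foo.bar".
import org.apache.spark.sql.functions.{lit, struct}

val patched = ds.withColumn(
  "foo",
  struct(
    lit("new value").as("bar")      // the field being overridden
    // , col("foo.otherField")      // re-list any other fields of foo here (hypothetical)
  )
)
{code}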
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23543) Automatic Module creation fails in Java 9

2018-03-01 Thread Brian D Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382399#comment-16382399
 ] 

Brian D Chambers commented on SPARK-23543:
--

Note: this is not an implementation of Java 9 modules. It is just a piece of 
metadata that allows a non-modular dependency to work within a modular project.

> Automatic Module creation fails in Java 9
> -
>
> Key: SPARK-23543
> URL: https://issues.apache.org/jira/browse/SPARK-23543
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
> Environment: maven + jdk9 + project based on jdk9 module system
>Reporter: Brian D Chambers
>Priority: Major
>
> When adding Spark to a Java 9 project that uses the new JDK 9 module system, 
> Spark components cannot be used because the automatic module names generated 
> by the JDK are invalid when the artifact name has digits at what would become 
> the beginning of an identifier.  The JDK cannot derive an automatic name for 
> the Spark module, leaving Spark unusable from within a Java module.
> This problem can also be validated/tested on the command line against any 
> Spark jar, e.g.
> {panel:title=jar --file=spark-graphx_2.11-2.3.0.jar --describe-module}
> Unable to derive module descriptor for: spark-graphx_2.11-2.3.0.jar 
> spark.graphx.2.11: Invalid module name: '2' is not a Java identifier
> {panel}
> Spark does not have to support jdk9 modules to fix this issue.  It just needs 
> to add a line of metadata to its manifest so the jdk can generate a valid 
> automatic name.
> the following would be sufficient to fix the issue in spark.graphx
> {code:java}
> <plugin>
>   <groupId>org.apache.maven.plugins</groupId>
>   <artifactId>maven-jar-plugin</artifactId>
>   <configuration>
>     <archive>
>       <manifestEntries>
>         <Automatic-Module-Name>spark.graphx</Automatic-Module-Name>
>       </manifestEntries>
>     </archive>
>   </configuration>
> </plugin>
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23551) Exclude `hadoop-mapreduce-client-core` dependency from `orc-mapreduce`

2018-03-01 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23551:
--
Description: 
This issue aims to prevent the `orc-mapreduce` dependency from confusing IDEs 
and Maven.

*BEFORE*
Please note the 2.6.4 version of `hadoop-mapreduce-client-core` under Spark Project SQL below.
{code}
$ mvn dependency:tree -Phadoop-2.7 
-Dincludes=org.apache.hadoop:hadoop-mapreduce-client-core
...
[INFO] 
[INFO] Building Spark Project Catalyst 2.4.0-SNAPSHOT
[INFO] 
[INFO]
[INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ 
spark-catalyst_2.11 ---
[INFO] org.apache.spark:spark-catalyst_2.11:jar:2.4.0-SNAPSHOT
[INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
[INFO]\- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
[INFO]   \- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
[INFO]
[INFO] 
[INFO] Building Spark Project SQL 2.4.0-SNAPSHOT
[INFO] 
[INFO]
[INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ spark-sql_2.11 ---
[INFO] org.apache.spark:spark-sql_2.11:jar:2.4.0-SNAPSHOT
[INFO] \- org.apache.orc:orc-mapreduce:jar:nohive:1.4.3:compile
[INFO]\- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.6.4:compile
{code}

*AFTER*
{code}
$ mvn dependency:tree -Phadoop-2.7 
-Dincludes=org.apache.hadoop:hadoop-mapreduce-client-core
...
[INFO] 
[INFO] Building Spark Project Catalyst 2.4.0-SNAPSHOT
[INFO] 
[INFO]
[INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ 
spark-catalyst_2.11 ---
[INFO] org.apache.spark:spark-catalyst_2.11:jar:2.4.0-SNAPSHOT
[INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
[INFO]\- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
[INFO]   \- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
[INFO]
[INFO] 
[INFO] Building Spark Project SQL 2.4.0-SNAPSHOT
[INFO] 
[INFO]
[INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ spark-sql_2.11 ---
[INFO] org.apache.spark:spark-sql_2.11:jar:2.4.0-SNAPSHOT
[INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
[INFO]\- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
[INFO]   \- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
{code}

  was:
This issue aims to prevent `orc-mapreduce` dependency makes IDEs and maven 
confused.

*BEFORE*
{code}
$ mvn dependency:tree -Phadoop-2.7 
-Dincludes=org.apache.hadoop:hadoop-mapreduce-client-core
...
[INFO] 
[INFO] Building Spark Project Catalyst 2.4.0-SNAPSHOT
[INFO] 
[INFO]
[INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ 
spark-catalyst_2.11 ---
[INFO] org.apache.spark:spark-catalyst_2.11:jar:2.4.0-SNAPSHOT
[INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
[INFO]\- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
[INFO]   \- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
[INFO]
[INFO] 
[INFO] Building Spark Project SQL 2.4.0-SNAPSHOT
[INFO] 
[INFO]
[INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ spark-sql_2.11 ---
[INFO] org.apache.spark:spark-sql_2.11:jar:2.4.0-SNAPSHOT
[INFO] \- org.apache.orc:orc-mapreduce:jar:nohive:1.4.3:compile
[INFO]\- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.6.4:compile
{code}

*AFTER*
{code}
$ mvn dependency:tree -Phadoop-2.7 
-Dincludes=org.apache.hadoop:hadoop-mapreduce-client-core
...
[INFO] 
[INFO] Building Spark Project Catalyst 2.4.0-SNAPSHOT
[INFO] 
[INFO]
[INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ 
spark-catalyst_2.11 ---
[INFO] org.apache.spark:spark-catalyst_2.11:jar:2.4.0-SNAPSHOT
[INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
[INFO]\- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
[INFO]   \- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
[INFO]
[INFO] 
[INFO] Building Spark Project SQL 2.4.0-SNAPSHOT
[INFO] 

[jira] [Assigned] (SPARK-23551) Exclude `hadoop-mapreduce-client-core` dependency from `orc-mapreduce`

2018-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23551:


Assignee: Apache Spark

> Exclude `hadoop-mapreduce-client-core` dependency from `orc-mapreduce`
> --
>
> Key: SPARK-23551
> URL: https://issues.apache.org/jira/browse/SPARK-23551
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> This issue aims to prevent the `orc-mapreduce` dependency from confusing IDEs 
> and Maven.
> *BEFORE*
> {code}
> $ mvn dependency:tree -Phadoop-2.7 
> -Dincludes=org.apache.hadoop:hadoop-mapreduce-client-core
> ...
> [INFO] 
> 
> [INFO] Building Spark Project Catalyst 2.4.0-SNAPSHOT
> [INFO] 
> 
> [INFO]
> [INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ 
> spark-catalyst_2.11 ---
> [INFO] org.apache.spark:spark-catalyst_2.11:jar:2.4.0-SNAPSHOT
> [INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
> [INFO]\- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
> [INFO]   \- 
> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
> [INFO]
> [INFO] 
> 
> [INFO] Building Spark Project SQL 2.4.0-SNAPSHOT
> [INFO] 
> 
> [INFO]
> [INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ spark-sql_2.11 
> ---
> [INFO] org.apache.spark:spark-sql_2.11:jar:2.4.0-SNAPSHOT
> [INFO] \- org.apache.orc:orc-mapreduce:jar:nohive:1.4.3:compile
> [INFO]\- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.6.4:compile
> {code}
> *AFTER*
> {code}
> $ mvn dependency:tree -Phadoop-2.7 
> -Dincludes=org.apache.hadoop:hadoop-mapreduce-client-core
> ...
> [INFO] 
> 
> [INFO] Building Spark Project Catalyst 2.4.0-SNAPSHOT
> [INFO] 
> 
> [INFO]
> [INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ 
> spark-catalyst_2.11 ---
> [INFO] org.apache.spark:spark-catalyst_2.11:jar:2.4.0-SNAPSHOT
> [INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
> [INFO]\- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
> [INFO]   \- 
> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
> [INFO]
> [INFO] 
> 
> [INFO] Building Spark Project SQL 2.4.0-SNAPSHOT
> [INFO] 
> 
> [INFO]
> [INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ spark-sql_2.11 
> ---
> [INFO] org.apache.spark:spark-sql_2.11:jar:2.4.0-SNAPSHOT
> [INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
> [INFO]\- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
> [INFO]   \- 
> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23551) Exclude `hadoop-mapreduce-client-core` dependency from `orc-mapreduce`

2018-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23551:


Assignee: (was: Apache Spark)

> Exclude `hadoop-mapreduce-client-core` dependency from `orc-mapreduce`
> --
>
> Key: SPARK-23551
> URL: https://issues.apache.org/jira/browse/SPARK-23551
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue aims to prevent the `orc-mapreduce` dependency from confusing IDEs 
> and Maven.
> *BEFORE*
> {code}
> $ mvn dependency:tree -Phadoop-2.7 
> -Dincludes=org.apache.hadoop:hadoop-mapreduce-client-core
> ...
> [INFO] 
> 
> [INFO] Building Spark Project Catalyst 2.4.0-SNAPSHOT
> [INFO] 
> 
> [INFO]
> [INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ 
> spark-catalyst_2.11 ---
> [INFO] org.apache.spark:spark-catalyst_2.11:jar:2.4.0-SNAPSHOT
> [INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
> [INFO]\- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
> [INFO]   \- 
> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
> [INFO]
> [INFO] 
> 
> [INFO] Building Spark Project SQL 2.4.0-SNAPSHOT
> [INFO] 
> 
> [INFO]
> [INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ spark-sql_2.11 
> ---
> [INFO] org.apache.spark:spark-sql_2.11:jar:2.4.0-SNAPSHOT
> [INFO] \- org.apache.orc:orc-mapreduce:jar:nohive:1.4.3:compile
> [INFO]\- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.6.4:compile
> {code}
> *AFTER*
> {code}
> $ mvn dependency:tree -Phadoop-2.7 
> -Dincludes=org.apache.hadoop:hadoop-mapreduce-client-core
> ...
> [INFO] 
> 
> [INFO] Building Spark Project Catalyst 2.4.0-SNAPSHOT
> [INFO] 
> 
> [INFO]
> [INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ 
> spark-catalyst_2.11 ---
> [INFO] org.apache.spark:spark-catalyst_2.11:jar:2.4.0-SNAPSHOT
> [INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
> [INFO]\- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
> [INFO]   \- 
> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
> [INFO]
> [INFO] 
> 
> [INFO] Building Spark Project SQL 2.4.0-SNAPSHOT
> [INFO] 
> 
> [INFO]
> [INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ spark-sql_2.11 
> ---
> [INFO] org.apache.spark:spark-sql_2.11:jar:2.4.0-SNAPSHOT
> [INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
> [INFO]\- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
> [INFO]   \- 
> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23551) Exclude `hadoop-mapreduce-client-core` dependency from `orc-mapreduce`

2018-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382392#comment-16382392
 ] 

Apache Spark commented on SPARK-23551:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/20704

> Exclude `hadoop-mapreduce-client-core` dependency from `orc-mapreduce`
> --
>
> Key: SPARK-23551
> URL: https://issues.apache.org/jira/browse/SPARK-23551
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue aims to prevent the `orc-mapreduce` dependency from confusing IDEs 
> and Maven.
> *BEFORE*
> {code}
> $ mvn dependency:tree -Phadoop-2.7 
> -Dincludes=org.apache.hadoop:hadoop-mapreduce-client-core
> ...
> [INFO] 
> 
> [INFO] Building Spark Project Catalyst 2.4.0-SNAPSHOT
> [INFO] 
> 
> [INFO]
> [INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ 
> spark-catalyst_2.11 ---
> [INFO] org.apache.spark:spark-catalyst_2.11:jar:2.4.0-SNAPSHOT
> [INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
> [INFO]\- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
> [INFO]   \- 
> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
> [INFO]
> [INFO] 
> 
> [INFO] Building Spark Project SQL 2.4.0-SNAPSHOT
> [INFO] 
> 
> [INFO]
> [INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ spark-sql_2.11 
> ---
> [INFO] org.apache.spark:spark-sql_2.11:jar:2.4.0-SNAPSHOT
> [INFO] \- org.apache.orc:orc-mapreduce:jar:nohive:1.4.3:compile
> [INFO]\- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.6.4:compile
> {code}
> *AFTER*
> {code}
> $ mvn dependency:tree -Phadoop-2.7 
> -Dincludes=org.apache.hadoop:hadoop-mapreduce-client-core
> ...
> [INFO] 
> 
> [INFO] Building Spark Project Catalyst 2.4.0-SNAPSHOT
> [INFO] 
> 
> [INFO]
> [INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ 
> spark-catalyst_2.11 ---
> [INFO] org.apache.spark:spark-catalyst_2.11:jar:2.4.0-SNAPSHOT
> [INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
> [INFO]\- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
> [INFO]   \- 
> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
> [INFO]
> [INFO] 
> 
> [INFO] Building Spark Project SQL 2.4.0-SNAPSHOT
> [INFO] 
> 
> [INFO]
> [INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ spark-sql_2.11 
> ---
> [INFO] org.apache.spark:spark-sql_2.11:jar:2.4.0-SNAPSHOT
> [INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
> [INFO]\- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
> [INFO]   \- 
> org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23471) RandomForestClassificationModel save() - incorrect metadata

2018-03-01 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-23471.
---
Resolution: Cannot Reproduce

I'll close this for now, but please say if it's actually an issue.   Thank you!
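
For anyone who wants to re-verify, here is a rough round-trip sketch through the 
public API; the save path and the asserted values are placeholders, and `data` 
is the training Dataset from the report below.

{code:scala}
// Re-check sketch: train via fit(), save, reload, and compare the persisted params.
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}

val rf = new RandomForestClassifier()
  .setFeaturesCol("features")
  .setLabelCol("result")
  .setNumTrees(100)
  .setMaxDepth(30)

val model = rf.fit(data)                  // public fit() API rather than train()
model.save("/tmp/rf-model-check")         // placeholder path

val reloaded = RandomForestClassificationModel.load("/tmp/rf-model-check")
assert(reloaded.getNumTrees == 100)
assert(reloaded.getMaxDepth == 30)
{code}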

> RandomForestClassificationModel save() - incorrect metadata
> ---
>
> Key: SPARK-23471
> URL: https://issues.apache.org/jira/browse/SPARK-23471
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Keepun
>Priority: Major
>
> RandomForestClassificationMode.load() does not work after save() 
> {code:java}
> RandomForestClassifier rf = new RandomForestClassifier()
> .setFeaturesCol("features")
> .setLabelCol("result")
> .setNumTrees(100)
> .setMaxDepth(30)
> .setMinInstancesPerNode(1)
> //.setCacheNodeIds(true)
> .setMaxMemoryInMB(500)
> .setSeed(System.currentTimeMillis() + System.nanoTime());
> RandomForestClassificationModel rfmodel = rf.train(data);
>try {
>   rfmodel.save(args[2] + "." + System.currentTimeMillis());
>} catch (IOException e) {
>   LOG.error(e.getMessage(), e);
>   e.printStackTrace();
>}
> {code}
> File metadata\part-0: 
> {code:java}
> {"class":"org.apache.spark.ml.classification.RandomForestClassificationModel",
> "timestamp":1519136783983,"sparkVersion":"2.2.1","uid":"rfc_7c7e84ce7488",
> "paramMap":{"featureSubsetStrategy":"auto","cacheNodeIds":false,"impurity":"gini",
> "checkpointInterval":10,
> "numTrees":20,"maxDepth":5,
> "probabilityCol":"probability","labelCol":"label","featuresCol":"features",
> "maxMemoryInMB":256,"minInstancesPerNode":1,"subsamplingRate":1.0,
> "rawPredictionCol":"rawPrediction","predictionCol":"prediction","maxBins":32,
> "minInfoGain":0.0,"seed":-491520797},"numFeatures":1354,"numClasses":2,
> "numTrees":20}
> {code}
> should be:
> {code:java}
> "numTrees":100,"maxDepth":30,{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23551) Exclude `hadoop-mapreduce-client-core` dependency from `orc-mapreduce`

2018-03-01 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-23551:
-

 Summary: Exclude `hadoop-mapreduce-client-core` dependency from 
`orc-mapreduce`
 Key: SPARK-23551
 URL: https://issues.apache.org/jira/browse/SPARK-23551
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.3.0
Reporter: Dongjoon Hyun


This issue aims to prevent the `orc-mapreduce` dependency from confusing IDEs 
and Maven.

*BEFORE*
{code}
$ mvn dependency:tree -Phadoop-2.7 
-Dincludes=org.apache.hadoop:hadoop-mapreduce-client-core
...
[INFO] 
[INFO] Building Spark Project Catalyst 2.4.0-SNAPSHOT
[INFO] 
[INFO]
[INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ 
spark-catalyst_2.11 ---
[INFO] org.apache.spark:spark-catalyst_2.11:jar:2.4.0-SNAPSHOT
[INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
[INFO]\- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
[INFO]   \- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
[INFO]
[INFO] 
[INFO] Building Spark Project SQL 2.4.0-SNAPSHOT
[INFO] 
[INFO]
[INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ spark-sql_2.11 ---
[INFO] org.apache.spark:spark-sql_2.11:jar:2.4.0-SNAPSHOT
[INFO] \- org.apache.orc:orc-mapreduce:jar:nohive:1.4.3:compile
[INFO]\- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.6.4:compile
{code}

*AFTER*
{code}
$ mvn dependency:tree -Phadoop-2.7 
-Dincludes=org.apache.hadoop:hadoop-mapreduce-client-core
...
[INFO] 
[INFO] Building Spark Project Catalyst 2.4.0-SNAPSHOT
[INFO] 
[INFO]
[INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ 
spark-catalyst_2.11 ---
[INFO] org.apache.spark:spark-catalyst_2.11:jar:2.4.0-SNAPSHOT
[INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
[INFO]\- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
[INFO]   \- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
[INFO]
[INFO] 
[INFO] Building Spark Project SQL 2.4.0-SNAPSHOT
[INFO] 
[INFO]
[INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) @ spark-sql_2.11 ---
[INFO] org.apache.spark:spark-sql_2.11:jar:2.4.0-SNAPSHOT
[INFO] \- org.apache.spark:spark-core_2.11:jar:2.4.0-SNAPSHOT:compile
[INFO]\- org.apache.hadoop:hadoop-client:jar:2.7.3:compile
[INFO]   \- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.3:compile
{code}
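
For downstream builds that run into the same mixed versions, a rough sketch of 
the analogous exclusion in sbt syntax is below (coordinates copied from the 
trees above; whether it is needed depends on your own dependency graph, and the 
actual change for this issue is in Spark's own Maven pom):

{code:scala}
// build.sbt sketch: keep orc-mapreduce but drop its transitive
// hadoop-mapreduce-client-core so the hadoop-client version wins.
libraryDependencies += (
  ("org.apache.orc" % "orc-mapreduce" % "1.4.3" classifier "nohive")
    .excludeAll(ExclusionRule(organization = "org.apache.hadoop", name = "hadoop-mapreduce-client-core"))
)
{code}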



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23550) Cleanup unused / redundant methods in Utils object

2018-03-01 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-23550:
--

 Summary: Cleanup unused / redundant methods in Utils object
 Key: SPARK-23550
 URL: https://issues.apache.org/jira/browse/SPARK-23550
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Marcelo Vanzin


While looking at some code in {{Utils}} for a different purpose, I noticed a 
bunch of code there that can be removed or otherwise cleaned up.

I'll send a PR after I run unit tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-10908) ClassCastException in HadoopRDD.getJobConf

2018-03-01 Thread Riccardo Vincelli (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Riccardo Vincelli updated SPARK-10908:
--
Comment: was deleted

(was: Hi, I am encountering this as well, running two local Spark contexts on 
the same JVM while pouring an RDD into the local memory with a 
{{toLocalIterator}}. This might be just inherently incompatible configuration, 
but it is weird. Thanks,)

> ClassCastException in HadoopRDD.getJobConf
> --
>
> Key: SPARK-10908
> URL: https://issues.apache.org/jira/browse/SPARK-10908
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Franco
>Priority: Major
>
> Whilst running a Spark SQL job (I can't provide an explain plan as many of 
> these are happening concurrently) the following exception is thrown:
> java.lang.ClassCastException: [B cannot be cast to 
> org.apache.spark.util.SerializableConfiguration
> at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:144)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.ShuffleDependency.(Dependency.scala:82)
> at 
> org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:78)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10908) ClassCastException in HadoopRDD.getJobConf

2018-03-01 Thread Riccardo Vincelli (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382241#comment-16382241
 ] 

Riccardo Vincelli edited comment on SPARK-10908 at 3/1/18 4:19 PM:
---

Hi, I am encountering this as well, running two local Spark contexts on the 
same JVM while pouring an RDD into the local memory with a `toLocalIterator`. 
This might be just inherently incompatible configuration, but it is weird. 
Thanks,


was (Author: rvincelli):
Hi, I am encountering this as well, running two local Spark contexts on the 
same JVM while performing HDFS operations. This might be just inherently 
incompatible configuration, but it is weird. Thanks,

> ClassCastException in HadoopRDD.getJobConf
> --
>
> Key: SPARK-10908
> URL: https://issues.apache.org/jira/browse/SPARK-10908
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Franco
>Priority: Major
>
> Whilst running a Spark SQL job (I can't provide an explain plan as many of 
> these are happening concurrently) the following exception is thrown:
> java.lang.ClassCastException: [B cannot be cast to 
> org.apache.spark.util.SerializableConfiguration
> at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:144)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.ShuffleDependency.(Dependency.scala:82)
> at 
> org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:78)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10908) ClassCastException in HadoopRDD.getJobConf

2018-03-01 Thread Riccardo Vincelli (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382241#comment-16382241
 ] 

Riccardo Vincelli edited comment on SPARK-10908 at 3/1/18 4:19 PM:
---

Hi, I am encountering this as well, running two local Spark contexts on the 
same JVM while pouring an RDD into the local memory with a {{toLocalIterator}}. 
This might be just inherently incompatible configuration, but it is weird. 
Thanks,


was (Author: rvincelli):
Hi, I am encountering this as well, running two local Spark contexts on the 
same JVM while pouring an RDD into the local memory with a `toLocalIterator`. 
This might be just inherently incompatible configuration, but it is weird. 
Thanks,

> ClassCastException in HadoopRDD.getJobConf
> --
>
> Key: SPARK-10908
> URL: https://issues.apache.org/jira/browse/SPARK-10908
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Franco
>Priority: Major
>
> Whilst running a Spark SQL job (I can't provide an explain plan as many of 
> these are happening concurrently) the following exception is thrown:
> java.lang.ClassCastException: [B cannot be cast to 
> org.apache.spark.util.SerializableConfiguration
> at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:144)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.ShuffleDependency.(Dependency.scala:82)
> at 
> org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:78)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10908) ClassCastException in HadoopRDD.getJobConf

2018-03-01 Thread Riccardo Vincelli (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382241#comment-16382241
 ] 

Riccardo Vincelli commented on SPARK-10908:
---

Hi, I am encountering this as well, running two local Spark contexts on the 
same JVM while performing HDFS operations. This might be just inherently 
incompatible configuration, but it is weird. Thanks,

> ClassCastException in HadoopRDD.getJobConf
> --
>
> Key: SPARK-10908
> URL: https://issues.apache.org/jira/browse/SPARK-10908
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Franco
>Priority: Major
>
> Whilst running a Spark SQL job (I can't provide an explain plan as many of 
> these are happening concurrently) the following exception is thrown:
> java.lang.ClassCastException: [B cannot be cast to 
> org.apache.spark.util.SerializableConfiguration
> at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:144)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.ShuffleDependency.(Dependency.scala:82)
> at 
> org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:78)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23010) Add integration testing for Kubernetes backend into the apache/spark repository

2018-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23010:


Assignee: (was: Apache Spark)

> Add integration testing for Kubernetes backend into the apache/spark 
> repository
> ---
>
> Key: SPARK-23010
> URL: https://issues.apache.org/jira/browse/SPARK-23010
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Anirudh Ramanathan
>Priority: Major
>
> Add tests for the scheduler backend into apache/spark
> /xref: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Integration-testing-and-Scheduler-Backends-td23105.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23010) Add integration testing for Kubernetes backend into the apache/spark repository

2018-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23010:


Assignee: Apache Spark

> Add integration testing for Kubernetes backend into the apache/spark 
> repository
> ---
>
> Key: SPARK-23010
> URL: https://issues.apache.org/jira/browse/SPARK-23010
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Anirudh Ramanathan
>Assignee: Apache Spark
>Priority: Major
>
> Add tests for the scheduler backend into apache/spark
> /xref: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Integration-testing-and-Scheduler-Backends-td23105.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23010) Add integration testing for Kubernetes backend into the apache/spark repository

2018-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382233#comment-16382233
 ] 

Apache Spark commented on SPARK-23010:
--

User 'ssuchter' has created a pull request for this issue:
https://github.com/apache/spark/pull/20697

> Add integration testing for Kubernetes backend into the apache/spark 
> repository
> ---
>
> Key: SPARK-23010
> URL: https://issues.apache.org/jira/browse/SPARK-23010
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Anirudh Ramanathan
>Priority: Major
>
> Add tests for the scheduler backend into apache/spark
> /xref: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Integration-testing-and-Scheduler-Backends-td23105.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23405) The task will hang up when a small table left semi join a big table

2018-03-01 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23405:
---

Assignee: KaiXinXIaoLei

> The task will hang up when a small table left semi join a big table
> ---
>
> Key: SPARK-23405
> URL: https://issues.apache.org/jira/browse/SPARK-23405
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: KaiXinXIaoLei
>Assignee: KaiXinXIaoLei
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: SQL.png, taskhang up.png
>
>
> I run a SQL query: `select ls.cs_order_number from ls left semi join 
> catalog_sales cs on ls.cs_order_number = cs.cs_order_number`. The `ls` table 
> is a small table with a single row, and the `catalog_sales` table is a big 
> table with 10 billion rows. The task hangs:
> !taskhang up.png!
> And the SQL page is:
> !SQL.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23405) The task will hang up when a small table left semi join a big table

2018-03-01 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23405.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20670
[https://github.com/apache/spark/pull/20670]

> The task will hang up when a small table left semi join a big table
> ---
>
> Key: SPARK-23405
> URL: https://issues.apache.org/jira/browse/SPARK-23405
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: KaiXinXIaoLei
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: SQL.png, taskhang up.png
>
>
> I run a SQL query: `select ls.cs_order_number from ls left semi join 
> catalog_sales cs on ls.cs_order_number = cs.cs_order_number`. The `ls` table 
> is a small table with a single row, and the `catalog_sales` table is a big 
> table with 10 billion rows. The task hangs:
> !taskhang up.png!
> And the SQL page is:
> !SQL.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing

2018-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19185:


Assignee: (was: Apache Spark)

> ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
> -
>
> Key: SPARK-19185
> URL: https://issues.apache.org/jira/browse/SPARK-19185
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: Spark 2.0.2
> Spark Streaming Kafka 010
> Mesos 0.28.0 - client mode
> spark.executor.cores 1
> spark.mesos.extra.cores 1
>Reporter: Kalvin Chau
>Priority: Major
>  Labels: streaming, windowing
>
> We've been running into ConcurrentModificationExceptions "KafkaConsumer is 
> not safe for multi-threaded access" with the CachedKafkaConsumer. I've been 
> working through debugging this issue and after looking through some of the 
> spark source code I think this is a bug.
> Our set up is:
> Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using 
> Spark-Streaming-Kafka-010
> spark.executor.cores 1
> spark.mesos.extra.cores 1
> Batch interval: 10s, window interval: 180s, and slide interval: 30s
> We would see the exception when in one executor there are two task worker 
> threads assigned the same Topic+Partition, but a different set of offsets.
> They would both get the same CachedKafkaConsumer, and whichever task thread 
> went first would seek and poll for all the records, and at the same time the 
> second thread would try to seek to its offset but fail because it is unable 
> to acquire the lock.
> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y
> Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z
> Time1 E0 Task0 - Seeks and starts to poll
> Time1 E0 Task1 - Attempts to seek, but fails
> Here are some relevant logs:
> {code}
> 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394204414 -> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394238058 -> 4394257712
> 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394204414
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: 
> Initial fetch for spark-executor-consumer test-topic 2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Seeking to test-topic-2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting 
> block rdd_199_2 failed due to an exception
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block 
> rdd_199_2 could not be removed as it was not found on disk or in memory
> 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in 
> task 49.0 in stage 45.0 (TID 3201)
> java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
> multi-threaded access
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:360)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:951)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)

[jira] [Commented] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing

2018-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382119#comment-16382119
 ] 

Apache Spark commented on SPARK-19185:
--

User 'gaborgsomogyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/20703

> ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
> -
>
> Key: SPARK-19185
> URL: https://issues.apache.org/jira/browse/SPARK-19185
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: Spark 2.0.2
> Spark Streaming Kafka 010
> Mesos 0.28.0 - client mode
> spark.executor.cores 1
> spark.mesos.extra.cores 1
>Reporter: Kalvin Chau
>Priority: Major
>  Labels: streaming, windowing
>
> We've been running into ConcurrentModificationExceptions "KafkaConsumer is 
> not safe for multi-threaded access" with the CachedKafkaConsumer. I've been 
> working through debugging this issue and after looking through some of the 
> spark source code I think this is a bug.
> Our set up is:
> Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using 
> Spark-Streaming-Kafka-010
> spark.executor.cores 1
> spark.mesos.extra.cores 1
> Batch interval: 10s, window interval: 180s, and slide interval: 30s
> We would see the exception when in one executor there are two task worker 
> threads assigned the same Topic+Partition, but a different set of offsets.
> They would both get the same CachedKafkaConsumer, and whichever task thread 
> went first would seek and poll for all the records, and at the same time the 
> second thread would try to seek to its offset but fail because it is unable 
> to acquire the lock.
> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y
> Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z
> Time1 E0 Task0 - Seeks and starts to poll
> Time1 E0 Task1 - Attempts to seek, but fails
> Here are some relevant logs:
> {code}
> 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394204414 -> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394238058 -> 4394257712
> 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394204414
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: 
> Initial fetch for spark-executor-consumer test-topic 2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Seeking to test-topic-2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting 
> block rdd_199_2 failed due to an exception
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block 
> rdd_199_2 could not be removed as it was not found on disk or in memory
> 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in 
> task 49.0 in stage 45.0 (TID 3201)
> java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
> multi-threaded access
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:360)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:951)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
>   at 
> 

[jira] [Assigned] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing

2018-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19185:


Assignee: Apache Spark

> ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
> -
>
> Key: SPARK-19185
> URL: https://issues.apache.org/jira/browse/SPARK-19185
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: Spark 2.0.2
> Spark Streaming Kafka 010
> Mesos 0.28.0 - client mode
> spark.executor.cores 1
> spark.mesos.extra.cores 1
>Reporter: Kalvin Chau
>Assignee: Apache Spark
>Priority: Major
>  Labels: streaming, windowing
>
> We've been running into ConcurrentModificationExceptions "KafkaConsumer is 
> not safe for multi-threaded access" with the CachedKafkaConsumer. I've been 
> working through debugging this issue and after looking through some of the 
> spark source code I think this is a bug.
> Our set up is:
> Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using 
> Spark-Streaming-Kafka-010
> spark.executor.cores 1
> spark.mesos.extra.cores 1
> Batch interval: 10s, window interval: 180s, and slide interval: 30s
> We would see the exception when in one executor there are two task worker 
> threads assigned the same Topic+Partition, but a different set of offsets.
> They would both get the same CachedKafkaConsumer, and whichever task thread 
> went first would seek and poll for all the records, and at the same time the 
> second thread would try to seek to its offset but fail because it is unable 
> to acquire the lock.
> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y
> Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z
> Time1 E0 Task0 - Seeks and starts to poll
> Time1 E0 Task1 - Attempts to seek, but fails
> Here are some relevant logs:
> {code}
> 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394204414 -> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394238058 -> 4394257712
> 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394204414
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: 
> Initial fetch for spark-executor-consumer test-topic 2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Seeking to test-topic-2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting 
> block rdd_199_2 failed due to an exception
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block 
> rdd_199_2 could not be removed as it was not found on disk or in memory
> 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in 
> task 49.0 in stage 45.0 (TID 3201)
> java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
> multi-threaded access
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:360)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:951)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
>   at 

[jira] [Commented] (SPARK-23443) Spark with Glue as external catalog

2018-03-01 Thread Ameen Tayyebi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382117#comment-16382117
 ] 

Ameen Tayyebi commented on SPARK-23443:
---

Great, thank you so much. I've been stuck with a bunch of other tasks at work
unfortunately. I think I'll be able to pick this up again in 2-3 weeks.
I'll try to get a small change out so that we can start iterating on it.




> Spark with Glue as external catalog
> ---
>
> Key: SPARK-23443
> URL: https://issues.apache.org/jira/browse/SPARK-23443
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Ameen Tayyebi
>Priority: Major
>
> AWS Glue Catalog is an external Hive metastore backed by a web service. It 
> allows permanent storage of catalog data for BigData use cases.
> To find out more information about AWS Glue, please consult:
>  * AWS Glue - [https://aws.amazon.com/glue/]
>  * Using Glue as a Metastore catalog for Spark - 
> [https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html]
> Today, the integration of Glue and Spark is through the Hive layer. Glue 
> implements the IMetaStore interface of Hive and for installations of Spark 
> that contain Hive, Glue can be used as the metastore.
> The feature set that Glue supports does not align 1-1 with the set of features 
> that the latest version of Spark supports. For example, the Glue interface 
> supports more advanced partition pruning than the latest version of Hive 
> embedded in Spark.
> To enable a more natural integration with Spark and to allow leveraging the 
> latest features of Glue without being coupled to Hive, a direct integration 
> through Spark's own Catalog API is proposed. This JIRA tracks that work.
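For reference, a minimal sketch of how the Hive-layer integration described above is typically wired up today (assumptions on my part: the Glue Hive metastore client jar is on the classpath and the factory class name matches the EMR guide linked above; this is exactly the coupling the proposal wants to remove):
{code:java}
// Sketch only: route Hive metastore calls to Glue via the client factory, then use
// Spark's Hive support as usual. Jar availability and class name are assumptions.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("glue-as-hive-metastore")
  .config("hive.metastore.client.factory.class",
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW DATABASES").show()  // databases should now come from the Glue catalog
{code}
A direct Catalog API integration would replace this Hive detour with a Glue-backed catalog implementation configured on the Spark side.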



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23549) Spark SQL unexpected behavior when comparing timestamp to date

2018-03-01 Thread Dong Jiang (JIRA)
Dong Jiang created SPARK-23549:
--

 Summary: Spark SQL unexpected behavior when comparing timestamp to 
date
 Key: SPARK-23549
 URL: https://issues.apache.org/jira/browse/SPARK-23549
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1
Reporter: Dong Jiang


{code:java}
scala> spark.version

res1: String = 2.2.1

scala> spark.sql("select cast('2017-03-01 00:00:00' as timestamp) between 
cast('2017-02-28' as date) and cast('2017-03-01' as date)").show


+---+
|((CAST(CAST(2017-03-01 00:00:00 AS TIMESTAMP) AS STRING) >= CAST(CAST(2017-02-28 AS DATE) AS STRING)) AND (CAST(CAST(2017-03-01 00:00:00 AS TIMESTAMP) AS STRING) <= CAST(CAST(2017-03-01 AS DATE) AS STRING)))|
+---+
|false|
+---+{code}
As shown above, when a timestamp is compared to a date in Spark SQL, both the 
timestamp and the date are downcast to string, leading to an unexpected result. 
If I run the same SQL in Presto/Athena, I get the expected result:
{code:java}
select cast('2017-03-01 00:00:00' as timestamp) between cast('2017-02-28' as 
date) and cast('2017-03-01' as date)
    _col0
1   true{code}
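One possible workaround (my addition, not from the reporter): keep the comparison in timestamp space by casting the DATE bounds to TIMESTAMP explicitly, which avoids the implicit string downcast.
{code:java}
// Sketch only (assumes an existing SparkSession named `spark`): comparing
// timestamp-to-timestamp avoids the string-based coercion shown above.
spark.sql("""
  SELECT CAST('2017-03-01 00:00:00' AS TIMESTAMP)
         BETWEEN CAST(CAST('2017-02-28' AS DATE) AS TIMESTAMP)
             AND CAST(CAST('2017-03-01' AS DATE) AS TIMESTAMP)
""").show()
// expected to print a single row containing true
{code}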



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23549) Spark SQL unexpected behavior when comparing timestamp to date

2018-03-01 Thread Dong Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Jiang updated SPARK-23549:
---
Description: 
{code:java}
scala> spark.version

res1: String = 2.2.1

scala> spark.sql("select cast('2017-03-01 00:00:00' as timestamp) between 
cast('2017-02-28' as date) and cast('2017-03-01' as date)").show


+---+
|((CAST(CAST(2017-03-01 00:00:00 AS TIMESTAMP) AS STRING) >= CAST(CAST(2017-02-28 AS DATE) AS STRING)) AND (CAST(CAST(2017-03-01 00:00:00 AS TIMESTAMP) AS STRING) <= CAST(CAST(2017-03-01 AS DATE) AS STRING)))|
+---+
|false|
+---+{code}
As shown above, when a timestamp is compared to a date in Spark SQL, both the 
timestamp and the date are downcast to string, leading to an unexpected result. 
If I run the same SQL in Presto/Athena, I get the expected result:
{code:java}
select cast('2017-03-01 00:00:00' as timestamp) between cast('2017-02-28' as 
date) and cast('2017-03-01' as date)
    _col0
1   true
{code}

Is this a bug in Spark, or is it the intended behavior?

  was:
{code:java}
scala> spark.version

res1: String = 2.2.1

scala> spark.sql("select cast('2017-03-01 00:00:00' as timestamp) between 
cast('2017-02-28' as date) and cast('2017-03-01' as date)").show


+---+
|((CAST(CAST(2017-03-01 00:00:00 AS TIMESTAMP) AS STRING) >= CAST(CAST(2017-02-28 AS DATE) AS STRING)) AND (CAST(CAST(2017-03-01 00:00:00 AS TIMESTAMP) AS STRING) <= CAST(CAST(2017-03-01 AS DATE) AS STRING)))|
+---+
|false|
+---+{code}
As shown above, when a timestamp is compared to a date in Spark SQL, both the 
timestamp and the date are downcast to string, leading to an unexpected result. 
If I run the same SQL in Presto/Athena, I get the expected result:
{code:java}
select cast('2017-03-01 00:00:00' as timestamp) between cast('2017-02-28' as 
date) and cast('2017-03-01' as date)
    _col0
1   true{code}


> Spark SQL unexpected behavior when comparing timestamp to date
> --
>
> Key: SPARK-23549
> URL: https://issues.apache.org/jira/browse/SPARK-23549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Dong Jiang
>Priority: Major
>
> {code:java}
> scala> spark.version
> res1: String = 2.2.1
> scala> spark.sql("select cast('2017-03-01 00:00:00' as timestamp) between 
> cast('2017-02-28' as date) and cast('2017-03-01' as date)").show
> +---+
> |((CAST(CAST(2017-03-01 00:00:00 AS TIMESTAMP) AS STRING) >= 
> CAST(CAST(2017-02-28 AS DATE) AS STRING)) AND (CAST(CAST(2017-03-01 00:00:00 
> AS TIMESTAMP) AS STRING) <= CAST(CAST(2017-03-01 AS DATE) AS STRING)))|
> +---+
> |                                                                             
>                                                                               
>                                        

[jira] [Commented] (SPARK-23443) Spark with Glue as external catalog

2018-03-01 Thread Devin Boyer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382085#comment-16382085
 ] 

Devin Boyer commented on SPARK-23443:
-

I would also be interested in helping if needed, or certainly testing this!

> Spark with Glue as external catalog
> ---
>
> Key: SPARK-23443
> URL: https://issues.apache.org/jira/browse/SPARK-23443
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Ameen Tayyebi
>Priority: Major
>
> AWS Glue Catalog is an external Hive metastore backed by a web service. It 
> allows permanent storage of catalog data for BigData use cases.
> To find out more information about AWS Glue, please consult:
>  * AWS Glue - [https://aws.amazon.com/glue/]
>  * Using Glue as a Metastore catalog for Spark - 
> [https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html]
> Today, the integration of Glue and Spark is through the Hive layer. Glue 
> implements the IMetaStore interface of Hive and for installations of Spark 
> that contain Hive, Glue can be used as the metastore.
> The feature set that Glue supports does not align 1-1 with the set of features 
> that the latest version of Spark supports. For example, the Glue interface 
> supports more advanced partition pruning than the latest version of Hive 
> embedded in Spark.
> To enable a more natural integration with Spark and to allow leveraging the 
> latest features of Glue without being coupled to Hive, a direct integration 
> through Spark's own Catalog API is proposed. This JIRA tracks that work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23548) Redirect loop from Resourcemanager to Spark Webui

2018-03-01 Thread Dieter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dieter resolved SPARK-23548.

Resolution: Not A Problem

Resolved; this was a name resolution issue.

> Redirect loop from Resourcemanager to Spark Webui
> -
>
> Key: SPARK-23548
> URL: https://issues.apache.org/jira/browse/SPARK-23548
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI, YARN
>Affects Versions: 2.3.0
> Environment: *SW*
> hadoop-3.0.0
> Spark 2.3.0 (git revision a0d7949896) built for Hadoop 2.7.3
>  
> *Nodes*
> jupyter
> proxyserver
> historyserver
> resourcemanager
> nodemanager ([http://9d581cfb391f:8042|http://9d581cfb391f:8042/])
>  
> *Config of sparkcontext*
> spark.app.id=application_1519904229414_0011
> spark.app.name=Spark shell
> spark.driver.appUIAddress=http://jupyter:4040
> spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/tmp/driver_log4j.properties
> spark.driver.host=jupyter
> spark.driver.port=35813
> spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/tmp/driver_log4j.properties
> spark.executor.id=driver
> spark.home=/usr/local/spark
> spark.jars=
> spark.logConf=true
> spark.master=yarn
> spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS=proxyserver
> spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES=[http://proxyserver:8089/proxy/application_1519904229414_0011]
>  
>  
> *yarn-site.xml*
> <configuration>
>   <property><name>yarn.web-proxy.address</name><value>proxyserver:8089</value></property>
>   <property><name>yarn.resourcemanager.fs.state-store.uri</name><value>/rmstate</value></property>
>   <property><name>yarn.resourcemanager.webapp.address</name><value>resourcemanager:8088</value></property>
>   <property><name>yarn.timeline-service.generic-application-history.enabled</name><value>true</value></property>
>   <property><name>yarn.resourcemanager.recovery.enabled</name><value>true</value></property>
>   <property><name>yarn.nodemanager.bind-host</name><value>0.0.0.0</value></property>
>   <property><name>yarn.timeline-service.bind-host</name><value>0.0.0.0</value></property>
>   <property><name>yarn.log-aggregation-enable</name><value>true</value></property>
>   <property><name>yarn.timeline-service.enabled</name><value>true</value></property>
>   <property><name>yarn.resourcemanager.store.class</name><value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value></property>
>   <property><name>yarn.resourcemanager.system-metrics-publisher.enabled</name><value>true</value></property>
>   <property><name>yarn.nodemanager.remote-app-log-dir</name><value>/app-logs</value></property>
>   <property><name>yarn.nodemanager.vmem-pmem-ratio</name><value>4.1</value></property>
>   <property><name>yarn.resourcemanager.resource-tracker.address</name><value>resourcemanager:8031</value></property>
>   <property><name>yarn.resourcemanager.hostname</name><value>resourcemanager</value></property>
>   <property><name>yarn.timeline-service.hostname</name><value>historyserver</value></property>
>   <property><name>yarn.log.server.url</name><value>http://historyserver:8188/applicationhistory/logs/</value></property>
>   <property><name>yarn.resourcemanager.bind-host</name><value>0.0.0.0</value></property>
>   <property><name>yarn.resourcemanager.scheduler.address</name><value>resourcemanager:8030</value></property>
>   <property><name>yarn.resourcemanager.address</name><value>resourcemanager:8032</value></property>
> </configuration>
>Reporter: Dieter
>Priority: Major
>
> Access from the ResourceManager to the web UI in a distributed environment 
> results in a redirect loop. It seems the web UI does not recognize that the 
> request is coming from the ResourceManager proxy.
>  
> *Situation*
> Spark application running on jupyter
>  
> *link in resourcemanger:8088 (behind Trackingui->Applicationmaster)*
> [http://proxyserver:8089/proxy/application_1519904229414_0011/]
>  
> *Response of webui on jupyter:4040 (redirects to proxyserver)*
> {{Moved: Content has moved <a href="http://proxyserver:8089/proxy/application_1519904229414_0011/">here</a>}}
>  
> *log output of proxyserver (results in redirect)*
> March 1st 2018, 12:38:50.000 
> /hadoop_proxyserver.1.ftfomwpuh9pmpwm87aon122vl proxyserver 
> stderr | 2018-03-01 11:38:49,905 DEBUG strategy.ExecuteProduceConsume: EPC 
> Prod/org.eclipse.jetty.io.ManagedSelector$SelectorProducer@5c4d391a producing 
> 
> March 1st 2018, 12:38:50.000 
> /hadoop_proxyserver.1.ftfomwpuh9pmpwm87aon122vl proxyserver 
> stderr | 2018-03-01 11:38:49,908 DEBUG protocol.RequestAuthCache: Auth cache 
> not set in the context
> March 1st 2018, 12:38:50.000 
> /hadoop_proxyserver.1.ftfomwpuh9pmpwm87aon122vl proxyserver 
> stderr | 2018-03-01 11:38:49,908 DEBUG http.wire: >> "[\r][\n]"
> March 1st 2018, 12:38:50.000 
> /hadoop_proxyserver.1.ftfomwpuh9pmpwm87aon122vl proxyserver 
> stderr | 2018-03-01 11:38:49,908 DEBUG http.headers: >> GET / HTTP/1.1
>   
> March 1st 2018, 12:38:50.000 
> /hadoop_proxyserver.1.ftfomwpuh9pmpwm87aon122vl proxyserver 
> stderr | 2018-03-01 11:38:49,908 DEBUG http.headers: >> Accept: 
> text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
>  
> March 1st 2018, 12:38:50.000 
> /hadoop_proxyserver.1.ftfomwpuh9pmpwm87aon122vl proxyserver 
> stderr | 2018-03-01 

[jira] [Commented] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing

2018-03-01 Thread Gabor Somogyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382034#comment-16382034
 ] 

Gabor Somogyi commented on SPARK-19185:
---

The same problem exists in Structured Streaming. I'm creating a new PR with a 
similar workaround.
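For context, a commonly cited mitigation on the DStreams side is to disable the per-TopicPartition consumer cache so that overlapping window tasks do not share a single KafkaConsumer. A minimal sketch, assuming spark-streaming-kafka-0-10 on Spark 2.x and the cache setting named in its documentation:
{code:java}
// Sketch only: trade consumer caching for thread safety. Each task then creates its
// own KafkaConsumer instead of reusing one cached per TopicPartition.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("kafka-windowed-stream")
  .set("spark.streaming.kafka.consumer.cache.enabled", "false")

val ssc = new StreamingContext(conf, Seconds(10))
// ... build the direct stream and windowed operations as usual ...
{code}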

> ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
> -
>
> Key: SPARK-19185
> URL: https://issues.apache.org/jira/browse/SPARK-19185
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: Spark 2.0.2
> Spark Streaming Kafka 010
> Mesos 0.28.0 - client mode
> spark.executor.cores 1
> spark.mesos.extra.cores 1
>Reporter: Kalvin Chau
>Priority: Major
>  Labels: streaming, windowing
>
> We've been running into ConcurrentModificationExceptions "KafkaConsumer is 
> not safe for multi-threaded access" with the CachedKafkaConsumer. I've been 
> working through debugging this issue and after looking through some of the 
> spark source code I think this is a bug.
> Our set up is:
> Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using 
> Spark-Streaming-Kafka-010
> spark.executor.cores 1
> spark.mesos.extra.cores 1
> Batch interval: 10s, window interval: 180s, and slide interval: 30s
> We would see the exception when in one executor there are two task worker 
> threads assigned the same Topic+Partition, but a different set of offsets.
> They would both get the same CachedKafkaConsumer, and whichever task thread 
> went first would seek and poll for all the records, and at the same time the 
> second thread would try to seek to its offset but fail because it is unable 
> to acquire the lock.
> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y
> Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z
> Time1 E0 Task0 - Seeks and starts to poll
> Time1 E0 Task1 - Attempts to seek, but fails
> Here are some relevant logs:
> {code}
> 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394204414 -> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394238058 -> 4394257712
> 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394204414
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: 
> Initial fetch for spark-executor-consumer test-topic 2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Seeking to test-topic-2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting 
> block rdd_199_2 failed due to an exception
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block 
> rdd_199_2 could not be removed as it was not found on disk or in memory
> 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in 
> task 49.0 in stage 45.0 (TID 3201)
> java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
> multi-threaded access
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:360)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:951)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
>   at 
> 

[jira] [Created] (SPARK-23548) Redirect loop from Resourcemanager to Spark Webui

2018-03-01 Thread Dieter (JIRA)
Dieter created SPARK-23548:
--

 Summary: Redirect loop from Resourcemanager to Spark Webui
 Key: SPARK-23548
 URL: https://issues.apache.org/jira/browse/SPARK-23548
 Project: Spark
  Issue Type: Bug
  Components: Web UI, YARN
Affects Versions: 2.3.0
 Environment: *SW*

hadoop-3.0.0
Spark 2.3.0 (git revision a0d7949896) built for Hadoop 2.7.3

 

*Nodes*

jupyter
proxyserver
historyserver
resourcemanager
nodemanager ([http://9d581cfb391f:8042|http://9d581cfb391f:8042/])

 

*Config of sparkcontext*

spark.app.id=application_1519904229414_0011

spark.app.name=Spark shell

spark.driver.appUIAddress=http://jupyter:4040

spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/tmp/driver_log4j.properties

spark.driver.host=jupyter

spark.driver.port=35813

spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/tmp/driver_log4j.properties

spark.executor.id=driver

spark.home=/usr/local/spark

spark.jars=

spark.logConf=true

spark.master=yarn

spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS=proxyserver

spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES=[http://proxyserver:8089/proxy/application_1519904229414_0011]
 

 

*yarn-site.xml*


 
 
<configuration>
  <property><name>yarn.web-proxy.address</name><value>proxyserver:8089</value></property>
  <property><name>yarn.resourcemanager.fs.state-store.uri</name><value>/rmstate</value></property>
  <property><name>yarn.resourcemanager.webapp.address</name><value>resourcemanager:8088</value></property>
  <property><name>yarn.timeline-service.generic-application-history.enabled</name><value>true</value></property>
  <property><name>yarn.resourcemanager.recovery.enabled</name><value>true</value></property>
  <property><name>yarn.nodemanager.bind-host</name><value>0.0.0.0</value></property>
  <property><name>yarn.timeline-service.bind-host</name><value>0.0.0.0</value></property>
  <property><name>yarn.log-aggregation-enable</name><value>true</value></property>
  <property><name>yarn.timeline-service.enabled</name><value>true</value></property>
  <property><name>yarn.resourcemanager.store.class</name><value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value></property>
  <property><name>yarn.resourcemanager.system-metrics-publisher.enabled</name><value>true</value></property>
  <property><name>yarn.nodemanager.remote-app-log-dir</name><value>/app-logs</value></property>
  <property><name>yarn.nodemanager.vmem-pmem-ratio</name><value>4.1</value></property>
  <property><name>yarn.resourcemanager.resource-tracker.address</name><value>resourcemanager:8031</value></property>
  <property><name>yarn.resourcemanager.hostname</name><value>resourcemanager</value></property>
  <property><name>yarn.timeline-service.hostname</name><value>historyserver</value></property>
  <property><name>yarn.log.server.url</name><value>http://historyserver:8188/applicationhistory/logs/</value></property>
  <property><name>yarn.resourcemanager.bind-host</name><value>0.0.0.0</value></property>
  <property><name>yarn.resourcemanager.scheduler.address</name><value>resourcemanager:8030</value></property>
  <property><name>yarn.resourcemanager.address</name><value>resourcemanager:8032</value></property>
</configuration>
Reporter: Dieter


Access from the ResourceManager to the web UI in a distributed environment results 
in a redirect loop. It seems the web UI does not recognize that the request is 
coming from the ResourceManager proxy.

 

*Situation*

Spark application running on jupyter

 

*link in resourcemanger:8088 (behind Trackingui->Applicationmaster)*

[http://proxyserver:8089/proxy/application_1519904229414_0011/]

 

*Response of webui on jupyter:4040 (redirects to proxyserver)*

{{Moved: Content has moved <a href="http://proxyserver:8089/proxy/application_1519904229414_0011/">here</a>}}

 

*log output of proxyserver (results in redirect)*

March 1st 2018, 12:38:50.000 
/hadoop_proxyserver.1.ftfomwpuh9pmpwm87aon122vl proxyserver stderr 
| 2018-03-01 11:38:49,905 DEBUG strategy.ExecuteProduceConsume: EPC 
Prod/org.eclipse.jetty.io.ManagedSelector$SelectorProducer@5c4d391a producing   
  

March 1st 2018, 12:38:50.000 
/hadoop_proxyserver.1.ftfomwpuh9pmpwm87aon122vl proxyserver stderr 
| 2018-03-01 11:38:49,908 DEBUG protocol.RequestAuthCache: Auth cache not set 
in the context

March 1st 2018, 12:38:50.000 
/hadoop_proxyserver.1.ftfomwpuh9pmpwm87aon122vl proxyserver stderr 
| 2018-03-01 11:38:49,908 DEBUG http.wire: >> "[\r][\n]"

March 1st 2018, 12:38:50.000 
/hadoop_proxyserver.1.ftfomwpuh9pmpwm87aon122vl proxyserver stderr 
| 2018-03-01 11:38:49,908 DEBUG http.headers: >> GET / HTTP/1.1  

March 1st 2018, 12:38:50.000 
/hadoop_proxyserver.1.ftfomwpuh9pmpwm87aon122vl proxyserver stderr 
| 2018-03-01 11:38:49,908 DEBUG http.headers: >> Accept: 
text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
 

March 1st 2018, 12:38:50.000 
/hadoop_proxyserver.1.ftfomwpuh9pmpwm87aon122vl proxyserver stderr 
| 2018-03-01 11:38:49,908 DEBUG http.headers: >> Host: jupyter:4040  

March 1st 2018, 12:38:50.000 
/hadoop_proxyserver.1.ftfomwpuh9pmpwm87aon122vl proxyserver stderr 
| 2018-03-01 11:38:49,911 DEBUG http.headers: << Server: Jetty(9.3.z-SNAPSHOT)  


March 1st 2018, 12:38:50.000 
/hadoop_proxyserver.1.ftfomwpuh9pmpwm87aon122vl proxyserver stderr 
| 2018-03-01 11:38:49,911 DEBUG client.DefaultRedirectStrategy: Redirect 
requested to location 
'http://proxyserver:8089/proxy/application_1519904229414_0001/' 


[jira] [Updated] (SPARK-23547) Cleanup the .pipeout file when the Hive Session closed

2018-03-01 Thread zuotingbing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zuotingbing updated SPARK-23547:

Description: 
!2018-03-01_202415.png!

 

When the Hive session is closed, we should also clean up the .pipeout file.

 

 

  was:
When the Hive session is closed, we should also clean up the .pipeout file.

 

 


> Cleanup the .pipeout file when the Hive Session closed
> --
>
> Key: SPARK-23547
> URL: https://issues.apache.org/jira/browse/SPARK-23547
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: zuotingbing
>Priority: Major
> Attachments: 2018-03-01_202415.png
>
>
> !2018-03-01_202415.png!
>  
> When the Hive session is closed, we should also clean up the .pipeout file.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23547) Cleanup the .pipeout file when the Hive Session closed

2018-03-01 Thread zuotingbing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zuotingbing updated SPARK-23547:

Attachment: 2018-03-01_202415.png

> Cleanup the .pipeout file when the Hive Session closed
> --
>
> Key: SPARK-23547
> URL: https://issues.apache.org/jira/browse/SPARK-23547
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: zuotingbing
>Priority: Major
> Attachments: 2018-03-01_202415.png
>
>
> When the Hive session is closed, we should also clean up the .pipeout file.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23547) Cleanup the .pipeout file when the Hive Session closed

2018-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23547:


Assignee: (was: Apache Spark)

> Cleanup the .pipeout file when the Hive Session closed
> --
>
> Key: SPARK-23547
> URL: https://issues.apache.org/jira/browse/SPARK-23547
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: zuotingbing
>Priority: Major
>
> When the Hive session is closed, we should also clean up the .pipeout file.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23547) Cleanup the .pipeout file when the Hive Session closed

2018-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381910#comment-16381910
 ] 

Apache Spark commented on SPARK-23547:
--

User 'zuotingbing' has created a pull request for this issue:
https://github.com/apache/spark/pull/20702

> Cleanup the .pipeout file when the Hive Session closed
> --
>
> Key: SPARK-23547
> URL: https://issues.apache.org/jira/browse/SPARK-23547
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: zuotingbing
>Priority: Major
>
> When the Hive session is closed, we should also clean up the .pipeout file.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23547) Cleanup the .pipeout file when the Hive Session closed

2018-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23547:


Assignee: Apache Spark

> Cleanup the .pipeout file when the Hive Session closed
> --
>
> Key: SPARK-23547
> URL: https://issues.apache.org/jira/browse/SPARK-23547
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: zuotingbing
>Assignee: Apache Spark
>Priority: Major
>
> When the Hive session is closed, we should also clean up the .pipeout file.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23547) Cleanup the .pipeout file when the Hive Session closed

2018-03-01 Thread zuotingbing (JIRA)
zuotingbing created SPARK-23547:
---

 Summary: Cleanup the .pipeout file when the Hive Session closed
 Key: SPARK-23547
 URL: https://issues.apache.org/jira/browse/SPARK-23547
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: zuotingbing


When the Hive session is closed, we should also clean up the .pipeout file.

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23520) Add support for MapType fields in JSON schema inference

2018-03-01 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381883#comment-16381883
 ] 

Hyukjin Kwon commented on SPARK-23520:
--

Is it roughly a duplicate of SPARK-21651?

> Add support for MapType fields in JSON schema inference
> ---
>
> Key: SPARK-23520
> URL: https://issues.apache.org/jira/browse/SPARK-23520
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: David Courtinot
>Priority: Major
>
> _InferSchema_ currently does not support inferring _MapType_ fields from JSON 
> data, and for a good reason: they are indistinguishable from structs in JSON 
> format. In issue 
> [SPARK-23494|https://issues.apache.org/jira/browse/SPARK-23494], I proposed 
> to expose some methods of _InferSchema_ to users so that they can build on 
> top of the inference primitives defined by this class. In this issue, I'm 
> proposing to add more control to the user by letting them specify a set of 
> fields that should be forced as _MapType._
> *Use-case*
> Some JSON datasets contain high-cardinality fields, namely fields whose key 
> space is very large. These fields shouldn't be interpreted as _StructType_ 
> for the following reasons:
>  * it's not really what they are. The key space as well as the value space 
> may both be infinite, so what best defines the schema of this data is the 
> type of the keys and the type of the values, not a struct containing all 
> possible key-value pairs.
>  * interpreting high-cardinality fields as structs can lead to enormous 
> schemata that don't even fit into memory.
> *Proposition*
> We would add a public overloaded signature for _InferSchema.inferField_ which 
> allows passing a set of field accessors (a class that can represent access to 
> any JSON field, including nested ones) for which we do not want to recurse and 
> instead force a schema. That would allow, in particular, asking that a few 
> fields be inferred as maps rather than structs.
> I am very open to discussing this with people who are more well-versed in the 
> Spark codebase than I am, because I realize my proposition can feel somewhat 
> patchy. I'll be more than happy to provide some development effort if we 
> manage to sketch a reasonably easy solution.
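To make the gap concrete, a small sketch (my own illustration, assuming a SparkSession named `spark`): today the only way to get a _MapType_ for such a field is to bypass inference and hand Spark the full schema, which is what the proposed inference hook would make unnecessary.
{code:java}
// Sketch only: default inference turns `counters` into a struct with one field per
// distinct key; forcing a map currently requires spelling out the whole schema.
import spark.implicits._
import org.apache.spark.sql.types._

val json = Seq("""{"id": 1, "counters": {"page_a": 3, "page_b": 7}}""").toDS()

spark.read.json(json).printSchema()          // counters: struct<page_a:bigint,page_b:bigint>

val withMap = StructType(Seq(
  StructField("id", LongType),
  StructField("counters", MapType(StringType, LongType))
))
spark.read.schema(withMap).json(json).printSchema()  // counters: map<string,bigint>
{code}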



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23528) Expose vital statistics of GaussianMixtureModel

2018-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23528:


Assignee: (was: Apache Spark)

> Expose vital statistics of GaussianMixtureModel
> ---
>
> Key: SPARK-23528
> URL: https://issues.apache.org/jira/browse/SPARK-23528
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Erich Schubert
>Priority: Minor
>
> Spark ML should expose vital statistics of the GMM model:
>  * *Number of iterations* (actual, not max) until the tolerance threshold was 
> hit: we can set a maximum, but how do we know the limit was large enough, and 
> how many iterations it really took?
>  * Final *log likelihood* of the model: if we run multiple times with 
> different starting conditions, how do we know which run converged to the 
> better fit?
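A hedged sketch of how the final log likelihood can be recovered today (my own illustration, not part of this ticket; it assumes a fitted `model: GaussianMixtureModel`, a DataFrame `data` with a "features" vector column, and that collecting a sample to the driver is acceptable; it does not help with the iteration count):
{code:java}
// Sketch only: log L = sum over x of log( sum over k of w_k * N(x | mu_k, Sigma_k) ),
// computed from the public weights and component Gaussians of the fitted model.
import org.apache.spark.ml.linalg.Vector

val weights   = model.weights      // Array[Double]
val gaussians = model.gaussians    // component distributions, each exposing pdf(x)

val sample = data.select("features").limit(10000).collect().map(_.getAs[Vector](0))
val logLikelihood = sample.map { x =>
  math.log(weights.zip(gaussians).map { case (w, g) => w * g.pdf(x) }.sum)
}.sum

println(s"approximate final log likelihood on the sample: $logLikelihood")
{code}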



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


