[jira] [Updated] (SPARK-21804) json_tuple returns null values within repeated columns except the first one

2019-10-17 Thread Sameer Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-21804:
---
Affects Version/s: (was: 2.2.0)
   2.0.0

> json_tuple returns null values within repeated columns except the first one
> ---
>
> Key: SPARK-21804
> URL: https://issues.apache.org/jira/browse/SPARK-21804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jen-Ming Chung
>Assignee: Jen-Ming Chung
>Priority: Minor
>  Labels: starter
> Fix For: 2.3.0
>
>
> I was testing json_tuple for extracting values from JSON and found that it 
> returns null for repeated columns, except for the first occurrence, as shown 
> below:
> {code:language=scala}
> scala> spark.sql("""SELECT json_tuple('{"a":1, "b":2}', 'a', 'b', 
> 'a')""").show()
> +---+---+----+
> | c0| c1|  c2|
> +---+---+----+
> |  1|  2|null|
> +---+---+----+
> {code}
> I think this should be consistent with Hive's implementation:
> {code:language=scala}
> hive> SELECT json_tuple('{"a": 1, "b": 2}', 'a', 'a');
> ...
> 1    1
> {code}
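As a point of reference while the fix is pending, get_json_object evaluates each 
JSON path independently, so repeated fields survive; a minimal workaround sketch 
(output illustrative, written against Spark 2.2):

{code:language=scala}
// Workaround sketch: get_json_object evaluates each path independently,
// so the repeated field 'a' is returned for both c0 and c2.
spark.sql("""
  SELECT get_json_object('{"a":1, "b":2}', '$.a') AS c0,
         get_json_object('{"a":1, "b":2}', '$.b') AS c1,
         get_json_object('{"a":1, "b":2}', '$.a') AS c2
""").show()
// +---+---+---+
// | c0| c1| c2|
// +---+---+---+
// |  1|  2|  1|
// +---+---+---+
{code}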






[jira] [Resolved] (SPARK-23470) org.apache.spark.ui.jobs.ApiHelper.lastStageNameAndDescription is too slow

2018-02-20 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal resolved SPARK-23470.

   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 2.3.0

Issue resolved by https://github.com/apache/spark/pull/20644

> org.apache.spark.ui.jobs.ApiHelper.lastStageNameAndDescription is too slow
> --
>
> Key: SPARK-23470
> URL: https://issues.apache.org/jira/browse/SPARK-23470
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Marcelo Vanzin
>Priority: Blocker
> Fix For: 2.3.0
>
>
> I was testing 2.3.0 RC3 and found that it's easy to hit a "read timeout" when 
> accessing the All Jobs page. The stack dump shows it was running 
> "org.apache.spark.ui.jobs.ApiHelper.lastStageNameAndDescription".
> {code}
> "SparkUI-59" #59 daemon prio=5 os_prio=0 tid=0x7fc15b0a3000 nid=0x8dc 
> runnable [0x7fc0ce9f8000]
>java.lang.Thread.State: RUNNABLE
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.util.kvstore.KVTypeInfo$MethodAccessor.get(KVTypeInfo.java:154)
>   at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.compare(InMemoryStore.java:248)
>   at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.lambda$iterator$2(InMemoryStore.java:214)
>   at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryView$$Lambda$36/1834982692.compare(Unknown
>  Source)
>   at java.util.TimSort.binarySort(TimSort.java:296)
>   at java.util.TimSort.sort(TimSort.java:239)
>   at java.util.Arrays.sort(Arrays.java:1512)
>   at java.util.ArrayList.sort(ArrayList.java:1460)
>   at java.util.stream.SortedOps$RefSortingSink.end(SortedOps.java:387)
>   at java.util.stream.Sink$ChainedReference.end(Sink.java:258)
>   at 
> java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.fillBuffer(StreamSpliterators.java:210)
>   at 
> java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.doAdvance(StreamSpliterators.java:161)
>   at 
> java.util.stream.StreamSpliterators$WrappingSpliterator.tryAdvance(StreamSpliterators.java:300)
>   at java.util.Spliterators$1Adapter.hasNext(Spliterators.java:681)
>   at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryIterator.hasNext(InMemoryStore.java:278)
>   at 
> org.apache.spark.status.AppStatusStore.lastStageAttempt(AppStatusStore.scala:101)
>   at 
> org.apache.spark.ui.jobs.ApiHelper$$anonfun$38.apply(StagePage.scala:1014)
>   at 
> org.apache.spark.ui.jobs.ApiHelper$$anonfun$38.apply(StagePage.scala:1014)
>   at 
> org.apache.spark.status.AppStatusStore.asOption(AppStatusStore.scala:408)
>   at 
> org.apache.spark.ui.jobs.ApiHelper$.lastStageNameAndDescription(StagePage.scala:1014)
>   at 
> org.apache.spark.ui.jobs.JobDataSource.org$apache$spark$ui$jobs$JobDataSource$$jobRow(AllJobsPage.scala:434)
>   at 
> org.apache.spark.ui.jobs.JobDataSource$$anonfun$24.apply(AllJobsPage.scala:412)
>   at 
> org.apache.spark.ui.jobs.JobDataSource$$anonfun$24.apply(AllJobsPage.scala:412)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
>   at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at org.apache.spark.ui.jobs.JobDataSource.<init>(AllJobsPage.scala:412)
>   at org.apache.spark.ui.jobs.JobPagedTable.<init>(AllJobsPage.scala:504)
>   at org.apache.spark.ui.jobs.AllJobsPage.jobsTable(AllJobsPage.scala:246)
>   at org.apache.spark.ui.jobs.AllJobsPage.render(AllJobsPage.scala:295)
>   at org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:98)
>   at org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:98)
>   at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>   at 
> org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>   at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584)
>   ...
> {code}
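The hot frames above are a reflective getter ({{Method.invoke}} via 
{{KVTypeInfo$MethodAccessor}}) being called once per comparison inside TimSort. 
A standalone sketch of why that pattern is slow, and the usual remedy of 
extracting keys once per element (illustrative only, not the actual patch):

{code:language=scala}
import java.lang.reflect.Method

case class Job(name: String)

val getter: Method = classOf[Job].getMethod("name")
val jobs: Seq[Job] = (1 to 100000).map(i => Job(s"job-$i"))

// Slow: sortBy applies the key function at every comparison, so this pays
// a Method.invoke per compare, i.e. O(n log n) reflective calls.
val slow = jobs.sortBy(j => getter.invoke(j).asInstanceOf[String])

// Faster: invoke reflection once per element, then sort the precomputed keys.
val fast = jobs.map(j => (getter.invoke(j).asInstanceOf[String], j))
               .sortBy(_._1)
               .map(_._2)
{code}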

[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-02-15 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366537#comment-16366537
 ] 

Sameer Agarwal commented on SPARK-23410:


[~maxgekk] [~smilegator] any ETA on this? As [~hyukjin.kwon] points out, given 
that https://github.com/apache/spark/pull/20302 is reverted, should we still 
block RC4 on this?

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Blocker
> Attachments: utf16WithBOM.json
>
>
> Currently the JSON parser is forced to read JSON files in UTF-8. This 
> behavior breaks backward compatibility with Spark 2.2.1 and earlier versions, 
> which could read JSON files in UTF-16, UTF-32 and other encodings thanks to 
> the Jackson library's charset auto-detection. We need to give users back the 
> ability to read JSON files in a specified charset and/or to detect the 
> charset automatically, as before.
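For reference, explicit charset control was eventually restored via a JSON 
{{encoding}} option added later (Spark 2.4); a sketch assuming that option is 
available, so it does not apply to the 2.3.0 build under discussion:

{code:language=scala}
// Sketch: reading UTF-16 JSON with an explicit charset. The "encoding" option
// was added after this thread (Spark 2.4); on 2.3.0 it does not exist, which
// is exactly the regression described above.
val df = spark.read
  .option("encoding", "UTF-16")  // charset of the input files
  .option("multiLine", true)     // UTF-16/32 without an explicit lineSep typically needs multiLine
  .json("/path/to/utf16WithBOM.json")
df.show()
{code}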






[jira] [Updated] (SPARK-23292) python tests related to pandas are skipped with python 2

2018-02-13 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-23292:
---
Summary: python tests related to pandas are skipped with python 2  (was: 
python tests related to pandas are skipped)

> python tests related to pandas are skipped with python 2
> 
>
> Key: SPARK-23292
> URL: https://issues.apache.org/jira/browse/SPARK-23292
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Yin Huai
>Priority: Critical
>
> I was running the Python tests and found that 
> [pyspark.sql.tests.GroupbyAggPandasUDFTests.test_unsupported_types|https://github.com/apache/spark/blob/52e00f70663a87b5837235bdf72a3e6f84e11411/python/pyspark/sql/tests.py#L4528-L4548]
>  does not run with Python 2 because the test uses "assertRaisesRegex" 
> (supported by Python 3) instead of "assertRaisesRegexp" (supported by Python 
> 2). However, Spark's Jenkins does not fail because of this issue (see the run 
> history 
> [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/]).
>  After looking into this, it [seems the test script skips pandas-related tests 
> if pandas is not 
> installed|https://github.com/apache/spark/blob/2ac895be909de7e58e1051dc2a1bba98a25bf4be/python/pyspark/sql/tests.py#L51-L63],
>  which means that Jenkins does not have pandas installed. 
>  
> Since the pyarrow-related tests have the same skipping logic, we also need to 
> check whether Jenkins has pyarrow installed correctly. 
>  
> Since the features using pandas and pyarrow ship in 2.3, we should fix this 
> test issue and make sure all tests pass before we make the release.






[jira] [Commented] (SPARK-23388) Support for Parquet Binary DecimalType in VectorizedColumnReader

2018-02-13 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362730#comment-16362730
 ] 

Sameer Agarwal commented on SPARK-23388:


yes, I agree

> Support for Parquet Binary DecimalType in VectorizedColumnReader
> 
>
> Key: SPARK-23388
> URL: https://issues.apache.org/jira/browse/SPARK-23388
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: James Thompson
>Assignee: James Thompson
>Priority: Major
> Fix For: 2.3.1
>
>
> The following commit to Spark removed support for binary decimal types: 
> [https://github.com/apache/spark/commit/9c29c557635caf739fde942f53255273aac0d7b1#diff-7bdf5fd0ce0b1ccbf4ecf083611976e6R428]
> As per the Parquet spec, decimal can be used to annotate binary types, so 
> support should be re-added: 
> [https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal]






[jira] [Comment Edited] (SPARK-23390) Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7

2018-02-11 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16360225#comment-16360225
 ] 

Sameer Agarwal edited comment on SPARK-23390 at 2/12/18 2:41 AM:
-

I ran this test locally 50 times and it passed every time. Therefore, I'm 
currently not marking this as a release blocker as this could just be an 
artifact of our test environment (possibly due to the order in which tests are 
run).

Also, cc [~smilegator] [~cloud_fan]


was (Author: sameerag):
I ran this test locally 50 times and it passed every time. Therefore, I'm 
currently not marking this as a release blocker as this could just be an 
artifact of our test environment (possibly due to the order in which tests are 
run).

Also, cc [~LI,Xiao] [~cloud_fan]

> Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7
> --
>
> Key: SPARK-23390
> URL: https://issues.apache.org/jira/browse/SPARK-23390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Priority: Major
>
> We're seeing multiple failures in {{FileBasedDataSourceSuite}} in 
> {{spark-branch-2.3-test-sbt-hadoop-2.7}}:
> {code}
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 
> 10.01215805999 seconds. Last failure message: There are 1 possibly leaked 
> file streams..
> {code}
> Here's the full history: 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/
> From a very quick look, these failures seem to be correlated with 
> https://github.com/apache/spark/pull/20479 (cc [~dongjoon]) as evident from 
> the following stack trace (full logs 
> [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]):
>  
> {code}
> [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
> 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
> 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem 
> connection created at:
> java.lang.Throwable
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
>   at 
> org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:254)
>   at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138)
> {code}
> Also, while this might just be a false correlation, the frequency of these 
> test failures has increased considerably in 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/
>  after https://github.com/apache/spark/pull/20562 (cc 
> [~feng...@databricks.com]) was merged.
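For context, the "possibly leaked file streams" message comes from Spark's test 
harness: {{DebugFilesystem}} records a stack trace for every stream it opens and 
drops it on close, and the shared test session polls after each test until 
nothing remains open. A rough sketch of that check (names from Spark's test 
code, not verified against this exact branch):

{code:language=scala}
import org.apache.spark.DebugFilesystem
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

// After each test: wait up to 10 seconds for all tracked streams to be closed;
// if any remain, fail with the "possibly leaked file streams" message above.
eventually(timeout(10.seconds)) {
  DebugFilesystem.assertNoOpenStreams()
}
{code}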






[jira] [Commented] (SPARK-23390) Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7

2018-02-11 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16360225#comment-16360225
 ] 

Sameer Agarwal commented on SPARK-23390:


I ran this test locally 50 times and it passed every time. Therefore, I'm 
currently not marking this as a release blocker as this could just be an 
artifact of our test environment (possibly due to the order in which tests are 
run).

Also, cc [~LI,Xiao] [~cloud_fan]

> Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7
> --
>
> Key: SPARK-23390
> URL: https://issues.apache.org/jira/browse/SPARK-23390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Priority: Major
>
> We're seeing multiple failures in {{FileBasedDataSourceSuite}} in 
> {{spark-branch-2.3-test-sbt-hadoop-2.7}}:
> {code}
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 
> 10.01215805999 seconds. Last failure message: There are 1 possibly leaked 
> file streams..
> {code}
> Here's the full history: 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/
> From a very quick look, these failures seem to be correlated with 
> https://github.com/apache/spark/pull/20479 (cc [~dongjoon]) as evident from 
> the following stack trace (full logs 
> [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]):
>  
> {code}
> [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
> 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
> 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem 
> connection created at:
> java.lang.Throwable
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
>   at 
> org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:254)
>   at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138)
> {code}
> Also, while this might just be a false correlation, the frequency of these 
> test failures has increased considerably in 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/
>  after https://github.com/apache/spark/pull/20562 (cc 
> [~feng...@databricks.com]) was merged.






[jira] [Created] (SPARK-23390) Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7

2018-02-11 Thread Sameer Agarwal (JIRA)
Sameer Agarwal created SPARK-23390:
--

 Summary: Flaky Test Suite: FileBasedDataSourceSuite in Spark 
2.3/hadoop 2.7
 Key: SPARK-23390
 URL: https://issues.apache.org/jira/browse/SPARK-23390
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Sameer Agarwal


We're seeing multiple failures in {{FileBasedDataSourceSuite}} in 
{{spark-branch-2.3-test-sbt-hadoop-2.7}}:

{code}
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
eventually never returned normally. Attempted 15 times over 10.01215805999 
seconds. Last failure message: There are 1 possibly leaked file streams..
{code}

Here's the full history: 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/

From a very quick look, these failures seem to be correlated with 
https://github.com/apache/spark/pull/20479 (cc [~dongjoon]) as evident from 
the following stack trace (full logs 
[here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]):

{code}
[info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem 
connection created at:
java.lang.Throwable
at 
org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
at 
org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
at 
org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:254)
at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
at 
org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138)
{code}

Also, while this might just be a false correlation, the frequency of these 
test failures has increased considerably in 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/
 after https://github.com/apache/spark/pull/20562 (cc 
[~feng...@databricks.com]) was merged.







[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse than Spark 2.2

2018-02-08 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357363#comment-16357363
 ] 

Sameer Agarwal commented on SPARK-23309:


Thanks, I'll then go ahead and downgrade the priority for now to unblock RC3. 
Please feel free to -1 the RC if there's a repro.

> Spark 2.3 cached query performance 20-30% worse than Spark 2.2
> --
>
> Key: SPARK-23309
> URL: https://issues.apache.org/jira/browse/SPARK-23309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Blocker
>
> I was testing Spark 2.3 RC2 and I am seeing a performance regression in SQL 
> queries on cached data.
> The size of the data: 10.4 GB input from Hive ORC files / 18.8 GB cached / 
> 5592 partitions.
> Here is the example query:
> val dailycached = spark.sql("select something from table where dt = 
> '20170301' AND something IS NOT NULL")
> dailycached.createOrReplaceTempView("dailycached") 
> spark.catalog.cacheTable("dailyCached")
> spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show()
>  
> On Spark 2.2, query times average 13 seconds.
> On the same nodes, Spark 2.3 query times average 17 seconds.
> Note these are times for queries after the initial caching, i.e. just running 
> the last line again: 
> spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show() 
> multiple times.
>  
> I also ran a query over more data (335 GB input / 587.5 GB cached) and saw a 
> similar discrepancy in the performance of querying cached data between Spark 
> 2.3 and Spark 2.2, where 2.2 was faster by roughly 20%.






[jira] [Updated] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse than Spark 2.2

2018-02-08 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-23309:
---
Priority: Major  (was: Blocker)

> Spark 2.3 cached query performance 20-30% worse than Spark 2.2
> --
>
> Key: SPARK-23309
> URL: https://issues.apache.org/jira/browse/SPARK-23309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Major
>
> I was testing Spark 2.3 RC2 and I am seeing a performance regression in SQL 
> queries on cached data.
> The size of the data: 10.4 GB input from Hive ORC files / 18.8 GB cached / 
> 5592 partitions.
> Here is the example query:
> val dailycached = spark.sql("select something from table where dt = 
> '20170301' AND something IS NOT NULL")
> dailycached.createOrReplaceTempView("dailycached") 
> spark.catalog.cacheTable("dailyCached")
> spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show()
>  
> On Spark 2.2, query times average 13 seconds.
> On the same nodes, Spark 2.3 query times average 17 seconds.
> Note these are times for queries after the initial caching, i.e. just running 
> the last line again: 
> spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show() 
> multiple times.
>  
> I also ran a query over more data (335 GB input / 587.5 GB cached) and saw a 
> similar discrepancy in the performance of querying cached data between Spark 
> 2.3 and Spark 2.2, where 2.2 was faster by roughly 20%.






[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse than Spark 2.2

2018-02-08 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357344#comment-16357344
 ] 

Sameer Agarwal commented on SPARK-23309:


[~tgraves] [~smilegator] [~cloud_fan] – any advice here? If we'd like this to 
continue to block the release, it'd be good to have a repro.

> Spark 2.3 cached query performance 20-30% worse than Spark 2.2
> --
>
> Key: SPARK-23309
> URL: https://issues.apache.org/jira/browse/SPARK-23309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Blocker
>
> I was testing Spark 2.3 RC2 and I am seeing a performance regression in SQL 
> queries on cached data.
> The size of the data: 10.4 GB input from Hive ORC files / 18.8 GB cached / 
> 5592 partitions.
> Here is the example query:
> val dailycached = spark.sql("select something from table where dt = 
> '20170301' AND something IS NOT NULL")
> dailycached.createOrReplaceTempView("dailycached") 
> spark.catalog.cacheTable("dailyCached")
> spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show()
>  
> On Spark 2.2, query times average 13 seconds.
> On the same nodes, Spark 2.3 query times average 17 seconds.
> Note these are times for queries after the initial caching, i.e. just running 
> the last line again: 
> spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show() 
> multiple times.
>  
> I also ran a query over more data (335 GB input / 587.5 GB cached) and saw a 
> similar discrepancy in the performance of querying cached data between Spark 
> 2.3 and Spark 2.2, where 2.2 was faster by roughly 20%.






[jira] [Commented] (SPARK-23348) append data using saveAsTable should adjust the data types

2018-02-07 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356176#comment-16356176
 ] 

Sameer Agarwal commented on SPARK-23348:


yes, +1

> append data using saveAsTable should adjust the data types
> --
>
> Key: SPARK-23348
> URL: https://issues.apache.org/jira/browse/SPARK-23348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>
>  
> {code:java}
> Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t")
> Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t")
> scala> sql("select * from t").show
> {code}
>  
> This query will fail with a strange error:
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 
> (TID 15, localhost, executor driver): 
> java.lang.UnsupportedOperationException: Unimplemented type: IntegerType
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:473)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261)
> ...
> {code}
>  
> All Spark 2.x versions behave the same. On Spark 1.6.3:
> {code}
> scala> sql("select * from tx").show
> +----+---+
> |   i|  j|
> +----+---+
> |null|  3|
> |   1|  a|
> +----+---+
> {code}
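Until saveAsTable adjusts the types itself, a defensive workaround is to cast 
the incoming DataFrame to the target table's schema before appending; a sketch 
({{castToTableSchema}} is a hypothetical helper, not Spark API). The mismatched 
value then becomes null, matching the 1.6.3 output above:

{code:language=scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: align a DataFrame with the target table's schema
// (column order, names and types) before appending to it.
def castToTableSchema(df: DataFrame, table: String): DataFrame = {
  val target = df.sparkSession.table(table).schema
  df.select(target.map(f => col(f.name).cast(f.dataType).as(f.name)): _*)
}

// In spark-shell; in an application, import spark.implicits._ first for toDF.
castToTableSchema(Seq("c" -> 3).toDF("i", "j"), "t")
  .write.mode("append").saveAsTable("t")
{code}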






[jira] [Updated] (SPARK-23334) Fix pandas_udf with return type StringType() to handle str type properly in Python 2.

2018-02-05 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-23334:
---
Priority: Blocker  (was: Major)

> Fix pandas_udf with return type StringType() to handle str type properly in 
> Python 2.
> -
>
> Key: SPARK-23334
> URL: https://issues.apache.org/jira/browse/SPARK-23334
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Takuya Ueshin
>Priority: Blocker
>
> In Python 2, when a pandas_udf returns string values created in the UDF as 
> {{str}} (e.g. with {{".."}}), the execution fails. E.g.,
> {code:java}
> from pyspark.sql.functions import pandas_udf, col
> import pandas as pd
> df = spark.range(10)
> str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), "string")
> df.select(str_f(col('id'))).show()
> {code}
> raises the following exception:
> {code}
> ...
> java.lang.AssertionError: assertion failed: Invalid schema from pandas_udf: 
> expected StringType, got BinaryType
>   at scala.Predef$.assert(Predef.scala:170)
>   at 
> org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:93)
> ...
> {code}
> It seems pyarrow ignores the {{type}} parameter of {{pa.Array.from_pandas()}} 
> and treats the values as binary when the declared type is string but the 
> values are {{str}} instead of {{unicode}} in Python 2.






[jira] [Resolved] (SPARK-23310) Perf regression introduced by SPARK-21113

2018-02-05 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal resolved SPARK-23310.

   Resolution: Done
 Assignee: Sital Kedia
Fix Version/s: 2.3.0

Issue resolved by pull request 20492 https://github.com/apache/spark/pull/20492

> Perf regression introduced by SPARK-21113
> -
>
> Key: SPARK-23310
> URL: https://issues.apache.org/jira/browse/SPARK-23310
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Yin Huai
>Assignee: Sital Kedia
>Priority: Blocker
> Fix For: 2.3.0
>
>
> While running all TPC-DS queries with SF set to 1000, we noticed that Q95 
> (https://github.com/databricks/spark-sql-perf/blob/master/src/main/resources/tpcds_2_4/q95.sql)
>  has a noticeable regression (11%). After looking into it, we found that the 
> regression was introduced by SPARK-21113. Specifically, ReadAheadInputStream 
> suffers from lock contention. After setting 
> spark.unsafe.sorter.spill.read.ahead.enabled to false, the regression 
> disappears and the overall performance of all TPC-DS queries improves.
>  
> I am proposing that we set spark.unsafe.sorter.spill.read.ahead.enabled to 
> false by default for Spark 2.3 and re-enable it after addressing the lock 
> contention issue.
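No code change is needed to apply the proposed default; the flag can be flipped 
per application while the contention fix is pending, e.g. at session startup (a 
sketch; the config key is taken from the description above):

{code:language=scala}
import org.apache.spark.sql.SparkSession

// Sketch: disable spill read-ahead for one application, per the proposal above.
val spark = SparkSession.builder()
  .config("spark.unsafe.sorter.spill.read.ahead.enabled", "false")
  .getOrCreate()
{code}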






[jira] [Updated] (SPARK-23330) Spark UI SQL executions page throws NPE

2018-02-04 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-23330:
---
Description: 
Spark UI SQL executions page throws the following error and the page crashes:

{code}
 HTTP ERROR 500
 Problem accessing /SQL/. Reason:

Server Error
 Caused by:
 java.lang.NullPointerException
 at scala.collection.immutable.StringOps$.length$extension(StringOps.scala:47)
 at scala.collection.immutable.StringOps.length(StringOps.scala:47)
 at 
scala.collection.IndexedSeqOptimized$class.isEmpty(IndexedSeqOptimized.scala:27)
 at scala.collection.immutable.StringOps.isEmpty(StringOps.scala:29)
 at scala.collection.TraversableOnce$class.nonEmpty(TraversableOnce.scala:111)
 at scala.collection.immutable.StringOps.nonEmpty(StringOps.scala:29)
 at 
org.apache.spark.sql.execution.ui.ExecutionTable.descriptionCell(AllExecutionsPage.scala:182)
 at 
org.apache.spark.sql.execution.ui.ExecutionTable.row(AllExecutionsPage.scala:155)
 at 
org.apache.spark.sql.execution.ui.ExecutionTable$$anonfun$8.apply(AllExecutionsPage.scala:204)
 at 
org.apache.spark.sql.execution.ui.ExecutionTable$$anonfun$8.apply(AllExecutionsPage.scala:204)
 at org.apache.spark.ui.UIUtils$$anonfun$listingTable$2.apply(UIUtils.scala:339)
 at org.apache.spark.ui.UIUtils$$anonfun$listingTable$2.apply(UIUtils.scala:339)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
 at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
 at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
 at scala.collection.AbstractTraversable.map(Traversable.scala:104)
 at org.apache.spark.ui.UIUtils$.listingTable(UIUtils.scala:339)
 at 
org.apache.spark.sql.execution.ui.ExecutionTable.toNodeSeq(AllExecutionsPage.scala:203)
 at 
org.apache.spark.sql.execution.ui.AllExecutionsPage.render(AllExecutionsPage.scala:67)
 at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82)
 at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82)
 at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
 at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
 at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584)
 at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
 at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
 at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
 at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
 at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
 at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
 at org.eclipse.jetty.server.Server.handle(Server.java:534)
 at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
 at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
 at 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
 at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
 at 
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
 at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
 at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
 at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
 at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
 at 
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
 at java.lang.Thread.run(Thread.java:748)
{code}

 It seems the bug was introduced by 
[https://github.com/apache/spark/pull/19681/files#diff-a74d84702d8d47d5269e96740a55a3caR63]
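The trace bottoms out in {{StringOps.length}}, i.e. a null description string 
being used as a Scala String. A minimal sketch of the defensive pattern 
(illustrative only, not the actual patch):

{code:language=scala}
// Illustrative: wrap the possibly-null description in Option before calling
// String methods on it, rather than assuming it is non-null.
def descriptionCell(description: String): String =
  Option(description).filter(_.nonEmpty).getOrElse("(no description)")
{code}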

  was:
Spark UI SQL executions page throws the following error and the page crashes:
```
HTTP ERROR 500
Problem accessing /SQL/. Reason:

Server Error
Caused by:
java.lang.NullPointerException
at 
scala.collection.immutable.StringOps$.length$extension(StringOps.scala:47)
at scala.collection.immutable.StringOps.length(StringOps.scala:47)
at 
scala.collection.IndexedSeqOptimized$class.isEmpty(IndexedSeqOptimized.scala:27)
at scala.collection.immutable.StringOps.isEmpty(StringOps.scala:29)
at 
scala.collection.TraversableOnce$class.nonEmpty(TraversableOnce.scala:111)
at scala.collection.immutable.StringOps.nonEmpty(StringOps.scala:29)
...
```

[jira] [Commented] (SPARK-23324) Announce new Kubernetes back-end for 2.3 release notes

2018-02-02 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16351060#comment-16351060
 ] 

Sameer Agarwal commented on SPARK-23324:


 

Thanks [~eje], this is definitely going to be a major highlight for 2.3.

I'll soon send out a Google Doc link to the dev list so that we all can 
contribute to/review the notes before they go out.

> Announce new Kubernetes back-end for 2.3 release notes
> --
>
> Key: SPARK-23324
> URL: https://issues.apache.org/jira/browse/SPARK-23324
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Kubernetes
>Affects Versions: 2.3.0
>Reporter: Erik Erlandson
>Priority: Major
>  Labels: documentation, kubernetes, releasenotes
>
> This is an issue to request that the new Kubernetes scheduler back-end gets 
> called out in the 2.3 release notes, as it is a prominent new feature.






[jira] [Commented] (SPARK-23310) Perf regression introduced by SPARK-21113

2018-02-02 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16351048#comment-16351048
 ] 

Sameer Agarwal commented on SPARK-23310:


[~sitalke...@gmail.com] it'd be great if you could create the patch.

 

cc [~npoggi] [~juliuszsompolski] who can give more context on the bug they 
found.

> Perf regression introduced by SPARK-21113
> -
>
> Key: SPARK-23310
> URL: https://issues.apache.org/jira/browse/SPARK-23310
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Yin Huai
>Priority: Blocker
>
> While running all TPC-DS queries with SF set to 1000, we noticed that Q95 
> (https://github.com/databricks/spark-sql-perf/blob/master/src/main/resources/tpcds_2_4/q95.sql)
>  has a noticeable regression (11%). After looking into it, we found that the 
> regression was introduced by SPARK-21113. Specifically, ReadAheadInputStream 
> suffers from lock contention. After setting 
> spark.unsafe.sorter.spill.read.ahead.enabled to false, the regression 
> disappears and the overall performance of all TPC-DS queries improves.
>  
> I am proposing that we set spark.unsafe.sorter.spill.read.ahead.enabled to 
> false by default for Spark 2.3 and re-enable it after addressing the lock 
> contention issue.






[jira] [Commented] (SPARK-23290) inadvertent change in handling of DateType when converting to pandas dataframe

2018-02-02 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350860#comment-16350860
 ] 

Sameer Agarwal commented on SPARK-23290:


[~amenck] [~aash] any updates here?

> inadvertent change in handling of DateType when converting to pandas dataframe
> --
>
> Key: SPARK-23290
> URL: https://issues.apache.org/jira/browse/SPARK-23290
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Andre Menck
>Priority: Blocker
>
> In [this 
> PR|https://github.com/apache/spark/pull/18664/files#diff-6fc344560230bf0ef711bb9b5573f1faR1968]
>  there was a change in how `DateType` is returned to users (line 1968 in 
> dataframe.py). This can cause client code to fail, as in the following 
> example from a Python terminal:
> {code:python}
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> date    object
> num      int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> 0    2015-01-01
> Name: date, dtype: object
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> date    object
> num      int64
> dtype: object
> >>> pdf['date'] = pd.to_datetime(pdf['date'])
> >>> pdf.dtypes
> date    datetime64[ns]
> num              int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/Users/amenck/anaconda2/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2355, in apply
>     mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/_libs/src/inference.pyx", line 1574, in 
> pandas._libs.lib.map_infer
>   File "<stdin>", line 1, in <lambda>
> TypeError: strptime() argument 1 must be string, not Timestamp
> >>> 
> {code}
> Above we show both the old behavior (returning an "object" column) and the new 
> behavior (returning a datetime column). Since there may be user code relying 
> on the old behavior, I'd suggest reverting this specific part of the change. 
> Also note that the NOTE in the docstring for "_to_corrected_pandas_type" 
> seems to be off, referring to the old behavior rather than the current one.






[jira] [Resolved] (SPARK-23293) data source v2 self join fails

2018-02-01 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal resolved SPARK-23293.

   Resolution: Fixed
Fix Version/s: 2.3.0

> data source v2 self join fails
> --
>
> Key: SPARK-23293
> URL: https://issues.apache.org/jira/browse/SPARK-23293
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.3.0
>
>







[jira] [Updated] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse than Spark 2.2

2018-02-01 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-23309:
---
Target Version/s: 2.3.0

> Spark 2.3 cached query performance 20-30% worse than Spark 2.2
> --
>
> Key: SPARK-23309
> URL: https://issues.apache.org/jira/browse/SPARK-23309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Blocker
>
> I was testing Spark 2.3 RC2 and I am seeing a performance regression in SQL 
> queries on cached data.
> The size of the data: 10.4 GB input from Hive ORC files / 18.8 GB cached / 
> 5592 partitions.
> Here is the example query:
> val dailycached = spark.sql("select something from table where dt = 
> '20170301' AND something IS NOT NULL")
> dailycached.createOrReplaceTempView("dailycached") 
> spark.catalog.cacheTable("dailyCached")
> spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show()
>  
> On Spark 2.2, query times average 13 seconds.
> On the same nodes, Spark 2.3 query times average 17 seconds.
> Note these are times for queries after the initial caching, i.e. just running 
> the last line again: 
> spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show() 
> multiple times.
>  
> I also ran a query over more data (335 GB input / 587.5 GB cached) and saw a 
> similar discrepancy in the performance of querying cached data between Spark 
> 2.3 and Spark 2.2, where 2.2 was faster by roughly 20%.






[jira] [Commented] (SPARK-23284) Document several get API of ColumnVector's behavior when accessing null slot

2018-02-01 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349353#comment-16349353
 ] 

Sameer Agarwal commented on SPARK-23284:


Thanks, given the other open blockers, we should have enough time to get this 
in before RC3.

> Document several get API of ColumnVector's behavior when accessing null slot
> 
>
> Key: SPARK-23284
> URL: https://issues.apache.org/jira/browse/SPARK-23284
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> We should clearly document the behavior of some ColumnVector get APIs such as 
> getBinary, getStruct, getArray, etc., when accessing a null slot. Those APIs 
> should return null if the slot is null.
>  
>  
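With the contract documented as described, callers can either test for the null 
slot first or rely on the getter returning null; a sketch of both, assuming the 
behavior stated above:

{code:language=scala}
import org.apache.spark.sql.vectorized.ColumnVector

// Either pattern is safe under the documented contract: getBinary returns
// null when the slot is null, and isNullAt lets callers check up front.
def readBinary(col: ColumnVector, row: Int): Option[Array[Byte]] =
  if (col.isNullAt(row)) None else Some(col.getBinary(row))
{code}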






[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working

2018-02-01 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349346#comment-16349346
 ] 

Sameer Agarwal commented on SPARK-23304:


Also, is there a JIRA/repro for the caching issue you mentioned? We can 
continue to investigate that in parallel (cc [~kiszk])

> Spark SQL coalesce() against hive not working
> -
>
> Key: SPARK-23304
> URL: https://issues.apache.org/jira/browse/SPARK-23304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Xiao Li
>Priority: Major
>
> The query below seems to ignore the coalesce. This is running Spark 2.2 or 
> Spark 2.3 against Hive, reading ORC:
>  
>  Query:
>  spark.sql("SELECT COUNT(DISTINCT(something)) FROM sometable WHERE dt >= 
> '20170301' AND dt <= '20170331' AND something IS NOT 
> NULL").coalesce(16).show()
>   
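One thing worth keeping in mind while investigating: {{coalesce(n)}} only 
narrows the existing partitioning without a shuffle and is a no-op when the 
input already has fewer than n partitions, whereas {{repartition(n)}} always 
shuffles to exactly n. A sketch of the distinction (illustrative, not a 
diagnosis of this bug):

{code:language=scala}
val q = spark.sql(
  "SELECT COUNT(DISTINCT(something)) FROM sometable WHERE something IS NOT NULL")

// coalesce never increases partition count: if the aggregate output already
// has fewer than 16 partitions, coalesce(16) changes nothing.
q.coalesce(16).rdd.getNumPartitions

// repartition always inserts a shuffle and yields exactly 16 partitions.
q.repartition(16).rdd.getNumPartitions
{code}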






[jira] [Commented] (SPARK-23307) Spark UI should sort jobs/stages with the completed timestamp before cleaning them up

2018-02-01 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349323#comment-16349323
 ] 

Sameer Agarwal commented on SPARK-23307:


Bumping this to a blocker for 2.3

> Spark UI should sort jobs/stages with the completed timestamp before cleaning 
> them up
> -
>
> Key: SPARK-23307
> URL: https://issues.apache.org/jira/browse/SPARK-23307
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
>
> When you have a long-running job, it may be deleted from the UI quickly after 
> it completes if you happen to run a small job afterwards. This is pretty 
> annoying when you run lots of jobs in the same driver concurrently (e.g., 
> running multiple Structured Streaming queries). We should sort jobs/stages by 
> the completed timestamp before cleaning them up.
> In 2.2, Spark had a separate buffer for completed jobs/stages, so it didn't 
> need to sort them.
> The behavior I expect: set "spark.ui.retainedJobs" to 10 and run the following 
> code; job 0 should be kept in the Spark UI.
>  
> {code:java}
> new Thread() {
>   override def run() {
>     // job 0
>     sc.makeRDD(1 to 1, 1).foreach { i =>
>     Thread.sleep(1)
>    }
>   }
> }.start()
> Thread.sleep(1000)
> for (_ <- 1 to 20) {
>   new Thread() {
>     override def run() {
>       sc.makeRDD(1 to 1, 1).foreach { i =>
>       }
>     }
>   }.start()
> }
> Thread.sleep(15000)
>   sc.makeRDD(1 to 1, 1).foreach { i =>
> }
> {code}






[jira] [Updated] (SPARK-23307) Spark UI should sort jobs/stages with the completed timestamp before cleaning them up

2018-02-01 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-23307:
---
Priority: Blocker  (was: Major)

> Spark UI should sort jobs/stages with the completed timestamp before cleaning 
> them up
> -
>
> Key: SPARK-23307
> URL: https://issues.apache.org/jira/browse/SPARK-23307
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
>
> When you have a long-running job, it may be deleted from the UI quickly after 
> it completes if you happen to run a small job afterwards. This is pretty 
> annoying when you run lots of jobs in the same driver concurrently (e.g., 
> running multiple Structured Streaming queries). We should sort jobs/stages by 
> the completed timestamp before cleaning them up.
> In 2.2, Spark had a separate buffer for completed jobs/stages, so it didn't 
> need to sort them.
> The behavior I expect: set "spark.ui.retainedJobs" to 10 and run the following 
> code; job 0 should be kept in the Spark UI.
>  
> {code:java}
> new Thread() {
>   override def run() {
>     // job 0
>     sc.makeRDD(1 to 1, 1).foreach { i =>
>     Thread.sleep(1)
>    }
>   }
> }.start()
> Thread.sleep(1000)
> for (_ <- 1 to 20) {
>   new Thread() {
>     override def run() {
>       sc.makeRDD(1 to 1, 1).foreach { i =>
>       }
>     }
>   }.start()
> }
> Thread.sleep(15000)
>   sc.makeRDD(1 to 1, 1).foreach { i =>
> }
> {code}






[jira] [Updated] (SPARK-23307) Spark UI should sort jobs/stages with the completed timestamp before cleaning them up

2018-02-01 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-23307:
---
Target Version/s: 2.3.0

> Spark UI should sort jobs/stages with the completed timestamp before cleaning 
> them up
> -
>
> Key: SPARK-23307
> URL: https://issues.apache.org/jira/browse/SPARK-23307
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
>
> When you have a long-running job, it may be deleted from the UI quickly after 
> it completes if you happen to run a small job afterwards. This is pretty 
> annoying when you run lots of jobs in the same driver concurrently (e.g., 
> running multiple Structured Streaming queries). We should sort jobs/stages by 
> the completed timestamp before cleaning them up.
> In 2.2, Spark had a separate buffer for completed jobs/stages, so it didn't 
> need to sort them.
> The behavior I expect: set "spark.ui.retainedJobs" to 10 and run the following 
> code; job 0 should be kept in the Spark UI.
>  
> {code:java}
> new Thread() {
>   override def run() {
>     // job 0
>     sc.makeRDD(1 to 1, 1).foreach { i =>
>     Thread.sleep(1)
>    }
>   }
> }.start()
> Thread.sleep(1000)
> for (_ <- 1 to 20) {
>   new Thread() {
>     override def run() {
>       sc.makeRDD(1 to 1, 1).foreach { i =>
>       }
>     }
>   }.start()
> }
> Thread.sleep(15000)
>   sc.makeRDD(1 to 1, 1).foreach { i =>
> }
> {code}






[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working

2018-02-01 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349147#comment-16349147
 ] 

Sameer Agarwal commented on SPARK-23304:


[~tgraves] just to rule out the obvious, was there a difference in the number 
of partitions in {{spark.sql("SELECT COUNT(DISTINCT(something)) FROM sometable 
WHERE dt >= '20170301' AND dt <= '20170331' AND something IS NOT NULL")}} between 
Spark 2.2 and 2.3?

> Spark SQL coalesce() against hive not working
> -
>
> Key: SPARK-23304
> URL: https://issues.apache.org/jira/browse/SPARK-23304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Xiao Li
>Priority: Blocker
>
> Testing with Spark 2.3, I see a difference in how the SQL coalesce behaves 
> against Hive compared to Spark 2.2. Spark 2.3 seems to ignore the coalesce.
>  
> Query:
> spark.sql("SELECT COUNT(DISTINCT(something)) FROM sometable WHERE dt >= 
> '20170301' AND dt <= '20170331' AND something IS NOT 
> NULL").coalesce(16).show()
>  
> In Spark 2.2 the coalesce works here, but in Spark 2.3 it doesn't.






[jira] [Commented] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs

2018-02-01 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349090#comment-16349090
 ] 

Sameer Agarwal commented on SPARK-23107:


Thanks [~yanboliang], I'll cut the next RC as soon as the remaining blockers 
are resolved: [https://s.apache.org/oXKi]

> ML, Graph 2.3 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-23107
> URL: https://issues.apache.org/jira/browse/SPARK-23107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
> Fix For: 2.3.0
>
>
> Audit new public Scala APIs added to MLlib & GraphX. Take note of:
>  * Protected/public classes or methods. If access can be more private, then 
> it should be.
>  * Also look for non-sealed traits.
>  * Documentation: Missing? Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue. 
> For *user guide issues* link the new JIRAs to the relevant user guide QA 
> issue (SPARK-23111 for {{2.3}})






[jira] [Commented] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs

2018-01-31 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347576#comment-16347576
 ] 

Sameer Agarwal commented on SPARK-23107:


[~yanboliang] other than adding docs, are you considering any pending API 
changes that should block the next RC?

> ML, Graph 2.3 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-23107
> URL: https://issues.apache.org/jira/browse/SPARK-23107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX. Take note of:
>  * Protected/public classes or methods. If access can be more private, then 
> it should be.
>  * Also look for non-sealed traits.
>  * Documentation: Missing? Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue. 
> For *user guide issues* link the new JIRAs to the relevant user guide QA 
> issue (SPARK-23111 for {{2.3}})






[jira] [Updated] (SPARK-23202) Break down DataSourceV2Writer.commit into two phase

2018-01-30 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-23202:
---
Target Version/s: 2.3.0

> Break down DataSourceV2Writer.commit into two phase
> ---
>
> Key: SPARK-23202
> URL: https://issues.apache.org/jira/browse/SPARK-23202
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Gengliang Wang
>Priority: Blocker
>
> Currently, the API DataSourceV2Writer#commit(WriterCommitMessage[]) commits a 
> writing job with a list of commit messages.
> This makes sense in some scenarios, e.g. MicroBatchExecution.
> However, on receiving a commit message, the driver can start processing 
> messages (e.g. persisting them into files) before all the messages have been 
> collected.
> The proposal is to break down DataSourceV2Writer.commit into two phases:
>  # add(WriterCommitMessage message): handles a commit message produced by 
> \{@link DataWriter#commit()}.
>  # commit(): commits the writing job.
> This should make the API more flexible and more natural to implement for some 
> data sources.
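> A rough sketch of the proposed two-phase shape (the trait name is 
> illustrative, not the final interface; WriterCommitMessage is the existing 
> message type):
> {code}
> import org.apache.spark.sql.sources.v2.writer.WriterCommitMessage
>
> trait TwoPhaseWriter {
>   // Phase 1: invoked on the driver as each message arrives, so messages
>   // can be processed (e.g. persisted to files) incrementally.
>   def add(message: WriterCommitMessage): Unit
>
>   // Phase 2: commits the writing job once all messages have been added.
>   def commit(): Unit
> }
> {code}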



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23020) Re-enable Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher

2018-01-28 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-23020:
---
Fix Version/s: (was: 2.3.0)

> Re-enable Flaky Test: 
> org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
> 
>
> Key: SPARK-23020
> URL: https://issues.apache.org/jira/browse/SPARK-23020
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Sameer Agarwal
>Assignee: Marcelo Vanzin
>Priority: Blocker
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23020) Re-enable Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher

2018-01-28 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-23020:
---
Affects Version/s: (was: 2.3.0)
   2.4.0

> Re-enable Flaky Test: 
> org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
> 
>
> Key: SPARK-23020
> URL: https://issues.apache.org/jira/browse/SPARK-23020
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Sameer Agarwal
>Assignee: Marcelo Vanzin
>Priority: Blocker
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23020) Re-enable Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher

2018-01-28 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-23020:
---
Target Version/s: 2.4.0  (was: 2.3.0)

> Re-enable Flaky Test: 
> org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
> 
>
> Key: SPARK-23020
> URL: https://issues.apache.org/jira/browse/SPARK-23020
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Sameer Agarwal
>Assignee: Marcelo Vanzin
>Priority: Blocker
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-23020) Re-enable Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher

2018-01-28 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal reopened SPARK-23020:


> Re-enable Flaky Test: 
> org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
> 
>
> Key: SPARK-23020
> URL: https://issues.apache.org/jira/browse/SPARK-23020
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Assignee: Marcelo Vanzin
>Priority: Blocker
> Fix For: 2.3.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23020) Re-enable Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher

2018-01-28 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16342985#comment-16342985
 ] 

Sameer Agarwal commented on SPARK-23020:


I'm sorry but the flakiness in the test still refuses to go away: 
[https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/154/testReport/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/.|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/154/testReport/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/]

 

Per Marcelo's suggestion, I'm going to (only) disable this test in 2.3. The 
master builds are failing similarly 
([https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/4426/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/])
so I hope it won't hinder any investigation.

> Re-enable Flaky Test: 
> org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
> 
>
> Key: SPARK-23020
> URL: https://issues.apache.org/jira/browse/SPARK-23020
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Assignee: Marcelo Vanzin
>Priority: Blocker
> Fix For: 2.3.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-22797) Add multiple column support to PySpark Bucketizer

2018-01-26 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal reopened SPARK-22797:


> Add multiple column support to PySpark Bucketizer
> -
>
> Key: SPARK-22797
> URL: https://issues.apache.org/jira/browse/SPARK-22797
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Assignee: zhengruifeng
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22797) Add multiple column support to PySpark Bucketizer

2018-01-26 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-22797:
---
Fix Version/s: (was: 2.3.0)

> Add multiple column support to PySpark Bucketizer
> -
>
> Key: SPARK-22797
> URL: https://issues.apache.org/jira/browse/SPARK-22797
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Assignee: zhengruifeng
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23207) Shuffle+Repartition on an DataFrame could lead to incorrect answers

2018-01-26 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal resolved SPARK-23207.

   Resolution: Fixed
Fix Version/s: 2.4.0
   2.3.0

Issue resolved by pull request 20393 https://github.com/apache/spark/pull/20393

> Shuffle+Repartition on an DataFrame could lead to incorrect answers
> ---
>
> Key: SPARK-23207
> URL: https://issues.apache.org/jira/browse/SPARK-23207
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Blocker
> Fix For: 2.3.0, 2.4.0
>
>
> Currently shuffle repartition uses RoundRobinPartitioning, and the generated 
> result is nondeterministic since the order of the input rows is not 
> determined.
> The bug can be triggered when there is a repartition call following a shuffle 
> (which would lead to non-deterministic row ordering), as the pattern shows 
> below:
> upstream stage -> repartition stage -> result stage
> (-> indicates a shuffle)
> When one of the executor processes goes down, some tasks on the repartition 
> stage will be retried and generate an inconsistent ordering, and some tasks of 
> the result stage will be retried, generating different data.
> The following code returns 931532 instead of 1000000:
> {code}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
>   x
> }.repartition(200).map { x =>
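> // "pkill -f java".!! shells out and kills the executor JVMs, so the first
> // attempt of the first two partitions always dies and gets retried,
> // replaying the nondeterministic round-robin repartition.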
>   if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
> throw new Exception("pkill -f java".!!)
>   }
>   x
> }
> res.distinct().count()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21717) Decouple the generated codes of consuming rows in operators under whole-stage codegen

2018-01-24 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-21717:
---
Target Version/s: 2.3.0
Priority: Critical  (was: Major)

> Decouple the generated codes of consuming rows in operators under whole-stage 
> codegen
> -
>
> Key: SPARK-21717
> URL: https://issues.apache.org/jira/browse/SPARK-21717
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Priority: Critical
>
> It has been observed in SPARK-21603 that whole-stage codegen suffers 
> performance degradation if the generated functions are too long to be 
> optimized by the JIT.
> We currently produce a single function that incorporates the generated code 
> from all the physical operators in a whole-stage subtree. The size of that 
> function can therefore grow past the threshold beyond which the JIT no longer 
> optimizes it.
> This ticket decouples the logic of consuming rows in physical operators, so 
> that rows are not processed by one giant function.
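> A hand-written sketch of the idea (illustrative plain Scala, not actual 
> codegen output): instead of one fused method holding every operator's consume 
> logic, each operator's code lives in its own method, so the JIT can compile 
> each one independently.
> {code}
> object ConsumeSketch {
>   // Decoupled style: each "operator" consumes a row in its own small method.
>   private def filterConsume(x: Long): Unit = if (x % 2 == 0) projectConsume(x)
>   private def projectConsume(x: Long): Unit = println(x * 2)
>
>   def main(args: Array[String]): Unit = (0L until 8L).foreach(filterConsume)
> }
> {code}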



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23207) Shuffle+Repartition on an RDD/DataFrame could lead to Data Loss

2018-01-24 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal reassigned SPARK-23207:
--

Assignee: Jiang Xingbo

> Shuffle+Repartition on an RDD/DataFrame could lead to Data Loss
> ---
>
> Key: SPARK-23207
> URL: https://issues.apache.org/jira/browse/SPARK-23207
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Major
>
> Currently shuffle repartition uses RoundRobinPartitioning, and the generated 
> result is nondeterministic since the order of the input rows is not 
> determined.
> The bug can be triggered when there is a repartition call following a shuffle 
> (which would lead to non-deterministic row ordering), as the pattern shows 
> below:
> upstream stage -> repartition stage -> result stage
> (-> indicates a shuffle)
> When one of the executor processes goes down, some tasks on the repartition 
> stage will be retried and generate an inconsistent ordering, and some tasks of 
> the result stage will be retried, generating different data.
> The following code returns 931532 instead of 1000000:
> {code}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
>   x
> }.repartition(200).map { x =>
>   if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
> throw new Exception("pkill -f java".!!)
>   }
>   x
> }
> res.distinct().count()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23207) Shuffle+Repartition on an RDD/DataFrame could lead to Data Loss

2018-01-24 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-23207:
---
Priority: Blocker  (was: Major)

> Shuffle+Repartition on an RDD/DataFrame could lead to Data Loss
> ---
>
> Key: SPARK-23207
> URL: https://issues.apache.org/jira/browse/SPARK-23207
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Blocker
>
> Currently shuffle repartition uses RoundRobinPartitioning, and the generated 
> result is nondeterministic since the order of the input rows is not 
> determined.
> The bug can be triggered when there is a repartition call following a shuffle 
> (which would lead to non-deterministic row ordering), as the pattern shows 
> below:
> upstream stage -> repartition stage -> result stage
> (-> indicates a shuffle)
> When one of the executor processes goes down, some tasks on the repartition 
> stage will be retried and generate an inconsistent ordering, and some tasks of 
> the result stage will be retried, generating different data.
> The following code returns 931532 instead of 1000000:
> {code}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
>   x
> }.repartition(200).map { x =>
>   if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
> throw new Exception("pkill -f java".!!)
>   }
>   x
> }
> res.distinct().count()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23020) Re-enable Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher

2018-01-24 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338016#comment-16338016
 ] 

Sameer Agarwal commented on SPARK-23020:


FYI, the {{SparkLauncherSuite}} test is still failing occasionally (though a 
lot less commonly): 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.6/142/testReport/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/

> Re-enable Flaky Test: 
> org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
> 
>
> Key: SPARK-23020
> URL: https://issues.apache.org/jira/browse/SPARK-23020
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Assignee: Marcelo Vanzin
>Priority: Blocker
> Fix For: 2.3.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22739) Additional Expression Support for Objects

2018-01-23 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-22739:
---
Target Version/s: 2.4.0  (was: 2.3.0)

> Additional Expression Support for Objects
> -
>
> Key: SPARK-22739
> URL: https://issues.apache.org/jira/browse/SPARK-22739
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Aleksander Eskilson
>Priority: Major
>
> Some discussion in Spark-Avro [1] motivates additions and minor changes to 
> the {{Objects}} Expressions API [2]. The proposed changes include
> * a generalized form of {{initializeJavaBean}} taking a sequence of 
> initialization expressions that can be applied to instances of varying objects
> * an object cast that performs a simple Java type cast against a value
> * making {{ExternalMapToCatalyst}} public, for use in outside libraries
> These changes would facilitate the writing of custom encoders for varying 
> objects that cannot already be readily converted to a statically typed 
> dataset by a JavaBean encoder (e.g. Avro).
> [1] -- 
> https://github.com/databricks/spark-avro/pull/217#issuecomment-342599110
> [2] --
>  
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
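> A self-contained analogue (plain Scala, no Catalyst internals) of what the 
> generalized {{initializeJavaBean}} would do: apply an arbitrary sequence of 
> initialization steps to a freshly constructed object and return it.
> {code}
> // Run each initializer against the instance, in order, then return it.
> def initialize[T](instance: T)(initializers: Seq[T => Unit]): T = {
>   initializers.foreach(f => f(instance))
>   instance
> }
>
> class Person { var name: String = _; var age: Int = _ }
> val p = initialize(new Person)(Seq(b => b.name = "avro", b => b.age = 1))
> {code}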



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22739) Additional Expression Support for Objects

2018-01-23 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336565#comment-16336565
 ] 

Sameer Agarwal commented on SPARK-22739:


Agree, I'll re-target this for 2.4.0 for now – [~aeskilson] [~marmbrus] please 
let us know if you disagree.

> Additional Expression Support for Objects
> -
>
> Key: SPARK-22739
> URL: https://issues.apache.org/jira/browse/SPARK-22739
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Aleksander Eskilson
>Priority: Major
>
> Some discussion in Spark-Avro [1] motivates additions and minor changes to 
> the {{Objects}} Expressions API [2]. The proposed changes include
> * a generalized form of {{initializeJavaBean}} taking a sequence of 
> initialization expressions that can be applied to instances of varying objects
> * an object cast that performs a simple Java type cast against a value
> * making {{ExternalMapToCatalyst}} public, for use in outside libraries
> These changes would facilitate the writing of custom encoders for varying 
> objects that cannot already be readily converted to a statically typed 
> dataset by a JavaBean encoder (e.g. Avro).
> [1] -- 
> https://github.com/databricks/spark-avro/pull/217#issuecomment-342599110
> [2] --
>  
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-21646) Add new type coercion rules to compatible with Hive

2018-01-22 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-21646:
---
Comment: was deleted

(was: Marking it as a duplicate of SPARK-22722)

> Add new type coercion rules to compatible with Hive
> ---
>
> Key: SPARK-21646
> URL: https://issues.apache.org/jira/browse/SPARK-21646
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: Type_coercion_rules_to_compatible_with_Hive.pdf
>
>
> How to reproduce:
> hive:
> {code:sql}
> $ hive -S
> hive> create table spark_21646(c1 string, c2 string);
> hive> insert into spark_21646 values('92233720368547758071', 'a');
> hive> insert into spark_21646 values('21474836471', 'b');
> hive> insert into spark_21646 values('10', 'c');
> hive> select * from spark_21646 where c1 > 0;
> 92233720368547758071  a
> 10c
> 21474836471   b
> hive>
> {code}
> spark-sql:
> {code:sql}
> $ spark-sql -S
> spark-sql> select * from spark_21646 where c1 > 0;
> 10  c 
>   
> spark-sql> select * from spark_21646 where c1 > 0L;
> 21474836471   b
> 10c
> spark-sql> explain select * from spark_21646 where c1 > 0;
> == Physical Plan ==
> *Project [c1#14, c2#15]
> +- *Filter (isnotnull(c1#14) && (cast(c1#14 as int) > 0))
>+- *FileScan parquet spark_21646[c1#14,c2#15] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[viewfs://cluster4/user/hive/warehouse/spark_21646], 
> PartitionFilters: [], PushedFilters: [IsNotNull(c1)], ReadSchema: 
> struct
> spark-sql> 
> {code}
> As you can see, Spark automatically casts c1 to int; if the value is out of 
> the integer range, the result differs from Hive's.
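> A possible workaround sketch until the coercion rules change (using the 
> table above): force the comparison onto a type wide enough for the value 
> instead of relying on the implicit cast to int.
> {code}
> // decimal(38,0) can hold all three c1 values, so the filter matches Hive.
> spark.sql("select * from spark_21646 where cast(c1 as decimal(38,0)) > 0").show()
> {code}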



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23000) Flaky test suite DataSourceWithHiveMetastoreCatalogSuite in Spark 2.3

2018-01-21 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal resolved SPARK-23000.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by [https://github.com/apache/spark/pull/20328]

> Flaky test suite DataSourceWithHiveMetastoreCatalogSuite in Spark 2.3
> -
>
> Key: SPARK-23000
> URL: https://issues.apache.org/jira/browse/SPARK-23000
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
> Fix For: 2.3.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-sbt-hadoop-2.6/
> The test suite DataSourceWithHiveMetastoreCatalogSuite on branch-2.3 always 
> fails with Hadoop 2.6.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23135) Accumulators don't show up properly in the Stages page anymore

2018-01-19 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal resolved SPARK-23135.

Resolution: Fixed

> Accumulators don't show up properly in the Stages page anymore
> --
>
> Key: SPARK-23135
> URL: https://issues.apache.org/jira/browse/SPARK-23135
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
> Environment:  
>  
>  
>Reporter: Burak Yavuz
>Assignee: Marcelo Vanzin
>Priority: Blocker
> Fix For: 2.3.0
>
> Attachments: webUIAccumulatorRegression.png
>
>
> Didn't do a lot of digging but may be caused by:
> [https://github.com/apache/spark/commit/1c70da3bfbb4016e394de2c73eb0db7cdd9a6968#diff-0d37752c6ec3d902aeff701771b4e932]
>  
> !webUIAccumulatorRegression.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23135) Accumulators don't show up properly in the Stages page anymore

2018-01-19 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal reassigned SPARK-23135:
--

 Assignee: Marcelo Vanzin
Fix Version/s: 2.3.0

Issue resolved by pull request 20299 https://github.com/apache/spark/pull/20299

> Accumulators don't show up properly in the Stages page anymore
> --
>
> Key: SPARK-23135
> URL: https://issues.apache.org/jira/browse/SPARK-23135
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
> Environment:  
>  
>  
>Reporter: Burak Yavuz
>Assignee: Marcelo Vanzin
>Priority: Blocker
> Fix For: 2.3.0
>
> Attachments: webUIAccumulatorRegression.png
>
>
> Didn't do a lot of digging but may be caused by:
> [https://github.com/apache/spark/commit/1c70da3bfbb4016e394de2c73eb0db7cdd9a6968#diff-0d37752c6ec3d902aeff701771b4e932]
>  
> !webUIAccumulatorRegression.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23135) Accumulators don't show up properly in the Stages page anymore

2018-01-17 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-23135:
---
Target Version/s: 2.3.0

> Accumulators don't show up properly in the Stages page anymore
> --
>
> Key: SPARK-23135
> URL: https://issues.apache.org/jira/browse/SPARK-23135
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
> Environment:  
>  
>  
>Reporter: Burak Yavuz
>Priority: Blocker
> Attachments: webUIAccumulatorRegression.png
>
>
> Didn't do a lot of digging but may be caused by:
> [https://github.com/apache/spark/commit/1c70da3bfbb4016e394de2c73eb0db7cdd9a6968#diff-0d37752c6ec3d902aeff701771b4e932]
>  
> !webUIAccumulatorRegression.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23104) Document that kubernetes is still "experimental"

2018-01-17 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal reassigned SPARK-23104:
--

Assignee: Anirudh Ramanathan

> Document that kubernetes is still "experimental"
> 
>
> Key: SPARK-23104
> URL: https://issues.apache.org/jira/browse/SPARK-23104
> Project: Spark
>  Issue Type: Task
>  Components: Documentation, Kubernetes
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Anirudh Ramanathan
>Priority: Critical
>
> As discussed in the mailing list, we should document that the kubernetes 
> backend is still experimental.
> That does not need to include any code changes. This is just meant to tell 
> users that they can expect changes in how the backend behaves in future 
> versions, and that things like configuration, the container image's layout 
> and entry points might change.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23104) Document that kubernetes is still "experimental"

2018-01-17 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329644#comment-16329644
 ] 

Sameer Agarwal commented on SPARK-23104:


Sounds great, thanks!

> Document that kubernetes is still "experimental"
> 
>
> Key: SPARK-23104
> URL: https://issues.apache.org/jira/browse/SPARK-23104
> Project: Spark
>  Issue Type: Task
>  Components: Documentation, Kubernetes
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Critical
>
> As discussed in the mailing list, we should document that the kubernetes 
> backend is still experimental.
> That does not need to include any code changes. This is just meant to tell 
> users that they can expect changes in how the backend behaves in future 
> versions, and that things like configuration, the container image's layout 
> and entry points might change.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23020) Re-enable Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher

2018-01-17 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-23020:
---
Summary: Re-enable Flaky Test: 
org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher  (was: Flaky 
Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher)

> Re-enable Flaky Test: 
> org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
> 
>
> Key: SPARK-23020
> URL: https://issues.apache.org/jira/browse/SPARK-23020
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Assignee: Marcelo Vanzin
>Priority: Blocker
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23020) Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher

2018-01-16 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16328325#comment-16328325
 ] 

Sameer Agarwal commented on SPARK-23020:


I had to revert this patch as it broke [{{YarnClusterSuite.timeout to get 
SparkContext in cluster mode triggers failure}}| 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/94/testReport/org.apache.spark.deploy.yarn/YarnClusterSuite/timeout_to_get_SparkContext_in_cluster_mode_triggers_failure/history/]
 

[{{SparkLauncherSuite.testInProcessLauncher}}|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/90/testReport/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/]
still seems to be flaky.

> Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
> --
>
> Key: SPARK-23020
> URL: https://issues.apache.org/jira/browse/SPARK-23020
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Assignee: Marcelo Vanzin
>Priority: Blocker
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-23020) Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher

2018-01-16 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal reopened SPARK-23020:


> Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
> --
>
> Key: SPARK-23020
> URL: https://issues.apache.org/jira/browse/SPARK-23020
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Assignee: Marcelo Vanzin
>Priority: Blocker
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23020) Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher

2018-01-16 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-23020:
---
Fix Version/s: (was: 2.3.0)

> Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
> --
>
> Key: SPARK-23020
> URL: https://issues.apache.org/jira/browse/SPARK-23020
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Assignee: Marcelo Vanzin
>Priority: Blocker
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22739) Additional Expression Support for Objects

2018-01-15 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-22739:
---
Priority: Major  (was: Critical)

> Additional Expression Support for Objects
> -
>
> Key: SPARK-22739
> URL: https://issues.apache.org/jira/browse/SPARK-22739
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Aleksander Eskilson
>Priority: Major
>
> Some discussion in Spark-Avro [1] motivates additions and minor changes to 
> the {{Objects}} Expressions API [2]. The proposed changes include
> * a generalized form of {{initializeJavaBean}} taking a sequence of 
> initialization expressions that can be applied to instances of varying objects
> * an object cast that performs a simple Java type cast against a value
> * making {{ExternalMapToCatalyst}} public, for use in outside libraries
> These changes would facilitate the writing of custom encoders for varying 
> objects that cannot already be readily converted to a statically typed 
> dataset by a JavaBean encoder (e.g. Avro).
> [1] -- 
> https://github.com/databricks/spark-avro/pull/217#issuecomment-342599110
> [2] --
>  
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23020) Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher

2018-01-15 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal resolved SPARK-23020.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20223
[https://github.com/apache/spark/pull/20223]

> Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
> --
>
> Key: SPARK-23020
> URL: https://issues.apache.org/jira/browse/SPARK-23020
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Assignee: Marcelo Vanzin
>Priority: Blocker
> Fix For: 2.3.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23020) Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher

2018-01-15 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal reassigned SPARK-23020:
--

Assignee: Marcelo Vanzin

> Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
> --
>
> Key: SPARK-23020
> URL: https://issues.apache.org/jira/browse/SPARK-23020
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Assignee: Marcelo Vanzin
>Priority: Blocker
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23065) R API doc empty in Spark 2.3.0 RC1

2018-01-13 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16325396#comment-16325396
 ] 

Sameer Agarwal commented on SPARK-23065:


No, the jekyll error didn't fail the doc build. We should investigate why that 
happened.

Style wise, both 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-docs/_site/api/R/index.html
 and https://spark.apache.org/docs/latest/api/R/index.html look identical to me 
(attached screenshots). Is there something that I'm missing? Could it be your 
local browser cache?

By the way, FWIW, the R logo looks better in 
https://spark.apache.org/docs/2.2.0/api/R/index.html so it seems like something 
might've changed in 2.2.1.





> R API doc empty in Spark 2.3.0 RC1
> --
>
> Key: SPARK-23065
> URL: https://issues.apache.org/jira/browse/SPARK-23065
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Assignee: Sameer Agarwal
>Priority: Blocker
> Attachments: Screen Shot 2018-01-13 at 3.15.48 PM.png, Screen Shot 
> 2018-01-13 at 3.16.06 PM.png
>
>
> [~sameerag]
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-docs/_site/api/R/index.html
> Did it fail to build?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23065) R API doc empty in Spark 2.3.0 RC1

2018-01-13 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-23065:
---
Attachment: Screen Shot 2018-01-13 at 3.15.48 PM.png
Screen Shot 2018-01-13 at 3.16.06 PM.png

> R API doc empty in Spark 2.3.0 RC1
> --
>
> Key: SPARK-23065
> URL: https://issues.apache.org/jira/browse/SPARK-23065
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Assignee: Sameer Agarwal
>Priority: Blocker
> Attachments: Screen Shot 2018-01-13 at 3.15.48 PM.png, Screen Shot 
> 2018-01-13 at 3.16.06 PM.png
>
>
> [~sameerag]
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-docs/_site/api/R/index.html
> Did it fail to build?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23065) R API doc empty in Spark 2.3.0 RC1

2018-01-13 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16325068#comment-16325068
 ] 

Sameer Agarwal commented on SPARK-23065:


Thanks for verifying this [~felixcheung]! While the build didn't fail in an 
obvious way, it seems like it only uploaded partially generated R docs. I tried 
running it manually and it turned out to be a roxygen version issue:

{code}
~/dev/spark/spark/docs SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 SKIP_SQLDOC=1 jekyll 
build
Configuration file: /Users/sameer/dev/spark/spark/docs/_config.yml
   Deprecation: The 'gems' configuration option has been renamed to 
'plugins'. Please update your config file accordingly.
Moving to R directory and building roxygen docs.
Using Scala 2.11
Using R_SCRIPT_PATH = /usr/local/bin
 dirname /Users/sameer/dev/spark/spark/R/install-dev.sh
+++ cd /Users/sameer/dev/spark/spark/R
+++ pwd
++ FWDIR=/Users/sameer/dev/spark/spark/R
++ LIB_DIR=/Users/sameer/dev/spark/spark/R/lib
++ mkdir -p /Users/sameer/dev/spark/spark/R/lib
++ pushd /Users/sameer/dev/spark/spark/R
++ . /Users/sameer/dev/spark/spark/R/find-r.sh
+++ '[' -z /usr/local/bin ']'
++ . /Users/sameer/dev/spark/spark/R/create-rd.sh
+++ set -o pipefail
+++ set -e
+ dirname /Users/sameer/dev/spark/spark/R/create-rd.sh
 cd /Users/sameer/dev/spark/spark/R
 pwd
+++ FWDIR=/Users/sameer/dev/spark/spark/R
+++ pushd /Users/sameer/dev/spark/spark/R
+++ . /Users/sameer/dev/spark/spark/R/find-r.sh
 '[' -z /usr/local/bin ']'
+++ /usr/local/bin/Rscript -e ' if("devtools" %in% 
rownames(installed.packages())) { library(devtools); 
devtools::document(pkg="./pkg", roclets=c("rd")) }'
Error: ‘roxygen2’ >= 5.0.0 must be installed for this functionality.
Execution halted
jekyll 3.7.0 | Error:  R doc generation failed
{code}

I've fixed this issue and updated the docs at 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-docs/_site/index.html. 
Can you please verify if everything looks okay now?

> R API doc empty in Spark 2.3.0 RC1
> --
>
> Key: SPARK-23065
> URL: https://issues.apache.org/jira/browse/SPARK-23065
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Assignee: Sameer Agarwal
>Priority: Blocker
>
> [~sameerag]
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-docs/_site/api/R/index.html
> Did it fail to build?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23065) R API doc empty in Spark 2.3.0 RC1

2018-01-13 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal reassigned SPARK-23065:
--

Assignee: Sameer Agarwal

> R API doc empty in Spark 2.3.0 RC1
> --
>
> Key: SPARK-23065
> URL: https://issues.apache.org/jira/browse/SPARK-23065
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Assignee: Sameer Agarwal
>Priority: Blocker
>
> [~sameerag]
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-docs/_site/api/R/index.html
> Did it fail to build?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23063) Changes to publish the spark-kubernetes package

2018-01-12 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-23063:
---
Target Version/s: 2.3.0
Priority: Blocker  (was: Major)

> Changes to publish the spark-kubernetes package
> ---
>
> Key: SPARK-23063
> URL: https://issues.apache.org/jira/browse/SPARK-23063
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Blocker
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23055) KafkaContinuousSourceSuite Kafka column types test failing

2018-01-12 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal resolved SPARK-23055.

   Resolution: Fixed
 Assignee: Sameer Agarwal
Fix Version/s: 2.3.0

> KafkaContinuousSourceSuite Kafka column types test failing
> --
>
> Key: SPARK-23055
> URL: https://issues.apache.org/jira/browse/SPARK-23055
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Assignee: Sameer Agarwal
>Priority: Critical
> Fix For: 2.3.0
>
>
> KafkaContinuousSourceSuite Kafka column types test fails 
> (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85998/testReport/junit/org.apache.spark.sql.kafka010/KafkaContinuousSourceSuite/Kafka_column_types/).
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/4009/
> {code}
> [info] KafkaContinuousSourceSuite:
> [info] - cannot stop Kafka stream (1 second, 279 milliseconds)
> [info] - assign from latest offsets (failOnDataLoss: true) (9 seconds, 202 
> milliseconds)
> [info] - assign from earliest offsets (failOnDataLoss: true) (8 seconds, 108 
> milliseconds)
> [info] - assign from specific offsets (failOnDataLoss: true) (4 seconds, 102 
> milliseconds)
> [info] - subscribing topic by name from latest offsets (failOnDataLoss: true) 
> (12 seconds, 125 milliseconds)
> [info] - subscribing topic by name from earliest offsets (failOnDataLoss: 
> true) (12 seconds, 69 milliseconds)
> [info] - subscribing topic by name from specific offsets (failOnDataLoss: 
> true) (5 seconds, 935 milliseconds)
> [info] - subscribing topic by pattern from latest offsets (failOnDataLoss: 
> true) (13 seconds, 70 milliseconds)
> [info] - subscribing topic by pattern from earliest offsets (failOnDataLoss: 
> true) (13 seconds, 122 milliseconds)
> [info] - subscribing topic by pattern from specific offsets (failOnDataLoss: 
> true) (7 seconds, 877 milliseconds)
> [info] - assign from latest offsets (failOnDataLoss: false) (12 seconds, 201 
> milliseconds)
> [info] - assign from earliest offsets (failOnDataLoss: false) (12 seconds, 82 
> milliseconds)
> [info] - assign from specific offsets (failOnDataLoss: false) (8 seconds, 530 
> milliseconds)
> [info] - subscribing topic by name from latest offsets (failOnDataLoss: 
> false) (18 seconds, 339 milliseconds)
> [info] - subscribing topic by name from earliest offsets (failOnDataLoss: 
> false) (17 seconds, 397 milliseconds)
> [info] - subscribing topic by name from specific offsets (failOnDataLoss: 
> false) (8 seconds, 926 milliseconds)
> [info] - subscribing topic by pattern from latest offsets (failOnDataLoss: 
> false) (20 seconds, 198 milliseconds)
> Build timed out (after 255 minutes). Marking the build as aborted.
> Build was aborted
> Archiving artifacts
> [info] - subscribing topic by pattern from earliest offsets (failOnDataLoss: 
> false) *** FAILED *** (2 hours, 24 minutes, 19 seconds)
> [info]   Error while stopping stream: 
> {code}
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/4010/console
> {code}
> [info] KafkaContinuousSourceSuite:
> [info] - cannot stop Kafka stream (1 second, 238 milliseconds)
> [info] - assign from latest offsets (failOnDataLoss: true) (9 seconds, 516 
> milliseconds)
> [info] - assign from earliest offsets (failOnDataLoss: true) (7 seconds, 961 
> milliseconds)
> [info] - assign from specific offsets (failOnDataLoss: true) (4 seconds, 193 
> milliseconds)
> [info] - subscribing topic by name from latest offsets (failOnDataLoss: true) 
> (11 seconds, 443 milliseconds)
> [info] - subscribing topic by name from earliest offsets (failOnDataLoss: 
> true) (12 seconds, 674 milliseconds)
> [info] - subscribing topic by name from specific offsets (failOnDataLoss: 
> true) (6 seconds, 13 milliseconds)
> [info] - subscribing topic by pattern from latest offsets (failOnDataLoss: 
> true) (13 seconds, 185 milliseconds)
> Build timed out (after 255 minutes). Marking the build as aborted.
> {code}
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/4011/consoleFull
> {code}
> [info] KafkaContinuousSourceSuite:
> [info] - cannot stop Kafka stream (1 second, 551 milliseconds)
> [info] - assign from latest offsets (failOnDataLoss: true) (8 seconds, 560 
> milliseconds)
> [info] - assign from earliest offsets (failOnDataLoss: true) (8 seconds, 40 
> milliseconds)
> [info] - assign from specific offsets (failOnDataLoss: true) (4 seconds, 373 
> milliseconds)
> [info] - subscribing topic by name from latest offsets (failOnDataLoss: true) 
> (12 seconds, 872 milliseconds)
> 

[jira] [Commented] (SPARK-23055) KafkaContinuousSourceSuite Kafka column types test failing

2018-01-12 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16324718#comment-16324718
 ] 

Sameer Agarwal commented on SPARK-23055:


I've reverted https://github.com/apache/spark/pull/20096 and re-opened 
SPARK-22908 to deflake the builds. Thanks!

> KafkaContinuousSourceSuite Kafka column types test failing
> --
>
> Key: SPARK-23055
> URL: https://issues.apache.org/jira/browse/SPARK-23055
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Critical
>
> KafkaContinuousSourceSuite Kafka column types test fails 
> (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85998/testReport/junit/org.apache.spark.sql.kafka010/KafkaContinuousSourceSuite/Kafka_column_types/).
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/4009/
> {code}
> [info] KafkaContinuousSourceSuite:
> [info] - cannot stop Kafka stream (1 second, 279 milliseconds)
> [info] - assign from latest offsets (failOnDataLoss: true) (9 seconds, 202 
> milliseconds)
> [info] - assign from earliest offsets (failOnDataLoss: true) (8 seconds, 108 
> milliseconds)
> [info] - assign from specific offsets (failOnDataLoss: true) (4 seconds, 102 
> milliseconds)
> [info] - subscribing topic by name from latest offsets (failOnDataLoss: true) 
> (12 seconds, 125 milliseconds)
> [info] - subscribing topic by name from earliest offsets (failOnDataLoss: 
> true) (12 seconds, 69 milliseconds)
> [info] - subscribing topic by name from specific offsets (failOnDataLoss: 
> true) (5 seconds, 935 milliseconds)
> [info] - subscribing topic by pattern from latest offsets (failOnDataLoss: 
> true) (13 seconds, 70 milliseconds)
> [info] - subscribing topic by pattern from earliest offsets (failOnDataLoss: 
> true) (13 seconds, 122 milliseconds)
> [info] - subscribing topic by pattern from specific offsets (failOnDataLoss: 
> true) (7 seconds, 877 milliseconds)
> [info] - assign from latest offsets (failOnDataLoss: false) (12 seconds, 201 
> milliseconds)
> [info] - assign from earliest offsets (failOnDataLoss: false) (12 seconds, 82 
> milliseconds)
> [info] - assign from specific offsets (failOnDataLoss: false) (8 seconds, 530 
> milliseconds)
> [info] - subscribing topic by name from latest offsets (failOnDataLoss: 
> false) (18 seconds, 339 milliseconds)
> [info] - subscribing topic by name from earliest offsets (failOnDataLoss: 
> false) (17 seconds, 397 milliseconds)
> [info] - subscribing topic by name from specific offsets (failOnDataLoss: 
> false) (8 seconds, 926 milliseconds)
> [info] - subscribing topic by pattern from latest offsets (failOnDataLoss: 
> false) (20 seconds, 198 milliseconds)
> Build timed out (after 255 minutes). Marking the build as aborted.
> Build was aborted
> Archiving artifacts
> [info] - subscribing topic by pattern from earliest offsets (failOnDataLoss: 
> false) *** FAILED *** (2 hours, 24 minutes, 19 seconds)
> [info]   Error while stopping stream: 
> {code}
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/4010/console
> {code}
> [info] KafkaContinuousSourceSuite:
> [info] - cannot stop Kafka stream (1 second, 238 milliseconds)
> [info] - assign from latest offsets (failOnDataLoss: true) (9 seconds, 516 
> milliseconds)
> [info] - assign from earliest offsets (failOnDataLoss: true) (7 seconds, 961 
> milliseconds)
> [info] - assign from specific offsets (failOnDataLoss: true) (4 seconds, 193 
> milliseconds)
> [info] - subscribing topic by name from latest offsets (failOnDataLoss: true) 
> (11 seconds, 443 milliseconds)
> [info] - subscribing topic by name from earliest offsets (failOnDataLoss: 
> true) (12 seconds, 674 milliseconds)
> [info] - subscribing topic by name from specific offsets (failOnDataLoss: 
> true) (6 seconds, 13 milliseconds)
> [info] - subscribing topic by pattern from latest offsets (failOnDataLoss: 
> true) (13 seconds, 185 milliseconds)
> Build timed out (after 255 minutes). Marking the build as aborted.
> {code}
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/4011/consoleFull
> {code}
> [info] KafkaContinuousSourceSuite:
> [info] - cannot stop Kafka stream (1 second, 551 milliseconds)
> [info] - assign from latest offsets (failOnDataLoss: true) (8 seconds, 560 
> milliseconds)
> [info] - assign from earliest offsets (failOnDataLoss: true) (8 seconds, 40 
> milliseconds)
> [info] - assign from specific offsets (failOnDataLoss: true) (4 seconds, 373 
> milliseconds)
> [info] - subscribing topic by name from latest offsets (failOnDataLoss: true) 
> (12 seconds, 872 milliseconds)
> 

[jira] [Updated] (SPARK-22908) add basic continuous kafka source

2018-01-12 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-22908:
---
Target Version/s: 2.3.0

> add basic continuous kafka source
> -
>
> Key: SPARK-22908
> URL: https://issues.apache.org/jira/browse/SPARK-22908
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
>Assignee: Jose Torres
> Fix For: 2.3.0, 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-22908) add basic continuous kafka source

2018-01-12 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal reopened SPARK-22908:


Re-opening this ticket as we've temporarily reverted 
https://github.com/apache/spark/pull/20096 due to build timeouts.

> add basic continuous kafka source
> -
>
> Key: SPARK-22908
> URL: https://issues.apache.org/jira/browse/SPARK-22908
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
>Assignee: Jose Torres
> Fix For: 2.3.0, 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23055) KafkaContinuousSourceSuite Kafka column types test failing

2018-01-12 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-23055:
---
Target Version/s: 2.3.0

> KafkaContinuousSourceSuite Kafka column types test failing
> --
>
> Key: SPARK-23055
> URL: https://issues.apache.org/jira/browse/SPARK-23055
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Critical
>
> KafkaContinuousSourceSuite Kafka column types test fails 
> (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85998/testReport/junit/org.apache.spark.sql.kafka010/KafkaContinuousSourceSuite/Kafka_column_types/).
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/4009/
> {code}
> [info] KafkaContinuousSourceSuite:
> [info] - cannot stop Kafka stream (1 second, 279 milliseconds)
> [info] - assign from latest offsets (failOnDataLoss: true) (9 seconds, 202 
> milliseconds)
> [info] - assign from earliest offsets (failOnDataLoss: true) (8 seconds, 108 
> milliseconds)
> [info] - assign from specific offsets (failOnDataLoss: true) (4 seconds, 102 
> milliseconds)
> [info] - subscribing topic by name from latest offsets (failOnDataLoss: true) 
> (12 seconds, 125 milliseconds)
> [info] - subscribing topic by name from earliest offsets (failOnDataLoss: 
> true) (12 seconds, 69 milliseconds)
> [info] - subscribing topic by name from specific offsets (failOnDataLoss: 
> true) (5 seconds, 935 milliseconds)
> [info] - subscribing topic by pattern from latest offsets (failOnDataLoss: 
> true) (13 seconds, 70 milliseconds)
> [info] - subscribing topic by pattern from earliest offsets (failOnDataLoss: 
> true) (13 seconds, 122 milliseconds)
> [info] - subscribing topic by pattern from specific offsets (failOnDataLoss: 
> true) (7 seconds, 877 milliseconds)
> [info] - assign from latest offsets (failOnDataLoss: false) (12 seconds, 201 
> milliseconds)
> [info] - assign from earliest offsets (failOnDataLoss: false) (12 seconds, 82 
> milliseconds)
> [info] - assign from specific offsets (failOnDataLoss: false) (8 seconds, 530 
> milliseconds)
> [info] - subscribing topic by name from latest offsets (failOnDataLoss: 
> false) (18 seconds, 339 milliseconds)
> [info] - subscribing topic by name from earliest offsets (failOnDataLoss: 
> false) (17 seconds, 397 milliseconds)
> [info] - subscribing topic by name from specific offsets (failOnDataLoss: 
> false) (8 seconds, 926 milliseconds)
> [info] - subscribing topic by pattern from latest offsets (failOnDataLoss: 
> false) (20 seconds, 198 milliseconds)
> Build timed out (after 255 minutes). Marking the build as aborted.
> Build was aborted
> Archiving artifacts
> [info] - subscribing topic by pattern from earliest offsets (failOnDataLoss: 
> false) *** FAILED *** (2 hours, 24 minutes, 19 seconds)
> [info]   Error while stopping stream: 
> {code}
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/4010/console
> {code}
> [info] KafkaContinuousSourceSuite:
> [info] - cannot stop Kafka stream (1 second, 238 milliseconds)
> [info] - assign from latest offsets (failOnDataLoss: true) (9 seconds, 516 
> milliseconds)
> [info] - assign from earliest offsets (failOnDataLoss: true) (7 seconds, 961 
> milliseconds)
> [info] - assign from specific offsets (failOnDataLoss: true) (4 seconds, 193 
> milliseconds)
> [info] - subscribing topic by name from latest offsets (failOnDataLoss: true) 
> (11 seconds, 443 milliseconds)
> [info] - subscribing topic by name from earliest offsets (failOnDataLoss: 
> true) (12 seconds, 674 milliseconds)
> [info] - subscribing topic by name from specific offsets (failOnDataLoss: 
> true) (6 seconds, 13 milliseconds)
> [info] - subscribing topic by pattern from latest offsets (failOnDataLoss: 
> true) (13 seconds, 185 milliseconds)
> Build timed out (after 255 minutes). Marking the build as aborted.
> {code}
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/4011/consoleFull
> {code}
> [info] KafkaContinuousSourceSuite:
> [info] - cannot stop Kafka stream (1 second, 551 milliseconds)
> [info] - assign from latest offsets (failOnDataLoss: true) (8 seconds, 560 
> milliseconds)
> [info] - assign from earliest offsets (failOnDataLoss: true) (8 seconds, 40 
> milliseconds)
> [info] - assign from specific offsets (failOnDataLoss: true) (4 seconds, 373 
> milliseconds)
> [info] - subscribing topic by name from latest offsets (failOnDataLoss: true) 
> (12 seconds, 872 milliseconds)
> [info] - subscribing topic by name from earliest offsets (failOnDataLoss: 
> true) (13 seconds, 338 milliseconds)
> [info] - 

[jira] [Updated] (SPARK-23021) AnalysisBarrier should not cut off the explain output for Parsed Logical Plan

2018-01-09 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-23021:
---
Target Version/s: 2.3.0
Priority: Major  (was: Minor)

> AnalysisBarrier should not cut off the explain output for Parsed Logical Plan
> -
>
> Key: SPARK-23021
> URL: https://issues.apache.org/jira/browse/SPARK-23021
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kris Mok
>
> In PR #20094, a follow-up to SPARK-20392, there were some fixes to the 
> handling of {{AnalysisBarrier}}, but there seem to be more cases that need to 
> be fixed.
> One such case is that right now the Parsed Logical Plan in the explain output 
> is cut off by {{AnalysisBarrier}}, e.g.
> {code:none}
> scala> val df1 = spark.range(1).select('id as 'x, 'id + 1 as 
> 'y).repartition(1).select('x === 'y)
> df1: org.apache.spark.sql.DataFrame = [(x = y): boolean]
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [('x = 'y) AS (x = y)#22]
> +- AnalysisBarrier Repartition 1, true
> == Analyzed Logical Plan ==
> (x = y): boolean
> Project [(x#16L = y#17L) AS (x = y)#22]
> +- Repartition 1, true
>+- Project [id#13L AS x#16L, (id#13L + cast(1 as bigint)) AS y#17L]
>   +- Range (0, 1, step=1, splits=Some(8))
> == Optimized Logical Plan ==
> Project [(x#16L = y#17L) AS (x = y)#22]
> +- Repartition 1, true
>+- Project [id#13L AS x#16L, (id#13L + 1) AS y#17L]
>   +- Range (0, 1, step=1, splits=Some(8))
> == Physical Plan ==
> *Project [(x#16L = y#17L) AS (x = y)#22]
> +- Exchange RoundRobinPartitioning(1)
>+- *Project [id#13L AS x#16L, (id#13L + 1) AS y#17L]
>   +- *Range (0, 1, step=1, splits=8)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23000) Flaky test suite DataSourceWithHiveMetastoreCatalogSuite in Spark 2.3

2018-01-09 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-23000:
---
Target Version/s: 2.3.0

> Flaky test suite DataSourceWithHiveMetastoreCatalogSuite in Spark 2.3
> -
>
> Key: SPARK-23000
> URL: https://issues.apache.org/jira/browse/SPARK-23000
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
> Fix For: 2.3.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-sbt-hadoop-2.6/
> The test suite DataSourceWithHiveMetastoreCatalogSuite on branch-2.3 always 
> fails with Hadoop 2.6.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-23000) Flaky test suite DataSourceWithHiveMetastoreCatalogSuite in Spark 2.3

2018-01-09 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal reopened SPARK-23000:


> Flaky test suite DataSourceWithHiveMetastoreCatalogSuite in Spark 2.3
> -
>
> Key: SPARK-23000
> URL: https://issues.apache.org/jira/browse/SPARK-23000
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.3.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-sbt-hadoop-2.6/
> The test suite DataSourceWithHiveMetastoreCatalogSuite on branch-2.3 always 
> fails with Hadoop 2.6.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23000) Flaky test suite DataSourceWithHiveMetastoreCatalogSuite in Spark 2.3

2018-01-09 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-23000:
---
Priority: Blocker  (was: Major)

> Flaky test suite DataSourceWithHiveMetastoreCatalogSuite in Spark 2.3
> -
>
> Key: SPARK-23000
> URL: https://issues.apache.org/jira/browse/SPARK-23000
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
> Fix For: 2.3.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-sbt-hadoop-2.6/
> The test suite DataSourceWithHiveMetastoreCatalogSuite on branch-2.3 always 
> fails with Hadoop 2.6.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23020) Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher

2018-01-09 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16319492#comment-16319492
 ] 

Sameer Agarwal commented on SPARK-23020:


cc [~vanzin]

> Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
> --
>
> Key: SPARK-23020
> URL: https://issues.apache.org/jira/browse/SPARK-23020
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Priority: Blocker
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23020) Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher

2018-01-09 Thread Sameer Agarwal (JIRA)
Sameer Agarwal created SPARK-23020:
--

 Summary: Flaky Test: 
org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
 Key: SPARK-23020
 URL: https://issues.apache.org/jira/browse/SPARK-23020
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 2.3.0
Reporter: Sameer Agarwal
Priority: Blocker


https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23019) Flaky Test: org.apache.spark.JavaJdbcRDDSuite.testJavaJdbcRDD

2018-01-09 Thread Sameer Agarwal (JIRA)
Sameer Agarwal created SPARK-23019:
--

 Summary: Flaky Test: 
org.apache.spark.JavaJdbcRDDSuite.testJavaJdbcRDD
 Key: SPARK-23019
 URL: https://issues.apache.org/jira/browse/SPARK-23019
 Project: Spark
  Issue Type: Bug
  Components: Java API, Tests
Affects Versions: 2.3.0
Reporter: Sameer Agarwal


{{org.apache.spark.JavaJdbcRDDSuite.testJavaJdbcRDD}} has been failing due to 
multiple spark contexts: 
https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.6/

{code}
Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore 
this error, set spark.driver.allowMultipleContexts = true. The currently 
running SparkContext was created at:
org.apache.spark.SparkContext.(SparkContext.scala:116)
org.apache.spark.launcher.SparkLauncherSuite$InProcessTestApp.main(SparkLauncherSuite.java:182)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
org.apache.spark.launcher.InProcessAppHandle.lambda$start$0(InProcessAppHandle.java:63)
java.lang.Thread.run(Thread.java:748)
{code}
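
A common way to harden JVM-shared tests against this failure mode is to reuse or 
stop contexts explicitly; a minimal sketch (not the actual fix for this ticket):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// reuse a live context instead of constructing a second one in the same JVM
val conf = new SparkConf().setMaster("local[2]").setAppName("jdbc-rdd-test")
val sc = SparkContext.getOrCreate(conf)
try {
  // ... test body ...
} finally {
  sc.stop()  // release the context so later suites can create their own
}
{code}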



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23019) Flaky Test: org.apache.spark.JavaJdbcRDDSuite.testJavaJdbcRDD

2018-01-09 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-23019:
---
Priority: Blocker  (was: Major)

> Flaky Test: org.apache.spark.JavaJdbcRDDSuite.testJavaJdbcRDD
> -
>
> Key: SPARK-23019
> URL: https://issues.apache.org/jira/browse/SPARK-23019
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Tests
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Priority: Blocker
>
> {{org.apache.spark.JavaJdbcRDDSuite.testJavaJdbcRDD}} has been failing due to 
> multiple spark contexts: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.6/
> {code}
> Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore 
> this error, set spark.driver.allowMultipleContexts = true. The currently 
> running SparkContext was created at:
> org.apache.spark.SparkContext.(SparkContext.scala:116)
> org.apache.spark.launcher.SparkLauncherSuite$InProcessTestApp.main(SparkLauncherSuite.java:182)
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> java.lang.reflect.Method.invoke(Method.java:498)
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> java.lang.reflect.Method.invoke(Method.java:498)
> org.apache.spark.launcher.InProcessAppHandle.lambda$start$0(InProcessAppHandle.java:63)
> java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16060) Vectorized Orc reader

2018-01-09 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-16060:
---
Labels: release-notes  (was: )

> Vectorized Orc reader
> -
>
> Key: SPARK-16060
> URL: https://issues.apache.org/jira/browse/SPARK-16060
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Dongjoon Hyun
>  Labels: release-notes
> Fix For: 2.3.0
>
>
> Currently the ORC reader in Spark SQL doesn't support vectorized reading. As 
> Hive's ORC reader already supports vectorization, we should add this support 
> to improve ORC reading performance.
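
A minimal sketch of exercising the reader once this lands (assumes Spark 2.3's 
native ORC configuration keys; the input path is hypothetical):

{code}
// switch to the native ORC implementation and enable its vectorized reader
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")

val df = spark.read.orc("/path/to/orc")  // hypothetical input
df.selectExpr("count(*)").show()
{code}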



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22386) Data Source V2 improvements

2018-01-08 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-22386:
---
Target Version/s: 2.3.0

> Data Source V2 improvements
> ---
>
> Key: SPARK-22386
> URL: https://issues.apache.org/jira/browse/SPARK-22386
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18569) Support R formula arithmetic

2018-01-08 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-18569:
---
Target Version/s: 2.4.0  (was: 2.3.0)

> Support R formula arithmetic 
> -
>
> Key: SPARK-18569
> URL: https://issues.apache.org/jira/browse/SPARK-18569
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Felix Cheung
>
> I think we should support arithmetic, which would make it a lot more 
> convenient to build models. Something like
> {code}
>   log(y) ~ a + log(x)
> {code}
> And to avoid resolution confusion, we should support the I() operator:
> {code}
> I(X*Z)  as is: include a new variable consisting of these variables multiplied
> {code}
> Such that this works:
> {code}
> y ~ a + I(b+c)
> {code}
> the term b+c is to be interpreted as the sum of b and c.
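
For comparison, a sketch of what ML's {{RFormula}} accepts today versus the forms 
this ticket proposes (the commented-out formulas are the proposed, currently 
unsupported part; {{trainingDf}} is a hypothetical DataFrame):

{code}
import org.apache.spark.ml.feature.RFormula

// supported today: simple terms and interactions
val rf = new RFormula().setFormula("y ~ a + b + a:b")

// proposed by this ticket (would fail today):
// new RFormula().setFormula("log(y) ~ a + log(x)")
// new RFormula().setFormula("y ~ a + I(b + c)")

val model = rf.fit(trainingDf)  // trainingDf has columns y, a, b
{code}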



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16026) Cost-based Optimizer Framework

2018-01-08 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal resolved SPARK-16026.

   Resolution: Fixed
 Assignee: Zhenhua Wang
Fix Version/s: 2.3.0

Resolving this ticket as all the individual subtasks have been resolved. Thanks 
for the great work everyone!

> Cost-based Optimizer Framework
> --
>
> Key: SPARK-16026
> URL: https://issues.apache.org/jira/browse/SPARK-16026
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Zhenhua Wang
>  Labels: releasenotes
> Fix For: 2.3.0
>
> Attachments: Spark_CBO_Design_Spec.pdf
>
>
> This is an umbrella ticket to implement a cost-based optimizer framework 
> beyond broadcast join selection. This framework can be used to implement some 
> useful optimizations such as join reordering.
> The design should discuss how to break the work down into multiple, smaller 
> logical units. For example, changes to statistics class, system catalog, cost 
> estimation/propagation in expressions, cost estimation/propagation in 
> operators can be done in decoupled pull requests.
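
For readers catching up on the feature, a minimal sketch of turning the framework 
on end-to-end (the table and column names are hypothetical):

{code}
// enable cost-based optimization and join reordering
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")

// collect the table- and column-level statistics the optimizer relies on
spark.sql("ANALYZE TABLE orders COMPUTE STATISTICS FOR COLUMNS order_id, customer_id")
spark.sql("ANALYZE TABLE customers COMPUTE STATISTICS FOR COLUMNS customer_id")

// row-count and size estimates now inform join ordering and selection
spark.sql(
  "SELECT * FROM orders o JOIN customers c ON o.customer_id = c.customer_id"
).explain(true)
{code}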



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4502) Spark SQL reads unneccesary nested fields from Parquet

2018-01-08 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-4502:
--
Target Version/s: 2.4.0  (was: 2.3.0)

> Spark SQL reads unneccesary nested fields from Parquet
> --
>
> Key: SPARK-4502
> URL: https://issues.apache.org/jira/browse/SPARK-4502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Liwen Sun
>Priority: Critical
>
> When reading a field of a nested column from Parquet, SparkSQL reads and 
> assembles all the fields of that nested column. This is unnecessary, as 
> Parquet supports fine-grained field reads out of a nested column. This may 
> degrade performance significantly when a nested column has many fields. 
> For example, I loaded json tweets data into SparkSQL and ran the following 
> query:
> {{SELECT User.contributors_enabled from Tweets;}}
> User is a nested structure that has 38 primitive fields (for Tweets schema, 
> see: https://dev.twitter.com/overview/api/tweets), here is the log message:
> {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 
> cell/ms}}
> For comparison, I also ran:
> {{SELECT User FROM Tweets;}}
> And here is the log message:
> {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}
> So both queries load 38 columns from Parquet, while the first query only 
> needs 1 column. I also measured the bytes read within Parquet. In these two 
> cases, the same number of bytes (99365194 bytes) were read. 
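
A sketch reproducing the observation (paths are hypothetical; {{User}} is a 
nested struct column with 38 primitive fields):

{code}
// hypothetical: tweets stored as Parquet with a nested User struct
val tweets = spark.read.parquet("/data/tweets.parquet")
tweets.createOrReplaceTempView("Tweets")

// both queries currently read and assemble all 38 User fields from Parquet,
// even though the first one needs only a single field
spark.sql("SELECT User.contributors_enabled FROM Tweets").count()
spark.sql("SELECT User FROM Tweets").count()
{code}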



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4502) Spark SQL reads unneccesary nested fields from Parquet

2018-01-08 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16317062#comment-16317062
 ] 

Sameer Agarwal commented on SPARK-4502:
---

+1 This is an extremely useful feature and we should definitely prioritize its 
review.

However, given the 2.3.0 timeline, this will unfortunately not make the 
release. Therefore I'm re-targeting it for 2.4.0.

> Spark SQL reads unneccesary nested fields from Parquet
> --
>
> Key: SPARK-4502
> URL: https://issues.apache.org/jira/browse/SPARK-4502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Liwen Sun
>Priority: Critical
>
> When reading a field of a nested column from Parquet, SparkSQL reads and 
> assembles all the fields of that nested column. This is unnecessary, as 
> Parquet supports fine-grained field reads out of a nested column. This may 
> degrade performance significantly when a nested column has many fields. 
> For example, I loaded json tweets data into SparkSQL and ran the following 
> query:
> {{SELECT User.contributors_enabled from Tweets;}}
> User is a nested structure that has 38 primitive fields (for Tweets schema, 
> see: https://dev.twitter.com/overview/api/tweets), here is the log message:
> {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 
> cell/ms}}
> For comparison, I also ran:
> {{SELECT User FROM Tweets;}}
> And here is the log message:
> {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}
> So both queries load 38 columns from Parquet, while the first query only 
> needs 1 column. I also measured the bytes read within Parquet. In these two 
> cases, the same number of bytes (99365194 bytes) were read. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9576) DataFrame API improvement umbrella ticket (in Spark 2.x)

2018-01-08 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-9576:
--
Target Version/s: 2.4.0  (was: 2.3.0)

> DataFrame API improvement umbrella ticket (in Spark 2.x)
> 
>
> Key: SPARK-9576
> URL: https://issues.apache.org/jira/browse/SPARK-9576
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7768) Make user-defined type (UDT) API public

2018-01-08 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-7768:
--
Target Version/s: 2.4.0  (was: 2.3.0)

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12978) Skip unnecessary final group-by when input data already clustered with group-by keys

2018-01-08 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-12978:
---
Target Version/s: 2.4.0  (was: 2.3.0)

> Skip unnecessary final group-by when input data already clustered with 
> group-by keys
> 
>
> Key: SPARK-12978
> URL: https://issues.apache.org/jira/browse/SPARK-12978
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>
> This ticket targets the optimization to skip an unnecessary final group-by 
> operation, shown below:
> Without opt.:
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
>  output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)],
>  output=[col0#159,sum#200,sum#201,count#202L])
>+- TungstenExchange hashpartitioning(col0#159,200), None
>   +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], 
> InMemoryRelation [col0#159,col1#160,col2#161], true, 1, 
> StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
> {code}
> With opt.:
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
>  output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenExchange hashpartitioning(col0#159,200), None
>   +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation 
> [col0#159,col1#160,col2#161], true, 1, StorageLevel(true, true, false, 
> true, 1), ConvertToUnsafe, None
> {code}
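
A sketch of a query shaped like the plans above, where the input is already 
clustered by the group-by key (table and column names are hypothetical):

{code}
import org.apache.spark.sql.functions.{avg, sum}
import spark.implicits._

// repartitioning by the group-by key clusters the data ahead of the aggregate
val clustered = spark.table("t").repartition($"col0").cache()

// with the optimization, the mode=Partial pre-aggregation is skipped and a
// single one-phase aggregate runs after the exchange
clustered.groupBy($"col0").agg(sum($"col1"), avg($"col2")).collect()
{code}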



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13184) Support minPartitions parameter for JSON and CSV datasources as options

2018-01-08 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-13184:
---
Target Version/s: 2.4.0  (was: 2.3.0)

> Support minPartitions parameter for JSON and CSV datasources as options
> ---
>
> Key: SPARK-13184
> URL: https://issues.apache.org/jira/browse/SPARK-13184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> After looking through the pull requests below from the Spark CSV datasource,
> https://github.com/databricks/spark-csv/pull/256
> https://github.com/databricks/spark-csv/issues/141
> https://github.com/databricks/spark-csv/pull/186
> it looks like Spark might need to be able to set {{minPartitions}}.
> {{repartition()}} or {{coalesce()}} can be alternatives, but they typically 
> need to shuffle the data.
> Although I am still not sure whether this is needed, I will open this ticket 
> just for discussion.
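
A sketch of the alternatives available today (the {{minPartitions}} option itself 
is only proposed here; the path is hypothetical):

{code}
val df = spark.read.option("header", "true").csv("/data/input.csv")
df.rdd.getNumPartitions  // currently determined by file splits and defaults

// workarounds: repartition() forces a shuffle to raise parallelism, while
// coalesce() only merges partitions downward without a shuffle
val moreParallel = df.repartition(64)
val fewerParts   = df.coalesce(8)
{code}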



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13682) Finalize the public API for FileFormat

2018-01-08 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-13682:
---
Target Version/s: 2.4.0  (was: 2.3.0)

> Finalize the public API for FileFormat
> --
>
> Key: SPARK-13682
> URL: https://issues.apache.org/jira/browse/SPARK-13682
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Michael Armbrust
>
> The current file format interface needs to be cleaned up before it's 
> acceptable for public consumption:
>  - Have a version that takes Row and does a conversion, hide the internal API.
>  - Remove bucketing
>  - Remove RDD and the broadcastedConf
>  - Remove SQLContext (maybe include SparkSession?)
>  - Pass a better conf object



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14098) Generate Java code to build CachedColumnarBatch and get values from CachedColumnarBatch when DataFrame.cache() is called

2018-01-08 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-14098:
---
Target Version/s: 2.4.0  (was: 2.3.0)

> Generate Java code to build CachedColumnarBatch and get values from 
> CachedColumnarBatch when DataFrame.cache() is called
> 
>
> Key: SPARK-14098
> URL: https://issues.apache.org/jira/browse/SPARK-14098
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>
> [Here|https://docs.google.com/document/d/1-2BnW5ibuHIeQzmHEGIGkEcuMUCTk87pmPis2DKRg-Q/edit?usp=sharing]
>  is a design document for this change (***TODO: Update the document***).
> This JIRA implements a new in-memory cache feature used by DataFrame.cache 
> and Dataset.cache. The following is the basic design, based on discussions 
> with Sameer, Weichen, Xiao, Herman, and Nong.
> * Use ColumnarBatch with ColumnVector, which are common data representations 
> for columnar storage
> * Use multiple compression schemes (such as RLE, int-delta, and so on) for 
> each ColumnVector in ColumnarBatch, depending on its data type
> * Generate code that is simple and specialized for each in-memory cache to 
> build an in-memory cache
> * Generate code that directly reads data from ColumnVector for the in-memory 
> cache by whole-stage codegen.
> * Enhance ColumnVector to keep UnsafeArrayData
> * Use primitive-type array for primitive uncompressed data type in 
> ColumnVector
> * Use byte[] for UnsafeArrayData and compressed data
> Based on this design, this JIRA generates two kinds of Java code for 
> DataFrame.cache()/Dataset.cache()
> * Generate Java code to build CachedColumnarBatch, which keeps data in 
> ColumnarBatch
> * Generate Java code to get a value of each column from ColumnarBatch
> ** (a) Get a value directly from ColumnarBatch in code generated by 
> whole-stage codegen (primary path)
> ** (b) Get a value through an iterator if whole-stage codegen is disabled 
> (e.g. the number of columns is more than 100), as a backup path



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14543) SQL/Hive insertInto has unexpected results

2018-01-08 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-14543:
---
Target Version/s: 2.4.0  (was: 2.3.0)

> SQL/Hive insertInto has unexpected results
> --
>
> Key: SPARK-14543
> URL: https://issues.apache.org/jira/browse/SPARK-14543
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>
> *Updated description*
> There should be an option to match input data to output columns by name. The 
> API allows operations on tables, which hide the column resolution problem. 
> It's easy to copy from one table to another without listing the columns, and 
> in the API it is common to work with columns by name rather than by position. 
> I think the API should add a way to match columns by name, which is closer to 
> what users expect. I propose adding something like this:
> {code}
> CREATE TABLE src (id: bigint, count: int, total: bigint)
> CREATE TABLE dst (id: bigint, total: bigint, count: int)
> sqlContext.table("src").write.byName.insertInto("dst")
> {code}
> *Original description*
> The Hive write path adds a pre-insertion cast (projection) to reconcile 
> incoming data columns with the outgoing table schema. Columns are matched by 
> position and casts are inserted to reconcile the two column schemas.
> When columns aren't correctly aligned, this causes unexpected results. I ran 
> into this by not using a correct {{partitionBy}} call (addressed by 
> SPARK-14459), which caused an error message that an int could not be cast to 
> an array. However, if the columns are vaguely compatible, for example string 
> and float, then no error or warning is produced and data is written to the 
> wrong columns using unexpected casts (string -> bigint -> float).
> A real-world use case that will hit this is when a table definition changes 
> by adding a column in the middle of a table. Spark SQL statements that copy 
> from that table to a destination table will then map the columns differently 
> but insert casts that mask the problem. The last column's data will be 
> dropped without a reliable warning for the user.
> This highlights a few problems:
> * Too many or too few incoming data columns should cause an AnalysisException 
> to be thrown
> * Only "safe" casts should be inserted automatically, like int -> long, using 
> UpCast
> * Pre-insertion casts currently ignore extra columns by using zip
> * The pre-insertion cast logic differs between Hive's MetastoreRelation and 
> LogicalRelation
> Also, I think there should be an option to match input data to output columns 
> by name. The API allows operations on tables, which hide the column 
> resolution problem. It's easy to copy from one table to another without 
> listing the columns, and in the API it is common to work with columns by name 
> rather than by position. I think the API should add a way to match columns by 
> name, which is closer to what users expect. I propose adding something like 
> this:
> {code}
> CREATE TABLE src (id: bigint, count: int, total: bigint)
> CREATE TABLE dst (id: bigint, total: bigint, count: int)
> sqlContext.table("src").write.byName.insertInto("dst")
> {code}
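
A sketch of the positional-matching pitfall with today's API (the tables are 
hypothetical, and the proposed {{byName}} does not exist yet):

{code}
spark.sql("CREATE TABLE src (id bigint, count int, total bigint)")
spark.sql("CREATE TABLE dst (id bigint, total bigint, count int)")

// insertInto matches columns by position, not by name: src.count lands in
// dst.total and src.total lands in dst.count, with silent casts inserted
spark.table("src").write.insertInto("dst")
{code}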



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15420) Repartition and sort before Parquet writes

2018-01-08 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-15420:
---
Target Version/s: 2.4.0  (was: 2.3.0)

> Repartition and sort before Parquet writes
> --
>
> Key: SPARK-15420
> URL: https://issues.apache.org/jira/browse/SPARK-15420
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Ryan Blue
>
> Parquet requires buffering data in memory before writing a group of rows 
> organized by column. This causes significant memory pressure when writing 
> partitioned output because each open file must buffer rows.
> Currently, Spark will sort data and spill if necessary in the 
> {{WriterContainer}} to avoid keeping many files open at once. But, this isn't 
> a full solution for a few reasons:
> * The final sort is always performed, even if incoming data is already sorted 
> correctly. For example, a global sort will cause two sorts to happen, even if 
> the global sort correctly prepares the data.
> * To prevent a large number of small output files, users must manually 
> add a repartition step. That step is also ignored by the sort within the 
> writer.
> * Hive does not currently support {{DataFrameWriter#sortBy}}
> The sort in {{WriterContainer}} makes sense to prevent problems, but should 
> detect if the incoming data is already sorted. The {{DataFrameWriter}} should 
> also expose the ability to repartition data before the write stage, and the 
> query planner should expose an option to automatically insert repartition 
> operations.
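
A sketch of the manual preparation users do today to keep one open file per 
writer (column and path names are hypothetical):

{code}
import spark.implicits._

// cluster rows by the output partition key, then order within partitions so
// each task writes its partition directories sequentially
df.repartition($"date")
  .sortWithinPartitions($"date")
  .write
  .partitionBy("date")
  .parquet("/path/out")
{code}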



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15380) Generate code that stores a float/double value in each column from ColumnarBatch when DataFrame.cache() is used

2018-01-08 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-15380:
---
Target Version/s: 2.4.0  (was: 2.3.0)

> Generate code that stores a float/double value in each column from 
> ColumnarBatch when DataFrame.cache() is used
> ---
>
> Key: SPARK-15380
> URL: https://issues.apache.org/jira/browse/SPARK-15380
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>
> When DataFrame.cache() is called, data will be stored as column-oriented 
> storage in CachedBatch. The current Catalyst generates a Java program that 
> stores a computed value in an InternalRow, and the value is then stored into 
> CachedBatch, even if the data is read from a ColumnarBatch by the Parquet 
> reader. 
> This JIRA generates Java code to store a value into a ColumnarBatch and to 
> store data from the ColumnarBatch into the CachedBatch. It handles only 
> float and double types.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15420) Repartition and sort before Parquet writes

2018-01-08 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16317050#comment-16317050
 ] 

Sameer Agarwal commented on SPARK-15420:


re-targeting this for 2.4.0

> Repartition and sort before Parquet writes
> --
>
> Key: SPARK-15420
> URL: https://issues.apache.org/jira/browse/SPARK-15420
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Ryan Blue
>
> Parquet requires buffering data in memory before writing a group of rows 
> organized by column. This causes significant memory pressure when writing 
> partitioned output because each open file must buffer rows.
> Currently, Spark will sort data and spill if necessary in the 
> {{WriterContainer}} to avoid keeping many files open at once. But, this isn't 
> a full solution for a few reasons:
> * The final sort is always performed, even if incoming data is already sorted 
> correctly. For example, a global sort will cause two sorts to happen, even if 
> the global sort correctly prepares the data.
> * To prevent a large number of small output files, users must manually 
> add a repartition step. That step is also ignored by the sort within the 
> writer.
> * Hive does not currently support {{DataFrameWriter#sortBy}}
> The sort in {{WriterContainer}} makes sense to prevent problems, but should 
> detect if the incoming data is already sorted. The {{DataFrameWriter}} should 
> also expose the ability to repartition data before the write stage, and the 
> query planner should expose an option to automatically insert repartition 
> operations.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15117) Generate code that get a value in each compressed column from CachedBatch when DataFrame.cache() is called

2018-01-08 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-15117:
---
Target Version/s: 2.4.0  (was: 2.3.0)

> Generate code that get a value in each compressed column from CachedBatch 
> when DataFrame.cache() is called
> --
>
> Key: SPARK-15117
> URL: https://issues.apache.org/jira/browse/SPARK-15117
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>
> Once SPARK-14098 is merged, we will migrate a feature into this JIRA entry.
> When DataFrame.cache() is called, data is stored as column-oriented storage 
> in CachedBatch. The current Catalyst generates a Java program to get the 
> value of a column from an InternalRow that is translated from CachedBatch. 
> This issue generates Java code to get the value of a column directly from 
> CachedBatch. This JIRA entry supports the other primitive types 
> (boolean/byte/short/int/long), whose columns may be compressed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15690) Fast single-node (single-process) in-memory shuffle

2018-01-08 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-15690:
---
Target Version/s: 2.4.0  (was: 2.3.0)

> Fast single-node (single-process) in-memory shuffle
> ---
>
> Key: SPARK-15690
> URL: https://issues.apache.org/jira/browse/SPARK-15690
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, SQL
>Reporter: Reynold Xin
>
> Spark's current shuffle implementation sorts all intermediate data by 
> partition id, and then writes the data to disk. This is not a big bottleneck 
> because the network throughput on commodity clusters tends to be low. However, 
> an increasing number of Spark users are using the system to process data on a 
> single node. When operating on a single node against intermediate data that 
> fits in memory, the existing shuffle code path can become a big bottleneck.
> The goal of this ticket is to change Spark so it can use an in-memory radix 
> sort to do data shuffling on a single node, and still gracefully fall back to 
> disk if the data size does not fit in memory. Given that the number of 
> partitions is usually small (say, less than 256), it would require only a 
> single pass to do the radix sort with pretty decent CPU efficiency.
> Note that there have been many in-memory shuffle attempts in the past. This 
> ticket has a smaller scope (single-process), and aims to actually 
> productionize this code.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15691) Refactor and improve Hive support

2018-01-08 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-15691:
---
Target Version/s: 2.4.0  (was: 2.3.0)

> Refactor and improve Hive support
> -
>
> Key: SPARK-15691
> URL: https://issues.apache.org/jira/browse/SPARK-15691
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> Hive support is important to Spark SQL, as many Spark users use it to read 
> from Hive. The current architecture is very difficult to maintain, and this 
> ticket tracks progress towards getting us to a sane state.
> A number of things we want to accomplish are:
> - Move the Hive specific catalog logic into HiveExternalCatalog.
>   -- Remove HiveSessionCatalog. All Hive-related stuff should go into 
> HiveExternalCatalog. This would require moving caching either into 
> HiveExternalCatalog, or just into SessionCatalog.
>   -- Move using properties to store data source options into 
> HiveExternalCatalog (So, for a CatalogTable returned by HiveExternalCatalog, 
> we do not need to distinguish tables stored in hive formats and data source 
> tables).
>   -- Potentially more.
> - Remove Hive's specific ScriptTransform implementation and make it more 
> general so we can put it in sql/core.
> - Implement HiveTableScan (and write path) as a data source, so we don't need 
> a special planner rule for HiveTableScan.
> - Remove HiveSharedState and HiveSessionState.
> One thing that is still unclear to me is how to work with Hive UDF support. 
> We might still need a special planner rule there.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15693) Write schema definition out for file-based data sources to avoid schema inference

2018-01-08 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-15693:
---
Target Version/s: 2.4.0  (was: 2.3.0)

> Write schema definition out for file-based data sources to avoid schema 
> inference
> -
>
> Key: SPARK-15693
> URL: https://issues.apache.org/jira/browse/SPARK-15693
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> Spark supports reading a variety of data formats, many of which don't have a 
> self-describing schema. For these file formats, Spark can often infer the 
> schema by going through all the data. However, schema inference is expensive 
> and does not always infer the intended schema (for example, with JSON data 
> Spark always infers integer types as long, rather than int).
> It would be great if Spark could write the schema definition out for 
> file-based formats; when reading the data in, the schema could then be 
> "inferred" directly by reading the schema definition file without going 
> through full schema inference. If the file does not exist, the good old 
> schema inference should be performed.
> This ticket certainly merits a design doc that should discuss the spec for 
> the schema definition, as well as all the corner cases that this feature 
> needs to handle (e.g. schema merging, schema evolution, partitioning). It 
> would be great if the schema definition used a human-readable format (e.g. 
> JSON).
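
Until such a feature exists, the round trip can be scripted by hand; a minimal 
sketch (paths are hypothetical):

{code}
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.types.{DataType, StructType}

// infer the schema once and persist it as human-readable JSON
val inferred = spark.read.json("/data/events").schema
Files.write(Paths.get("/data/events.schema.json"), inferred.json.getBytes("UTF-8"))

// later reads skip inference entirely by supplying the saved schema
val text   = new String(Files.readAllBytes(Paths.get("/data/events.schema.json")), "UTF-8")
val schema = DataType.fromJson(text).asInstanceOf[StructType]
val df     = spark.read.schema(schema).json("/data/events")
{code}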



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15694) Implement ScriptTransformation in sql/core

2018-01-08 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-15694:
---
Target Version/s: 2.4.0  (was: 2.3.0)

> Implement ScriptTransformation in sql/core
> --
>
> Key: SPARK-15694
> URL: https://issues.apache.org/jira/browse/SPARK-15694
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> ScriptTransformation currently relies on Hive internals. It'd be great if we 
> can implement a native ScriptTransformation in sql/core module to remove the 
> extra Hive dependency here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16011) SQL metrics include duplicated attempts

2018-01-08 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-16011:
---
Target Version/s: 2.4.0  (was: 2.3.0)

> SQL metrics include duplicated attempts
> ---
>
> Key: SPARK-16011
> URL: https://issues.apache.org/jira/browse/SPARK-16011
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Davies Liu
>Assignee: Wenchen Fan
>
> When I ran a simple scan-and-aggregate query, the number of rows reported for 
> the scan could differ from run to run. The actual scanned result is correct, 
> but the SQL metrics are wrong (they should not include duplicated attempts). 
> This is a regression since 1.6.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15867) Use bucket files for TABLESAMPLE BUCKET

2018-01-08 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-15867:
---
Target Version/s: 2.4.0  (was: 2.3.0)

> Use bucket files for TABLESAMPLE BUCKET
> ---
>
> Key: SPARK-15867
> URL: https://issues.apache.org/jira/browse/SPARK-15867
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Andrew Or
>
> {code}
> SELECT * FROM boxes TABLESAMPLE (BUCKET 3 OUT OF 16)
> {code}
> In Hive, this would select the 3rd bucket out of every 16 buckets in the 
> table. E.g., if the table was clustered into 32 buckets, then this would 
> sample the 3rd and the 19th bucket. (See 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling)
> In Spark, however, we simply sample 3/16 of the number of input rows.
> We should either not support it in Spark or implement it in a way that's 
> consistent with Hive.
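
For reference, Hive's bucket selection is simple modular arithmetic; a sketch of 
which 0-based bucket ids {{BUCKET x OUT OF y}} picks from a table clustered into 
n buckets:

{code}
// bucket ids x-1, x-1+y, x-1+2y, ... below n
def sampledBuckets(x: Int, y: Int, n: Int): Seq[Int] = (x - 1) until n by y

sampledBuckets(3, 16, 32)  // Seq(2, 18): the 3rd and the 19th of 32 buckets
{code}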



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16196) Optimize in-memory scan performance using ColumnarBatches

2018-01-08 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-16196:
---
Target Version/s: 2.4.0  (was: 2.3.0)

> Optimize in-memory scan performance using ColumnarBatches
> -
>
> Key: SPARK-16196
> URL: https://issues.apache.org/jira/browse/SPARK-16196
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> A simple benchmark such as the following reveals inefficiencies in the 
> existing in-memory scan implementation:
> {code}
> spark.range(N)
>   .selectExpr("id", "floor(rand() * 1) as k")
>   .createOrReplaceTempView("test")
> val ds = spark.sql("select count(k), count(id) from test").cache()
> ds.collect()
> ds.collect()
> {code}
> There are many reasons why caching is slow. The biggest is that compression 
> takes a long time. The second is that there are a lot of virtual function 
> calls in this hot code path since the rows are processed using iterators. 
> Further, the rows are converted to and from ByteBuffers, which are slow to 
> read in general.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16217) Support SELECT INTO statement

2018-01-08 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-16217:
---
Target Version/s: 2.4.0  (was: 2.3.0)

> Support SELECT INTO statement
> -
>
> Key: SPARK-16217
> URL: https://issues.apache.org/jira/browse/SPARK-16217
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: GuangFancui(ISCAS)
>
> The *SELECT INTO* statement selects data from one table and inserts it into a 
> new table as follows.
> {code:sql}
> SELECT column_name(s)
> INTO newtable
> FROM table1;
> {code}
> This statement is commonly used in SQL but not currently supported in 
> Spark SQL.
> We investigated Catalyst and found that this statement can be implemented 
> by improving the grammar and reusing the logical plan of *CREATE TABLE AS 
> SELECT* as follows.
> # Improve grammar: Add _intoClause_ to _SELECT ... FROM_ in 
> _querySpecification_ grammar in SqlBase.g4 file.
> !https://raw.githubusercontent.com/wuxianxingkong/storage/master/selectinto_g4_v2.png!
> For example
>  {code:sql}
> SELECT * 
> INTO NEW_TABLE 
> FROM OLD_TABLE
> {code}
> Then the grammar tree will be: 
> !https://raw.githubusercontent.com/wuxianxingkong/storage/master/selectinto_tree_v2.png!
> Furthermore, we can discuss whether it's necessary to add _intoClause_ to 
> _TRANSFORM_ in _querySpecification_ as well.
> # Identify _SELECT INTO_ in the parser: Modify the _visitSingleInsertQuery_ 
> function. Extract the _IntoClauseContext_ with the _existIntoClause_ function. 
> The _IntoClauseContext_ is then passed as an argument to the _withSelectInto_ 
> function. (_intoClause_ and _queryOrganization_ are not at the same level, so 
> we need to extract the _IntoClauseContext_ when visiting _singleInsertQuery_.)
> # Conversion in the parser: Convert the current logical plan to _CTAS_ 
> (strictly speaking, to a child of CTAS) using the _withSelectInto_ function. 
> *Hive support* must be enabled since _CreateHiveTableAsSelectCommand_ relies 
> on it.
> The _withSelectInto_ function copies code from _visitCreateTable_ to do the 
> conversion, so it requires further discussion and optimization.
> The implementation is based on the following _assumptions_:
> # _intoClause_ must appear together with _fromClause_.{code:sql}(intoClause? 
> fromClause)?{code}This structure ensures that the modification won't affect 
> the existing _multiInsertQuery_.
> # A _SELECT INTO_ statement will be translated to the following tree structure:
> !https://raw.githubusercontent.com/wuxianxingkong/storage/master/hierarchy.png!
> As shown, if there is an _intoClause_, the actual subclass of _queryTerm_ is 
> _queryTermDefault_, and the actual subclass of _queryPrimary_ is 
> _queryPrimaryDefault_. We use the _existIntoClause_ function to match the 
> designated subclasses. Only when all conditions are satisfied does this 
> function return the _IntoClauseContext_; otherwise it returns null.
> We’ve implemented and tested the above approach. Please refer to PR: 
> https://github.com/apache/spark/pull/14191
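
For comparison, a sketch of the CTAS form Spark already supports, which the 
proposed {{SELECT INTO}} would effectively desugar to (table names are 
hypothetical):

{code}
// today: CREATE TABLE AS SELECT
spark.sql("CREATE TABLE NEW_TABLE AS SELECT * FROM OLD_TABLE")

// proposed surface syntax, equivalent to the CTAS above:
// spark.sql("SELECT * INTO NEW_TABLE FROM OLD_TABLE")
{code}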



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


