[jira] [Created] (SPARK-25187) Revisit the life cycle of ReadSupport instances.

2018-08-21 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-25187:
---

 Summary: Revisit the life cycle of ReadSupport instances.
 Key: SPARK-25187
 URL: https://issues.apache.org/jira/browse/SPARK-25187
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Wenchen Fan


Currently the life cycle of a ReadSupport instance is bound to the batch/stream query. This fits 
streaming very well but may not be ideal for batch sources. If we decide to change the life 
cycle, we can also consider letting {{ReadSupport.newScanConfigBuilder}} take 
{{DataSourceOptions}} as a parameter.
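
For illustration only, a sketch of what the alternative signature could look like; the builder type below is a placeholder, not the real interface:

{code:scala}
// Hypothetical sketch: pass the options per scan, so a longer-lived ReadSupport
// instance does not have to capture them at construction time.
import org.apache.spark.sql.sources.v2.DataSourceOptions

trait ScanConfigBuilder {          // placeholder for the real builder type
  def build(): AnyRef
}

trait ReadSupport {
  def newScanConfigBuilder(options: DataSourceOptions): ScanConfigBuilder
}
{code}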






[jira] [Created] (SPARK-25186) Stabilize Data Source V2 API

2018-08-21 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-25186:
---

 Summary: Stabilize Data Source V2 API 
 Key: SPARK-25186
 URL: https://issues.apache.org/jira/browse/SPARK-25186
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan









[jira] [Commented] (SPARK-25132) Case-insensitive field resolution when reading from Parquet/ORC

2018-08-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588414#comment-16588414
 ] 

Apache Spark commented on SPARK-25132:
--

User 'seancxmao' has created a pull request for this issue:
https://github.com/apache/spark/pull/22183

> Case-insensitive field resolution when reading from Parquet/ORC
> ---
>
> Key: SPARK-25132
> URL: https://issues.apache.org/jira/browse/SPARK-25132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Assignee: Chenxiao Mao
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
> schema are in different letter cases, regardless of whether 
> spark.sql.caseSensitive is set to true or false.
> Here is a simple example to reproduce this issue:
> scala> spark.range(5).toDF.write.mode("overwrite").saveAsTable("t1")
> spark-sql> show create table t1;
> CREATE TABLE `t1` (`id` BIGINT)
> USING parquet
> OPTIONS (
>  `serialization.format` '1'
> )
> spark-sql> CREATE TABLE `t2` (`ID` BIGINT)
>  > USING parquet
>  > LOCATION 'hdfs://localhost/user/hive/warehouse/t1';
> spark-sql> select * from t1;
> 0
> 1
> 2
> 3
> 4
> spark-sql> select * from t2;
> NULL
> NULL
> NULL
> NULL
> NULL
>  
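
For reference, the kind of resolution logic such a fix needs, as a standalone sketch (hypothetical helper, not the actual reader code): honor spark.sql.caseSensitive when matching a requested column against the physical file schema, and fail on ambiguity instead of silently returning NULL.

{code:scala}
// Standalone sketch (hypothetical helper): resolve a requested column name against
// the file schema, honoring the configured case sensitivity.
def resolveFieldName(
    requested: String,
    fileFieldNames: Seq[String],
    caseSensitive: Boolean): Option[String] = {
  if (caseSensitive) {
    fileFieldNames.find(_ == requested)
  } else {
    val matched = fileFieldNames.filter(_.equalsIgnoreCase(requested))
    if (matched.length > 1) {
      // ambiguous in case-insensitive mode: fail loudly instead of returning NULLs
      throw new RuntimeException(
        s"Found duplicate field(s) '$requested' in case-insensitive mode: ${matched.mkString(", ")}")
    }
    matched.headOption
  }
}
{code}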






[jira] [Resolved] (SPARK-25155) Streaming from storage doesn't work when no directories exists

2018-08-21 Thread Gil Vernik (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gil Vernik resolved SPARK-25155.

Resolution: Cannot Reproduce

> Streaming from storage doesn't work when no directories exists
> --
>
> Key: SPARK-25155
> URL: https://issues.apache.org/jira/browse/SPARK-25155
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Gil Vernik
>Priority: Minor
>
> I have an issue related to the `findNewFiles` method of 
> `org.apache.spark.streaming.dstream.FileInputDStream`.
> Streaming for the given path is supposed to pick up new files only (based on 
> the previous run's timestamp). However, the code in Spark first obtains the 
> directories, then finds new files within each directory. Here is the 
> relevant code:
>   val directoryFilter = new PathFilter {
>     override def accept(path: Path): Boolean = fs.getFileStatus(path).isDirectory
>   }
>   val directories = fs.globStatus(directoryPath, directoryFilter).map(_.getPath)
>   val newFiles = directories.flatMap(dir =>
>     fs.listStatus(dir, newFileFilter).map(_.getPath.toString))
> This is not optimized, as it always requires two accesses. In addition, this 
> seems to be buggy.
> I have an S3 bucket “mydata” with objects “a.csv” and “b.csv”. I noticed that 
> fs.globStatus(“s3a://mydata/”, directoryFilter).map(_.getPath) returned 0 
> directories, so “a.csv” and “b.csv” were not picked up by Spark.
> I tried setting the path to “s3a://mydata/*” and it didn't work either.
> I experienced the same problematic behavior with the local file system when I 
> tried to stream from “/Users/streaming/*”.
> I suggest changing the code in Spark so that it first performs the listing 
> without directoryFilter, which does not seem to be needed at all. The code 
> could be
>   val directoriesOrfiles = fs.globStatus(directoryPath).map(_.getPath)
> The flow would then be, for each entry in directoriesOrfiles:
>  * If it is a data object: Spark applies newFileFilter to the returned objects
>  * If it is a directory: the existing code performs an additional listing at 
> the directory level
> This way it will pick up both files at the root of the path and the contents 
> of directories.
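
A minimal sketch of that proposed flow, using fs, directoryPath and newFileFilter as referenced in the description above:

{code:scala}
// Sketch of the proposal: list once without the directory filter,
// then descend only into the entries that are directories.
val entries = fs.globStatus(directoryPath)
val newFiles = entries.flatMap { status =>
  if (status.isDirectory) {
    // existing behavior: list the directory and apply the new-file filter
    fs.listStatus(status.getPath, newFileFilter).map(_.getPath.toString).toSeq
  } else if (newFileFilter.accept(status.getPath)) {
    // additional behavior: also pick up plain files at the root of the path
    Seq(status.getPath.toString)
  } else {
    Seq.empty[String]
  }
}
{code}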






[jira] [Commented] (SPARK-25155) Streaming from storage doesn't work when no directories exists

2018-08-21 Thread Gil Vernik (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588412#comment-16588412
 ] 

Gil Vernik commented on SPARK-25155:


[~ste...@apache.org] thanks for the input. This one, 
[https://github.com/apache/spark/pull/17745], is perfect! 
But you also created [https://github.com/apache/spark/pull/14731]; what is the 
difference between the two? [https://github.com/apache/spark/pull/17745] is 
so small and simple, and yet provides great benefit.

[~dongjoon] Somehow I can't reproduce the issue I saw with the master branch. 
I need to run more experiments. Closing this issue for now.

> Streaming from storage doesn't work when no directories exists
> --
>
> Key: SPARK-25155
> URL: https://issues.apache.org/jira/browse/SPARK-25155
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Gil Vernik
>Priority: Minor
>
> I have an issue related to the `findNewFiles` method of 
> `org.apache.spark.streaming.dstream.FileInputDStream`.
> Streaming for the given path is supposed to pick up new files only (based on 
> the previous run's timestamp). However, the code in Spark first obtains the 
> directories, then finds new files within each directory. Here is the 
> relevant code:
>   val directoryFilter = new PathFilter {
>     override def accept(path: Path): Boolean = fs.getFileStatus(path).isDirectory
>   }
>   val directories = fs.globStatus(directoryPath, directoryFilter).map(_.getPath)
>   val newFiles = directories.flatMap(dir =>
>     fs.listStatus(dir, newFileFilter).map(_.getPath.toString))
> This is not optimized, as it always requires two accesses. In addition, this 
> seems to be buggy.
> I have an S3 bucket “mydata” with objects “a.csv” and “b.csv”. I noticed that 
> fs.globStatus(“s3a://mydata/”, directoryFilter).map(_.getPath) returned 0 
> directories, so “a.csv” and “b.csv” were not picked up by Spark.
> I tried setting the path to “s3a://mydata/*” and it didn't work either.
> I experienced the same problematic behavior with the local file system when I 
> tried to stream from “/Users/streaming/*”.
> I suggest changing the code in Spark so that it first performs the listing 
> without directoryFilter, which does not seem to be needed at all. The code 
> could be
>   val directoriesOrfiles = fs.globStatus(directoryPath).map(_.getPath)
> The flow would then be, for each entry in directoriesOrfiles:
>  * If it is a data object: Spark applies newFileFilter to the returned objects
>  * If it is a directory: the existing code performs an additional listing at 
> the directory level
> This way it will pick up both files at the root of the path and the contents 
> of directories.






[jira] [Updated] (SPARK-25159) json schema inference should only trigger one job

2018-08-21 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25159:

Fix Version/s: 2.4.0

> json schema inference should only trigger one job
> -
>
> Key: SPARK-25159
> URL: https://issues.apache.org/jira/browse/SPARK-25159
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.4.0
>
>







[jira] [Resolved] (SPARK-25159) json schema inference should only trigger one job

2018-08-21 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25159.
-
  Resolution: Fixed
Target Version/s: 2.4.0

> json schema inference should only trigger one job
> -
>
> Key: SPARK-25159
> URL: https://issues.apache.org/jira/browse/SPARK-25159
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>







[jira] [Resolved] (SPARK-25140) Add optional logging to UnsafeProjection.create when it falls back to interpreted mode

2018-08-21 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25140.
-
   Resolution: Fixed
 Assignee: Takeshi Yamamuro
Fix Version/s: 2.4.0

> Add optional logging to UnsafeProjection.create when it falls back to 
> interpreted mode
> --
>
> Key: SPARK-25140
> URL: https://issues.apache.org/jira/browse/SPARK-25140
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.4.0
>
>
> SPARK-23711 implemented a nice graceful handling of allowing UnsafeProjection 
> to fall back to an interpreter mode when codegen fails. That makes Spark much 
> more usable even when codegen is unable to handle the given query.
> But in its current form, the fallback handling can also be a mystery in terms 
> of performance cliffs. Users may be left wondering why a query runs fine with 
> some expressions, but then with just one extra expression the performance 
> goes 2x, 3x (or more) slower.
> It'd be nice to have optional logging of the fallback behavior, so that 
> users who care about monitoring performance cliffs can opt in to logging 
> when a fallback to interpreter mode is taken, i.e. at
> https://github.com/apache/spark/blob/a40ffc656d62372da85e0fa932b67207839e7fde/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala#L183
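
For illustration, a minimal sketch of such opt-in logging. The config key and wrapper below are made up, and the real change would live inside Spark at the fallback point linked above:

{code:scala}
// Hypothetical sketch of opt-in fallback logging (config key is an assumption).
import org.apache.spark.internal.Logging
import org.apache.spark.sql.internal.SQLConf

object CodegenFallbackLogging extends Logging {
  // run the codegen path; if it fails, optionally log before taking the interpreted path
  def withFallback[T](codegen: => T)(interpreted: => T): T = {
    try codegen catch {
      case e: Exception =>
        if (SQLConf.get.getConfString("spark.sql.codegen.logFallbackToInterpreter", "false").toBoolean) {
          logWarning("Expression codegen failed; falling back to interpreter mode", e)
        }
        interpreted
    }
  }
}
{code}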






[jira] [Created] (SPARK-25185) CBO rowcount statistics doesn't work for partitioned parquet external table

2018-08-21 Thread Amit (JIRA)
Amit created SPARK-25185:


 Summary: CBO rowcount statistics doesn't work for partitioned 
parquet external table
 Key: SPARK-25185
 URL: https://issues.apache.org/jira/browse/SPARK-25185
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, SQL
Affects Versions: 2.3.0, 2.2.1
 Environment: Created dummy partitioned data with a partition column of 
string type (col1=a and col1=b).

Added CSV data -> read it through Spark -> created a partitioned external table -> 
ran MSCK REPAIR TABLE to load the partitions. Ran ANALYZE on all columns, including 
the partition column.

~println(spark.sql("select * from test_p where e='1a'").queryExecution.toStringWithStats)~
~val op = spark.sql("select * from test_p where e='1a'").queryExecution.optimizedPlan~

// e is the partition column
~val stat = op.stats(spark.sessionState.conf)~
~print(stat.rowCount)~

When the table is created the same way, the row count comes up correctly for CSV, 
but for Parquet it shows as None.

 
Reporter: Amit
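
For reference, a minimal way to check the reported behavior (table and column names assumed from the snippet above; CBO must be enabled):

{code:scala}
// Sketch of the check, using the assumed table test_p and partition column e.
spark.sql("SET spark.sql.cbo.enabled=true")
spark.sql("ANALYZE TABLE test_p COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE test_p COMPUTE STATISTICS FOR COLUMNS e")

val plan = spark.sql("SELECT * FROM test_p WHERE e = '1a'").queryExecution.optimizedPlan
// 2.3+ exposes stats without a conf argument; 2.2 used op.stats(spark.sessionState.conf).
println(plan.stats.rowCount)   // reported as Some(n) for the CSV table but None for Parquet
{code}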









[jira] [Commented] (SPARK-24763) Remove redundant key data from value in streaming aggregation

2018-08-21 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588389#comment-16588389
 ] 

Jungtaek Lim commented on SPARK-24763:
--

[~tdas] Got it. Thanks for the input.

> Remove redundant key data from value in streaming aggregation
> -
>
> Key: SPARK-24763
> URL: https://issues.apache.org/jira/browse/SPARK-24763
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 2.4.0, 3.0.0
>
>
> The key/value of state in streaming aggregation is formatted as below:
>  * key: UnsafeRow containing the group-by fields
>  * value: UnsafeRow containing the key fields plus additional fields for the 
> aggregation results
> so the data for the key is stored in both the key and the value.
> This is done to boost performance by avoiding a projection of the row to the 
> value while storing, and a join of key and value to restore the original row 
> while reading. But in a simple benchmark test I found it not much faster than 
> "project and join". (I will paste the test result in a comment.)
> So I would propose a new option: remove the redundant key data in stateful 
> aggregation. I'm avoiding modifying the default behavior of stateful 
> aggregation, because the state value would not be compatible between the 
> current behavior and the behavior with the option enabled.
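
For illustration, a standalone sketch of the "project and join" alternative being benchmarked; the helper names are hypothetical and this is not the actual state store code:

{code:scala}
// Store only the non-key fields as the state value, and rebuild the full row on read.
import org.apache.spark.sql.catalyst.expressions.{Attribute, JoinedRow, UnsafeProjection, UnsafeRow}

def stateProjections(keyAttrs: Seq[Attribute], inputAttrs: Seq[Attribute])
    : (UnsafeProjection, UnsafeProjection) = {
  val valueAttrs = inputAttrs.filterNot(keyAttrs.contains)
  // drop the key columns from the row before writing it as the state value
  val toValue = UnsafeProjection.create(valueAttrs, inputAttrs)
  // restore the original column order from (key ++ value) when reading back
  val restore = UnsafeProjection.create(inputAttrs, keyAttrs ++ valueAttrs)
  (toValue, restore)
}

def restoreRow(key: UnsafeRow, value: UnsafeRow, restore: UnsafeProjection): UnsafeRow =
  restore(new JoinedRow(key, value))
{code}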






[jira] [Assigned] (SPARK-25184) Flaky test: FlatMapGroupsWithState "streaming with processing time timeout"

2018-08-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25184:


Assignee: (was: Apache Spark)

> Flaky test: FlatMapGroupsWithState "streaming with processing time timeout"
> ---
>
> Key: SPARK-25184
> URL: https://issues.apache.org/jira/browse/SPARK-25184
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 2.3.2
>Reporter: Tathagata Das
>Priority: Minor
>
> {code}
> Assert on query failed: Check total state rows = List(1), updated state rows 
> = List(2): Array() did not equal List(1) incorrect total rows, recent 
> progresses:
> {
>   "id" : "3598002e-0120-4937-8a36-226e0af992b6",
>   "runId" : "e7efe911-72fb-48aa-ba35-775057eabe55",
>   "name" : null,
>   "timestamp" : "1970-01-01T00:00:12.000Z",
>   "batchId" : 3,
>   "numInputRows" : 0,
>   "durationMs" : {
> "getEndOffset" : 0,
> "setOffsetRange" : 0,
> "triggerExecution" : 0
>   },
>   "stateOperators" : [ ],
>   "sources" : [ {
> "description" : "MemoryStream[value#474622]",
> "startOffset" : 2,
> "endOffset" : 2,
> "numInputRows" : 0
>   } ],
>   "sink" : {
> "description" : "MemorySink"
>   }
> }
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
>   org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   
> org.apache.spark.sql.streaming.StateStoreMetricsTest$$anonfun$assertNumStateRows$1.apply(StateStoreMetricsTest.scala:55)
>   
> org.apache.spark.sql.streaming.StateStoreMetricsTest$$anonfun$assertNumStateRows$1.apply(StateStoreMetricsTest.scala:33)
>   
> org.apache.spark.sql.streaming.StreamTest$$anonfun$executeAction$1$11.apply$mcZ$sp(StreamTest.scala:657)
>   
> org.apache.spark.sql.streaming.StreamTest$class.verify$1(StreamTest.scala:428)
>   
> org.apache.spark.sql.streaming.StreamTest$class.executeAction$1(StreamTest.scala:657)
>   
> org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:775)
>   
> org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:762)
> == Progress ==
>
> StartStream(ProcessingTime(1000),org.apache.spark.sql.streaming.util.StreamManualClock@5a12704,Map(),null)
>AddData to MemoryStream[value#474622]: a
>AdvanceManualClock(1000)
>CheckNewAnswer: [a,1]
>AssertOnQuery(, Check total state rows = List(1), updated state 
> rows = List(1))
>AddData to MemoryStream[value#474622]: b
>AdvanceManualClock(1000)
>CheckNewAnswer: [b,1]
>AssertOnQuery(, Check total state rows = List(2), updated state 
> rows = List(1))
>AddData to MemoryStream[value#474622]: b
>AdvanceManualClock(1)
>CheckNewAnswer: [a,-1],[b,2]
>AssertOnQuery(, Check total state rows = List(1), updated state 
> rows = List(2))
>StopStream
>
> StartStream(ProcessingTime(1000),org.apache.spark.sql.streaming.util.StreamManualClock@5a12704,Map(),null)
>AddData to MemoryStream[value#474622]: c
>AdvanceManualClock(11000)
>CheckNewAnswer: [b,-1],[c,1]
> => AssertOnQuery(, Check total state rows = List(1), updated state 
> rows = List(2))
>AdvanceManualClock(12000)
>AssertOnQuery(, )
>AssertOnQuery(, name)
>CheckNewAnswer: [c,-1]
>AssertOnQuery(, Check total state rows = List(0), updated state 
> rows = List(0))
> == Stream ==
> Output Mode: Update
> Stream state: {MemoryStream[value#474622]: 3}
> Thread state: alive
> Thread stack trace: java.lang.Object.wait(Native Method)
> org.apache.spark.util.ManualClock.waitTillTime(ManualClock.scala:61)
> org.apache.spark.sql.streaming.util.StreamManualClock.waitTillTime(StreamManualClock.scala:34)
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:65)
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:166)
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:293)
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:203)
> == Sink ==
> 0: [a,1]
> 1: [b,1]
> 2: [a,-1] [b,2]
> 3: [b,-1] [c,1]
> == Plan ==
> == Parsed Logical Plan ==
> SerializeFromObject [staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) 
> AS _1#474630, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, 
> StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, 
> true]))._2, 

[jira] [Assigned] (SPARK-25184) Flaky test: FlatMapGroupsWithState "streaming with processing time timeout"

2018-08-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25184:


Assignee: Apache Spark

> Flaky test: FlatMapGroupsWithState "streaming with processing time timeout"
> ---
>
> Key: SPARK-25184
> URL: https://issues.apache.org/jira/browse/SPARK-25184
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 2.3.2
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Minor
>
> {code}
> Assert on query failed: Check total state rows = List(1), updated state rows 
> = List(2): Array() did not equal List(1) incorrect total rows, recent 
> progresses:
> {
>   "id" : "3598002e-0120-4937-8a36-226e0af992b6",
>   "runId" : "e7efe911-72fb-48aa-ba35-775057eabe55",
>   "name" : null,
>   "timestamp" : "1970-01-01T00:00:12.000Z",
>   "batchId" : 3,
>   "numInputRows" : 0,
>   "durationMs" : {
> "getEndOffset" : 0,
> "setOffsetRange" : 0,
> "triggerExecution" : 0
>   },
>   "stateOperators" : [ ],
>   "sources" : [ {
> "description" : "MemoryStream[value#474622]",
> "startOffset" : 2,
> "endOffset" : 2,
> "numInputRows" : 0
>   } ],
>   "sink" : {
> "description" : "MemorySink"
>   }
> }
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
>   org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   
> org.apache.spark.sql.streaming.StateStoreMetricsTest$$anonfun$assertNumStateRows$1.apply(StateStoreMetricsTest.scala:55)
>   
> org.apache.spark.sql.streaming.StateStoreMetricsTest$$anonfun$assertNumStateRows$1.apply(StateStoreMetricsTest.scala:33)
>   
> org.apache.spark.sql.streaming.StreamTest$$anonfun$executeAction$1$11.apply$mcZ$sp(StreamTest.scala:657)
>   
> org.apache.spark.sql.streaming.StreamTest$class.verify$1(StreamTest.scala:428)
>   
> org.apache.spark.sql.streaming.StreamTest$class.executeAction$1(StreamTest.scala:657)
>   
> org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:775)
>   
> org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:762)
> == Progress ==
>
> StartStream(ProcessingTime(1000),org.apache.spark.sql.streaming.util.StreamManualClock@5a12704,Map(),null)
>AddData to MemoryStream[value#474622]: a
>AdvanceManualClock(1000)
>CheckNewAnswer: [a,1]
>AssertOnQuery(, Check total state rows = List(1), updated state 
> rows = List(1))
>AddData to MemoryStream[value#474622]: b
>AdvanceManualClock(1000)
>CheckNewAnswer: [b,1]
>AssertOnQuery(, Check total state rows = List(2), updated state 
> rows = List(1))
>AddData to MemoryStream[value#474622]: b
>AdvanceManualClock(1)
>CheckNewAnswer: [a,-1],[b,2]
>AssertOnQuery(, Check total state rows = List(1), updated state 
> rows = List(2))
>StopStream
>
> StartStream(ProcessingTime(1000),org.apache.spark.sql.streaming.util.StreamManualClock@5a12704,Map(),null)
>AddData to MemoryStream[value#474622]: c
>AdvanceManualClock(11000)
>CheckNewAnswer: [b,-1],[c,1]
> => AssertOnQuery(, Check total state rows = List(1), updated state 
> rows = List(2))
>AdvanceManualClock(12000)
>AssertOnQuery(, )
>AssertOnQuery(, name)
>CheckNewAnswer: [c,-1]
>AssertOnQuery(, Check total state rows = List(0), updated state 
> rows = List(0))
> == Stream ==
> Output Mode: Update
> Stream state: {MemoryStream[value#474622]: 3}
> Thread state: alive
> Thread stack trace: java.lang.Object.wait(Native Method)
> org.apache.spark.util.ManualClock.waitTillTime(ManualClock.scala:61)
> org.apache.spark.sql.streaming.util.StreamManualClock.waitTillTime(StreamManualClock.scala:34)
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:65)
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:166)
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:293)
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:203)
> == Sink ==
> 0: [a,1]
> 1: [b,1]
> 2: [a,-1] [b,2]
> 3: [b,-1] [c,1]
> == Plan ==
> == Parsed Logical Plan ==
> SerializeFromObject [staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) 
> AS _1#474630, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, 
> StringType, fromString, assertnotnull(assertnotnull(input[0, 

[jira] [Commented] (SPARK-25184) Flaky test: FlatMapGroupsWithState "streaming with processing time timeout"

2018-08-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588385#comment-16588385
 ] 

Apache Spark commented on SPARK-25184:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/22182

> Flaky test: FlatMapGroupsWithState "streaming with processing time timeout"
> ---
>
> Key: SPARK-25184
> URL: https://issues.apache.org/jira/browse/SPARK-25184
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 2.3.2
>Reporter: Tathagata Das
>Priority: Minor
>
> {code}
> Assert on query failed: Check total state rows = List(1), updated state rows 
> = List(2): Array() did not equal List(1) incorrect total rows, recent 
> progresses:
> {
>   "id" : "3598002e-0120-4937-8a36-226e0af992b6",
>   "runId" : "e7efe911-72fb-48aa-ba35-775057eabe55",
>   "name" : null,
>   "timestamp" : "1970-01-01T00:00:12.000Z",
>   "batchId" : 3,
>   "numInputRows" : 0,
>   "durationMs" : {
> "getEndOffset" : 0,
> "setOffsetRange" : 0,
> "triggerExecution" : 0
>   },
>   "stateOperators" : [ ],
>   "sources" : [ {
> "description" : "MemoryStream[value#474622]",
> "startOffset" : 2,
> "endOffset" : 2,
> "numInputRows" : 0
>   } ],
>   "sink" : {
> "description" : "MemorySink"
>   }
> }
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
>   org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   
> org.apache.spark.sql.streaming.StateStoreMetricsTest$$anonfun$assertNumStateRows$1.apply(StateStoreMetricsTest.scala:55)
>   
> org.apache.spark.sql.streaming.StateStoreMetricsTest$$anonfun$assertNumStateRows$1.apply(StateStoreMetricsTest.scala:33)
>   
> org.apache.spark.sql.streaming.StreamTest$$anonfun$executeAction$1$11.apply$mcZ$sp(StreamTest.scala:657)
>   
> org.apache.spark.sql.streaming.StreamTest$class.verify$1(StreamTest.scala:428)
>   
> org.apache.spark.sql.streaming.StreamTest$class.executeAction$1(StreamTest.scala:657)
>   
> org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:775)
>   
> org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:762)
> == Progress ==
>
> StartStream(ProcessingTime(1000),org.apache.spark.sql.streaming.util.StreamManualClock@5a12704,Map(),null)
>AddData to MemoryStream[value#474622]: a
>AdvanceManualClock(1000)
>CheckNewAnswer: [a,1]
>AssertOnQuery(, Check total state rows = List(1), updated state 
> rows = List(1))
>AddData to MemoryStream[value#474622]: b
>AdvanceManualClock(1000)
>CheckNewAnswer: [b,1]
>AssertOnQuery(, Check total state rows = List(2), updated state 
> rows = List(1))
>AddData to MemoryStream[value#474622]: b
>AdvanceManualClock(1)
>CheckNewAnswer: [a,-1],[b,2]
>AssertOnQuery(, Check total state rows = List(1), updated state 
> rows = List(2))
>StopStream
>
> StartStream(ProcessingTime(1000),org.apache.spark.sql.streaming.util.StreamManualClock@5a12704,Map(),null)
>AddData to MemoryStream[value#474622]: c
>AdvanceManualClock(11000)
>CheckNewAnswer: [b,-1],[c,1]
> => AssertOnQuery(, Check total state rows = List(1), updated state 
> rows = List(2))
>AdvanceManualClock(12000)
>AssertOnQuery(, )
>AssertOnQuery(, name)
>CheckNewAnswer: [c,-1]
>AssertOnQuery(, Check total state rows = List(0), updated state 
> rows = List(0))
> == Stream ==
> Output Mode: Update
> Stream state: {MemoryStream[value#474622]: 3}
> Thread state: alive
> Thread stack trace: java.lang.Object.wait(Native Method)
> org.apache.spark.util.ManualClock.waitTillTime(ManualClock.scala:61)
> org.apache.spark.sql.streaming.util.StreamManualClock.waitTillTime(StreamManualClock.scala:34)
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:65)
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:166)
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:293)
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:203)
> == Sink ==
> 0: [a,1]
> 1: [b,1]
> 2: [a,-1] [b,2]
> 3: [b,-1] [c,1]
> == Plan ==
> == Parsed Logical Plan ==
> SerializeFromObject [staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) 
> AS _1#474630, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, 
> 

[jira] [Created] (SPARK-25184) Flaky test: FlatMapGroupsWithState "streaming with processing time timeout"

2018-08-21 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-25184:
-

 Summary: Flaky test: FlatMapGroupsWithState "streaming with 
processing time timeout"
 Key: SPARK-25184
 URL: https://issues.apache.org/jira/browse/SPARK-25184
 Project: Spark
  Issue Type: Test
  Components: Structured Streaming
Affects Versions: 2.3.2
Reporter: Tathagata Das


{code}
Assert on query failed: Check total state rows = List(1), updated state rows = 
List(2): Array() did not equal List(1) incorrect total rows, recent progresses:
{
  "id" : "3598002e-0120-4937-8a36-226e0af992b6",
  "runId" : "e7efe911-72fb-48aa-ba35-775057eabe55",
  "name" : null,
  "timestamp" : "1970-01-01T00:00:12.000Z",
  "batchId" : 3,
  "numInputRows" : 0,
  "durationMs" : {
"getEndOffset" : 0,
"setOffsetRange" : 0,
"triggerExecution" : 0
  },
  "stateOperators" : [ ],
  "sources" : [ {
"description" : "MemoryStream[value#474622]",
"startOffset" : 2,
"endOffset" : 2,
"numInputRows" : 0
  } ],
  "sink" : {
"description" : "MemorySink"
  }
}
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)

org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)

org.apache.spark.sql.streaming.StateStoreMetricsTest$$anonfun$assertNumStateRows$1.apply(StateStoreMetricsTest.scala:55)

org.apache.spark.sql.streaming.StateStoreMetricsTest$$anonfun$assertNumStateRows$1.apply(StateStoreMetricsTest.scala:33)

org.apache.spark.sql.streaming.StreamTest$$anonfun$executeAction$1$11.apply$mcZ$sp(StreamTest.scala:657)

org.apache.spark.sql.streaming.StreamTest$class.verify$1(StreamTest.scala:428)

org.apache.spark.sql.streaming.StreamTest$class.executeAction$1(StreamTest.scala:657)

org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:775)

org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:762)


== Progress ==
   
StartStream(ProcessingTime(1000),org.apache.spark.sql.streaming.util.StreamManualClock@5a12704,Map(),null)
   AddData to MemoryStream[value#474622]: a
   AdvanceManualClock(1000)
   CheckNewAnswer: [a,1]
   AssertOnQuery(, Check total state rows = List(1), updated state 
rows = List(1))
   AddData to MemoryStream[value#474622]: b
   AdvanceManualClock(1000)
   CheckNewAnswer: [b,1]
   AssertOnQuery(, Check total state rows = List(2), updated state 
rows = List(1))
   AddData to MemoryStream[value#474622]: b
   AdvanceManualClock(1)
   CheckNewAnswer: [a,-1],[b,2]
   AssertOnQuery(, Check total state rows = List(1), updated state 
rows = List(2))
   StopStream
   
StartStream(ProcessingTime(1000),org.apache.spark.sql.streaming.util.StreamManualClock@5a12704,Map(),null)
   AddData to MemoryStream[value#474622]: c
   AdvanceManualClock(11000)
   CheckNewAnswer: [b,-1],[c,1]
=> AssertOnQuery(, Check total state rows = List(1), updated state 
rows = List(2))
   AdvanceManualClock(12000)
   AssertOnQuery(, )
   AssertOnQuery(, name)
   CheckNewAnswer: [c,-1]
   AssertOnQuery(, Check total state rows = List(0), updated state 
rows = List(0))

== Stream ==
Output Mode: Update
Stream state: {MemoryStream[value#474622]: 3}
Thread state: alive
Thread stack trace: java.lang.Object.wait(Native Method)
org.apache.spark.util.ManualClock.waitTillTime(ManualClock.scala:61)
org.apache.spark.sql.streaming.util.StreamManualClock.waitTillTime(StreamManualClock.scala:34)
org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:65)
org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:166)
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:293)
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:203)


== Sink ==
0: [a,1]
1: [b,1]
2: [a,-1] [b,2]
3: [b,-1] [c,1]


== Plan ==
== Parsed Logical Plan ==
SerializeFromObject [staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) AS 
_1#474630, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, 
StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, 
true]))._2, true, false) AS _2#474631]
+- FlatMapGroupsWithState , cast(value#474625 as string).toString, 
cast(value#474622 as string).toString, [value#474625], [value#474622], 
obj#474629: scala.Tuple2, class[count[0]: bigint], Update, false, 
ProcessingTimeTimeout
   +- AppendColumns , class java.lang.String, 
[StructField(value,StringType,true)], cast(value#474622 as string).toString, 
[staticinvoke(class org.apache.spark.unsafe.types.UTF8String, 

[jira] [Assigned] (SPARK-25163) Flaky test: o.a.s.util.collection.ExternalAppendOnlyMapSuite.spilling with compression

2018-08-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25163:


Assignee: Apache Spark

> Flaky test: o.a.s.util.collection.ExternalAppendOnlyMapSuite.spilling with 
> compression
> --
>
> Key: SPARK-25163
> URL: https://issues.apache.org/jira/browse/SPARK-25163
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Major
>
> I saw it fail multiple times on Jenkins:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/4813/testReport/junit/org.apache.spark.util.collection/ExternalAppendOnlyMapSuite/spilling_with_compression/






[jira] [Assigned] (SPARK-25163) Flaky test: o.a.s.util.collection.ExternalAppendOnlyMapSuite.spilling with compression

2018-08-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25163:


Assignee: (was: Apache Spark)

> Flaky test: o.a.s.util.collection.ExternalAppendOnlyMapSuite.spilling with 
> compression
> --
>
> Key: SPARK-25163
> URL: https://issues.apache.org/jira/browse/SPARK-25163
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Priority: Major
>
> I saw it fail multiple times on Jenkins:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/4813/testReport/junit/org.apache.spark.util.collection/ExternalAppendOnlyMapSuite/spilling_with_compression/






[jira] [Commented] (SPARK-25163) Flaky test: o.a.s.util.collection.ExternalAppendOnlyMapSuite.spilling with compression

2018-08-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588296#comment-16588296
 ] 

Apache Spark commented on SPARK-25163:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/22181

> Flaky test: o.a.s.util.collection.ExternalAppendOnlyMapSuite.spilling with 
> compression
> --
>
> Key: SPARK-25163
> URL: https://issues.apache.org/jira/browse/SPARK-25163
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Priority: Major
>
> I saw it fail multiple times on Jenkins:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/4813/testReport/junit/org.apache.spark.util.collection/ExternalAppendOnlyMapSuite/spilling_with_compression/






[jira] [Assigned] (SPARK-25174) ApplicationMaster suspends when unregistering itself from RM with extreme large diagnostic message

2018-08-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25174:


Assignee: Apache Spark

> ApplicationMaster suspends when unregistering itself from RM with extreme 
> large diagnostic message
> --
>
> Key: SPARK-25174
> URL: https://issues.apache.org/jira/browse/SPARK-25174
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.1
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
>
> We recently ran into SPARK-18016, which has been fixed in v2.3.0. This JIRA is 
> not about the issue in SPARK-18016 itself but about a side effect it brings. When 
> SPARK-18016 occurs, the ApplicationMaster fails to unregister itself because the 
> exception contains an extremely large amount of error information.
> {code:java}
> ERROR yarn.ApplicationMaster: User class threw exception: 
> java.lang.RuntimeException: Error while decoding: 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Constant pool has grown 
> past JVM limit of 0x
> /* 001 */ public java.lang.Object generate(Object[] references) {
> 
> /* 395656 */   mutableRow.update(0, value);
> /* 395657 */ }
> /* 395658 */
> /* 395659 */ return mutableRow;
> /* 395660 */   }
> /* 395661 */ }
> {code}
> The above codegen text is included in the final message the AM uses to wave 
> goodbye to the RM, and it ends up crashing the RM's ZKRMStateStore because 
> YARN-6125 does not cover message truncation for unregisterApplicationMaster. 
> We also created a JIRA on the YARN side: 
> https://issues.apache.org/jira/browse/YARN-8691 
> Although SPARK-18016 is already fixed, there may be other uncaught exceptions 
> that cause this problem. I suggest that we limit the size of the error message 
> sent to the RM while unregistering the AM.
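
A minimal sketch of the suggested guard; the size limit and helper name are hypothetical, and the real change would apply where the AM builds the diagnostics passed to unregisterApplicationMaster:

{code:scala}
// Hypothetical sketch: cap the diagnostics string so a huge codegen dump cannot
// overwhelm the RM state store. The 64 KB limit is an assumption.
def truncateDiagnostics(diagnostics: String, maxBytes: Int = 64 * 1024): String = {
  val bytes = diagnostics.getBytes("UTF-8")
  if (bytes.length <= maxBytes) {
    diagnostics
  } else {
    new String(bytes, 0, maxBytes, "UTF-8") +
      s"\n... (${bytes.length - maxBytes} bytes of diagnostics truncated)"
  }
}
{code}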






[jira] [Commented] (SPARK-25174) ApplicationMaster suspends when unregistering itself from RM with extreme large diagnostic message

2018-08-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588283#comment-16588283
 ] 

Apache Spark commented on SPARK-25174:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/22180

> ApplicationMaster suspends when unregistering itself from RM with extreme 
> large diagnostic message
> --
>
> Key: SPARK-25174
> URL: https://issues.apache.org/jira/browse/SPARK-25174
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.1
>Reporter: Kent Yao
>Priority: Major
>
> We recently ran into SPARK-18016, which has been fixed in v2.3.0. This JIRA is 
> not about the issue in SPARK-18016 itself but about a side effect it brings. When 
> SPARK-18016 occurs, the ApplicationMaster fails to unregister itself because the 
> exception contains an extremely large amount of error information.
> {code:java}
> ERROR yarn.ApplicationMaster: User class threw exception: 
> java.lang.RuntimeException: Error while decoding: 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Constant pool has grown 
> past JVM limit of 0x
> /* 001 */ public java.lang.Object generate(Object[] references) {
> 
> /* 395656 */   mutableRow.update(0, value);
> /* 395657 */ }
> /* 395658 */
> /* 395659 */ return mutableRow;
> /* 395660 */   }
> /* 395661 */ }
> {code}
> The above codegen text is included in the final message the AM uses to wave 
> goodbye to the RM, and it ends up crashing the RM's ZKRMStateStore because 
> YARN-6125 does not cover message truncation for unregisterApplicationMaster. 
> We also created a JIRA on the YARN side: 
> https://issues.apache.org/jira/browse/YARN-8691 
> Although SPARK-18016 is already fixed, there may be other uncaught exceptions 
> that cause this problem. I suggest that we limit the size of the error message 
> sent to the RM while unregistering the AM.






[jira] [Assigned] (SPARK-25174) ApplicationMaster suspends when unregistering itself from RM with extreme large diagnostic message

2018-08-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25174:


Assignee: (was: Apache Spark)

> ApplicationMaster suspends when unregistering itself from RM with extreme 
> large diagnostic message
> --
>
> Key: SPARK-25174
> URL: https://issues.apache.org/jira/browse/SPARK-25174
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.1
>Reporter: Kent Yao
>Priority: Major
>
> We recently ran into SPARK-18016, which has been fixed in v2.3.0. This JIRA is 
> not about the issue in SPARK-18016 itself but about a side effect it brings. When 
> SPARK-18016 occurs, the ApplicationMaster fails to unregister itself because the 
> exception contains an extremely large amount of error information.
> {code:java}
> ERROR yarn.ApplicationMaster: User class threw exception: 
> java.lang.RuntimeException: Error while decoding: 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Constant pool has grown 
> past JVM limit of 0x
> /* 001 */ public java.lang.Object generate(Object[] references) {
> 
> /* 395656 */   mutableRow.update(0, value);
> /* 395657 */ }
> /* 395658 */
> /* 395659 */ return mutableRow;
> /* 395660 */   }
> /* 395661 */ }
> {code}
> The above codegen text is included in the final message the AM uses to wave 
> goodbye to the RM, and it ends up crashing the RM's ZKRMStateStore because 
> YARN-6125 does not cover message truncation for unregisterApplicationMaster. 
> We also created a JIRA on the YARN side: 
> https://issues.apache.org/jira/browse/YARN-8691 
> Although SPARK-18016 is already fixed, there may be other uncaught exceptions 
> that cause this problem. I suggest that we limit the size of the error message 
> sent to the RM while unregistering the AM.






[jira] [Commented] (SPARK-24763) Remove redundant key data from value in streaming aggregation

2018-08-21 Thread Tathagata Das (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588234#comment-16588234
 ] 

Tathagata Das commented on SPARK-24763:
---

They will be. The merge script always puts the major version (i.e. 3.0.0)
there. Those will be redirected to 2.4.0 as well when we make the 2.4.0
release.




> Remove redundant key data from value in streaming aggregation
> -
>
> Key: SPARK-24763
> URL: https://issues.apache.org/jira/browse/SPARK-24763
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 2.4.0, 3.0.0
>
>
> The key/value of state in streaming aggregation is formatted as below:
>  * key: UnsafeRow containing the group-by fields
>  * value: UnsafeRow containing the key fields plus additional fields for the 
> aggregation results
> so the data for the key is stored in both the key and the value.
> This is done to boost performance by avoiding a projection of the row to the 
> value while storing, and a join of key and value to restore the original row 
> while reading. But in a simple benchmark test I found it not much faster than 
> "project and join". (I will paste the test result in a comment.)
> So I would propose a new option: remove the redundant key data in stateful 
> aggregation. I'm avoiding modifying the default behavior of stateful 
> aggregation, because the state value would not be compatible between the 
> current behavior and the behavior with the option enabled.






[jira] [Commented] (SPARK-23131) Stackoverflow using ML and Kryo serializer

2018-08-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588232#comment-16588232
 ] 

Apache Spark commented on SPARK-23131:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/22179

> Stackoverflow using ML and Kryo serializer
> --
>
> Key: SPARK-23131
> URL: https://issues.apache.org/jira/browse/SPARK-23131
> Project: Spark
>  Issue Type: Wish
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Peigen
>Priority: Minor
>
> When trying to use the GeneralizedLinearRegression model with SparkConf set to 
> use KryoSerializer (JavaSerializer is fine), 
> it causes a StackOverflowError:
> {quote}Exception in thread "dispatcher-event-loop-34" 
> java.lang.StackOverflowError
>  at java.util.HashMap.hash(HashMap.java:338)
>  at java.util.HashMap.get(HashMap.java:556)
>  at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:61)
>  at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62)
>  at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62)
>  at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62)
>  at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62)
>  at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62)
>  at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62)
>  at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62)
>  at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62)
>  at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62)
>  at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62)
>  at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62)
>  at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62)
>  at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62)
> {quote}
> This is very likely 
> [https://github.com/EsotericSoftware/kryo/issues/341].
> Upgrading Kryo to 4.0+ could probably fix this.
>  
> Wish: upgrade the Kryo version for Spark.
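
For reference, a minimal configuration matching the report; the dataset path is the sample bundled with Spark, and other parameters are assumptions:

{code:scala}
// Sketch of the setup described above.
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.regression.GeneralizedLinearRegression

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("glr-kryo-repro")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

val training = spark.read.format("libsvm")
  .load("data/mllib/sample_linear_regression_data.txt")

val glr = new GeneralizedLinearRegression().setFamily("gaussian").setLink("identity")
val model = glr.fit(training)   // reportedly overflows the stack with Kryo on 2.2.0
{code}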






[jira] [Assigned] (SPARK-25127) DataSourceV2: Remove SupportsPushDownCatalystFilters

2018-08-21 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reassigned SPARK-25127:
---

Assignee: Reynold Xin

> DataSourceV2: Remove SupportsPushDownCatalystFilters
> 
>
> Key: SPARK-25127
> URL: https://issues.apache.org/jira/browse/SPARK-25127
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.4.0
>Reporter: Ryan Blue
>Assignee: Reynold Xin
>Priority: Major
>
> Discussion on the dev list about adding TableCatalog focused on whether 
> Expression should be used in the public DataSourceV2 API, with 
> SupportsPushDownCatalystFilters as an example of where it is already exposed. 
> The early consensus is that Expression should not be exposed in the public 
> API.
> From [~rxin]:
> bq. I completely disagree with using Expression in critical public APIs that 
> we expect a lot of developers to use . . . If we are depending on Expressions 
> on the more common APIs in dsv2 already, we should revisit that.
> The main use of this API is to pass Expression to FileFormat classes that 
> used Expression instead of Filter. External sources also use it for more 
> complex push-down, like {{to_date(ts) = '2018-05-13'}}, but those uses can be 
> done with Analyzer rules or when translating to Filters.
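
For illustration, a sketch of Filter-based push-down, which the public API already supports; interface names are as in the 2.3/2.4 DataSourceV2 reader API, and the reader is left abstract:

{code:scala}
// Push down the public Filter API rather than catalyst Expressions.
import org.apache.spark.sql.sources.{EqualTo, Filter}
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, SupportsPushDownFilters}

abstract class ExampleReader extends DataSourceReader with SupportsPushDownFilters {
  private var pushed: Array[Filter] = Array.empty

  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    // accept only the filters this source can evaluate; hand the rest back to Spark
    val (supported, unsupported) = filters.partition {
      case EqualTo(_, _) => true
      case _             => false
    }
    pushed = supported
    unsupported
  }

  override def pushedFilters(): Array[Filter] = pushed
}
{code}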






[jira] [Created] (SPARK-25183) Spark HiveServer2 registers shutdown hook with JVM, not ShutdownHookManager; race conditions can arise

2018-08-21 Thread Steve Loughran (JIRA)
Steve Loughran created SPARK-25183:
--

 Summary: Spark HiveServer2 registers shutdown hook with JVM, not 
ShutdownHookManager; race conditions can arise
 Key: SPARK-25183
 URL: https://issues.apache.org/jira/browse/SPARK-25183
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Steve Loughran


Spark's HiveServer2 registers a shutdown hook with the JVM via 
{{Runtime.addShutdownHook()}}, which can run in parallel with the 
ShutdownHookManager sequences of Spark and Hadoop, which execute their shutdown 
hooks in an ordered sequence.

This has some risks:

* The FS may shut down before the rename of logs completes (SPARK-6933).
* Delays in renames on object stores may block the FS close operation; on 
clusters with timeout hooks around FileSystem.closeAll() (HADOOP-12950), this 
can force a kill of that shutdown hook, among other problems.

General outcome: logs aren't present.

Proposed fix:

* Register the hook with {{org.apache.spark.util.ShutdownHookManager}}.
* HADOOP-15679, to make the shutdown wait time configurable so that O(data) 
renames don't trigger timeouts.
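
A minimal sketch of the first bullet, assuming the hook is registered from inside Spark's own code (the object is package-private); the cleanup body is a placeholder:

{code:scala}
// Sketch only: register HiveServer2 cleanup with Spark's ShutdownHookManager so it
// runs in the ordered shutdown sequence, instead of via a raw Runtime.addShutdownHook.
import org.apache.spark.util.ShutdownHookManager

ShutdownHookManager.addShutdownHook(ShutdownHookManager.SPARK_CONTEXT_SHUTDOWN_PRIORITY) { () =>
  // placeholder: stop HiveServer2 and flush/rename its event logs here
}
{code}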






[jira] [Commented] (SPARK-25164) Parquet reader builds entire list of columns once for each column

2018-08-21 Thread Bruce Robbins (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588168#comment-16588168
 ] 

Bruce Robbins commented on SPARK-25164:
---

[~viirya] Sure. I will try to get something up by tonight or tomorrow morning 
(Pacific Time).

> Parquet reader builds entire list of columns once for each column
> -
>
> Key: SPARK-25164
> URL: https://issues.apache.org/jira/browse/SPARK-25164
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> {{VectorizedParquetRecordReader.initializeInternal}} loops through each 
> column, and for each column it calls
> {noformat}
> requestedSchema.getColumns().get(i)
> {noformat}
> However, {{MessageType.getColumns}} will build the entire column list from 
> getPaths(0).
> {noformat}
>   public List<ColumnDescriptor> getColumns() {
>     List<String[]> paths = this.getPaths(0);
>     List<ColumnDescriptor> columns = new ArrayList<ColumnDescriptor>(paths.size());
>     for (String[] path : paths) {
>       // TODO: optimize this
>       PrimitiveType primitiveType = getType(path).asPrimitiveType();
>       columns.add(new ColumnDescriptor(
>           path,
>           primitiveType,
>           getMaxRepetitionLevel(path),
>           getMaxDefinitionLevel(path)));
>     }
>     return columns;
>   }
> {noformat}
> This means that for each parquet file, this routine indirectly iterates 
> colCount*colCount times.
> This is actually not particularly noticeable unless you have:
>  - many parquet files
>  - many columns
> To verify that this is an issue, I created a 1 million record parquet table 
> with 6000 columns of type double and 67 files (so initializeInternal is 
> called 67 times). I ran the following query:
> {noformat}
> sql("select * from 6000_1m_double where id1 = 1").collect
> {noformat}
> I used Spark from the master branch. I had 8 executor threads. The filter 
> returns only a few thousand records. The query ran (on average) for 6.4 
> minutes.
> Then I cached the column list at the top of {{initializeInternal}} as follows:
> {noformat}
> List<ColumnDescriptor> columnCache = requestedSchema.getColumns();
> {noformat}
> Then I changed {{initializeInternal}} to use {{columnCache}} rather than 
> {{requestedSchema.getColumns()}}.
> With the column cache variable, the same query runs in 5 minutes. So with my 
> simple query, you save 22% of the time by not rebuilding the column list for 
> each column.
> You get additional savings with a paths cache variable, now saving 34% in 
> total on the above query.






[jira] [Commented] (SPARK-25119) stages in wrong order within job page DAG chart

2018-08-21 Thread Yunjian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588160#comment-16588160
 ] 

Yunjian Zhang commented on SPARK-25119:
---

Created a PR, as below:

[https://github.com/apache/spark/pull/22177]

 

> stages in wrong order within job page DAG chart
> ---
>
> Key: SPARK-25119
> URL: https://issues.apache.org/jira/browse/SPARK-25119
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Yunjian Zhang
>Priority: Minor
> Attachments: Screen Shot 2018-08-14 at 3.35.34 PM.png
>
>
> Multiple stages for the same job are shown in the wrong order in the DAG 
> Visualization of the job page.
> e.g.
> stage27   stage19 stage20 stage24 stage21






[jira] [Comment Edited] (SPARK-25164) Parquet reader builds entire list of columns once for each column

2018-08-21 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588159#comment-16588159
 ] 

Liang-Chi Hsieh edited comment on SPARK-25164 at 8/21/18 11:30 PM:
---

This looks easy and good to have. [~bersprockets] Do you want to submit a PR 
for this?


was (Author: viirya):
This is easy and looks good to have. [~bersprockets] Do you want to submit a PR 
for this?

> Parquet reader builds entire list of columns once for each column
> -
>
> Key: SPARK-25164
> URL: https://issues.apache.org/jira/browse/SPARK-25164
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> {{VectorizedParquetRecordReader.initializeInternal}} loops through each 
> column, and for each column it calls
> {noformat}
> requestedSchema.getColumns().get(i)
> {noformat}
> However, {{MessageType.getColumns}} will build the entire column list from 
> getPaths(0).
> {noformat}
>   public List<ColumnDescriptor> getColumns() {
>     List<String[]> paths = this.getPaths(0);
>     List<ColumnDescriptor> columns = new ArrayList<ColumnDescriptor>(paths.size());
>     for (String[] path : paths) {
>       // TODO: optimize this
>       PrimitiveType primitiveType = getType(path).asPrimitiveType();
>       columns.add(new ColumnDescriptor(
>           path,
>           primitiveType,
>           getMaxRepetitionLevel(path),
>           getMaxDefinitionLevel(path)));
>     }
>     return columns;
>   }
> {noformat}
> This means that for each parquet file, this routine indirectly iterates 
> colCount*colCount times.
> This is actually not particularly noticeable unless you have:
>  - many parquet files
>  - many columns
> To verify that this is an issue, I created a 1 million record parquet table 
> with 6000 columns of type double and 67 files (so initializeInternal is 
> called 67 times). I ran the following query:
> {noformat}
> sql("select * from 6000_1m_double where id1 = 1").collect
> {noformat}
> I used Spark from the master branch. I had 8 executor threads. The filter 
> returns only a few thousand records. The query ran (on average) for 6.4 
> minutes.
> Then I cached the column list at the top of {{initializeInternal}} as follows:
> {noformat}
> List<ColumnDescriptor> columnCache = requestedSchema.getColumns();
> {noformat}
> Then I changed {{initializeInternal}} to use {{columnCache}} rather than 
> {{requestedSchema.getColumns()}}.
> With the column cache variable, the same query runs in 5 minutes. So with my 
> simple query, you save 22% of time by not rebuilding the column list for each 
> column.
> You get additional savings with a paths cache variable, now saving 34% in 
> total on the above query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25164) Parquet reader builds entire list of columns once for each column

2018-08-21 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588159#comment-16588159
 ] 

Liang-Chi Hsieh commented on SPARK-25164:
-

This is easy and looks good to have. [~bersprockets] Do you want to submit a PR 
for this?

> Parquet reader builds entire list of columns once for each column
> -
>
> Key: SPARK-25164
> URL: https://issues.apache.org/jira/browse/SPARK-25164
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> {{VectorizedParquetRecordReader.initializeInternal}} loops through each 
> column, and for each column it calls
> {noformat}
> requestedSchema.getColumns().get(i)
> {noformat}
> However, {{MessageType.getColumns}} will build the entire column list from 
> getPaths(0).
> {noformat}
>   public List<ColumnDescriptor> getColumns() {
> List<String[]> paths = this.getPaths(0);
> List<ColumnDescriptor> columns = new 
> ArrayList<ColumnDescriptor>(paths.size());
> for (String[] path : paths) {
>   // TODO: optimize this  
>   
>   PrimitiveType primitiveType = getType(path).asPrimitiveType();
>   columns.add(new ColumnDescriptor(
>   path,
>   primitiveType,
>   getMaxRepetitionLevel(path),
>   getMaxDefinitionLevel(path)));
> }
> return columns;
>   }
> {noformat}
> This means that for each parquet file, this routine indirectly iterates 
> colCount*colCount times.
> This is actually not particularly noticeable unless you have:
>  - many parquet files
>  - many columns
> To verify that this is an issue, I created a 1 million record parquet table 
> with 6000 columns of type double and 67 files (so initializeInternal is 
> called 67 times). I ran the following query:
> {noformat}
> sql("select * from 6000_1m_double where id1 = 1").collect
> {noformat}
> I used Spark from the master branch. I had 8 executor threads. The filter 
> returns only a few thousand records. The query ran (on average) for 6.4 
> minutes.
> Then I cached the column list at the top of {{initializeInternal}} as follows:
> {noformat}
> List<ColumnDescriptor> columnCache = requestedSchema.getColumns();
> {noformat}
> Then I changed {{initializeInternal}} to use {{columnCache}} rather than 
> {{requestedSchema.getColumns()}}.
> With the column cache variable, the same query runs in 5 minutes. So with my 
> simple query, you save 22% of time by not rebuilding the column list for each 
> column.
> You get additional savings with a paths cache variable, now saving 34% in 
> total on the above query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25181) Block Manager master and slave thread pools are unbounded

2018-08-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25181:


Assignee: (was: Apache Spark)

> Block Manager master and slave thread pools are unbounded
> -
>
> Key: SPARK-25181
> URL: https://issues.apache.org/jira/browse/SPARK-25181
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Currently, BlockManagerMasterEndpoint and BlockManagerSlaveEndpoint both have 
> thread pools with unbounded numbers of threads. In certain cases, this can 
> lead to driver OOM errors. We should add an upper bound on the number of 
> threads in these thread pools; this should not break any existing behavior 
> because they still have queues of size Integer.MAX_VALUE.
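A minimal Scala sketch of the proposed shape of the fix: a bounded number of 
threads in front of an effectively unbounded queue, so submissions are never 
rejected. The cap of 100 threads is a made-up value, not the one the eventual 
patch uses, and Spark's own ThreadUtils helpers are not shown here.

{code:scala}
import java.util.concurrent.{LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}

val maxThreads = 100  // hypothetical upper bound

// core == max threads, plus a queue of size Integer.MAX_VALUE: the pool can
// never grow past maxThreads, and extra work simply waits in the queue.
val pool = new ThreadPoolExecutor(
  maxThreads, maxThreads,
  60L, TimeUnit.SECONDS,
  new LinkedBlockingQueue[Runnable](Integer.MAX_VALUE))
pool.allowCoreThreadTimeOut(true)  // let idle threads exit instead of lingering

pool.execute(new Runnable {
  override def run(): Unit = println("handled on a bounded pool")
})
pool.shutdown()
{code}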



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25181) Block Manager master and slave thread pools are unbounded

2018-08-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25181:


Assignee: Apache Spark

> Block Manager master and slave thread pools are unbounded
> -
>
> Key: SPARK-25181
> URL: https://issues.apache.org/jira/browse/SPARK-25181
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Mukul Murthy
>Assignee: Apache Spark
>Priority: Major
>
> Currently, BlockManagerMasterEndpoint and BlockManagerSlaveEndpoint both have 
> thread pools with unbounded numbers of threads. In certain cases, this can 
> lead to driver OOM errors. We should add an upper bound on the number of 
> threads in these thread pools; this should not break any existing behavior 
> because they still have queues of size Integer.MAX_VALUE.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25181) Block Manager master and slave thread pools are unbounded

2018-08-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588154#comment-16588154
 ] 

Apache Spark commented on SPARK-25181:
--

User 'mukulmurthy' has created a pull request for this issue:
https://github.com/apache/spark/pull/22176

> Block Manager master and slave thread pools are unbounded
> -
>
> Key: SPARK-25181
> URL: https://issues.apache.org/jira/browse/SPARK-25181
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Currently, BlockManagerMasterEndpoint and BlockManagerSlaveEndpoint both have 
> thread pools with unbounded numbers of threads. In certain cases, this can 
> lead to driver OOM errors. We should add an upper bound on the number of 
> threads in these thread pools; this should not break any existing behavior 
> because they still have queues of size Integer.MAX_VALUE.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25182) Block Manager master and slave thread pools are unbounded

2018-08-21 Thread Mukul Murthy (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mukul Murthy resolved SPARK-25182.
--
  Resolution: Duplicate
Target Version/s:   (was: 2.4.0)

> Block Manager master and slave thread pools are unbounded
> -
>
> Key: SPARK-25182
> URL: https://issues.apache.org/jira/browse/SPARK-25182
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Currently, BlockManagerMasterEndpoint and BlockManagerSlaveEndpoint both have 
> thread pools with unbounded numbers of threads. In certain cases, this can 
> lead to driver OOM errors. We should add an upper bound on the number of 
> threads in these thread pools; this should not break any existing behavior 
> because they still have queues of size Integer.MAX_VALUE.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25182) Block Manager master and slave thread pools are unbounded

2018-08-21 Thread Mukul Murthy (JIRA)
Mukul Murthy created SPARK-25182:


 Summary: Block Manager master and slave thread pools are unbounded
 Key: SPARK-25182
 URL: https://issues.apache.org/jira/browse/SPARK-25182
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Mukul Murthy


Currently, BlockManagerMasterEndpoint and BlockManagerSlaveEndpoint both have 
thread pools with unbounded numbers of threads. In certain cases, this can lead 
to driver OOM errors. We should add an upper bound on the number of threads in 
these thread pools; this should not break any existing behavior because they 
still have queues of size Integer.MAX_VALUE.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25181) Block Manager master and slave thread pools are unbounded

2018-08-21 Thread Mukul Murthy (JIRA)
Mukul Murthy created SPARK-25181:


 Summary: Block Manager master and slave thread pools are unbounded
 Key: SPARK-25181
 URL: https://issues.apache.org/jira/browse/SPARK-25181
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Mukul Murthy


Currently, BlockManagerMasterEndpoint and BlockManagerSlaveEndpoint both have 
thread pools with unbounded numbers of threads. In certain cases, this can lead 
to driver OOM errors. We should add an upper bound on the number of threads in 
these thread pools; this should not break any existing behavior because they 
still have queues of size Integer.MAX_VALUE.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25114) RecordBinaryComparator may return wrong result when subtraction between two words is divisible by Integer.MAX_VALUE

2018-08-21 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588121#comment-16588121
 ] 

Xiao Li commented on SPARK-25114:
-

Let us update the fix version after the 2.2 fix is merged.

> RecordBinaryComparator may return wrong result when subtraction between two 
> words is divisible by Integer.MAX_VALUE
> ---
>
> Key: SPARK-25114
> URL: https://issues.apache.org/jira/browse/SPARK-25114
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.3.2, 2.4.0
>
>
> It is possible for two objects to be unequal and yet we consider them as 
> equal within RecordBinaryComparator, if the long values are separated by 
> Int.MaxValue.
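A small Scala sketch of the failure mode only; this is not Spark's actual 
RecordBinaryComparator code. It shows why folding a 64-bit difference into a 
32-bit result can report two unequal words as equal, and why a plain Long 
comparison does not.

{code:scala}
// Broken: reduces the Long difference modulo Int.MaxValue, which is 0 whenever
// the difference is an exact multiple of Int.MaxValue.
def brokenCompare(left: Long, right: Long): Int =
  ((left - right) % Integer.MAX_VALUE).toInt

// Safe: compare the Long values directly.
def safeCompare(left: Long, right: Long): Int =
  java.lang.Long.compare(left, right)

val a = Int.MaxValue.toLong
val b = 0L
assert(brokenCompare(a, b) == 0)  // unequal values wrongly compare as equal
assert(safeCompare(a, b) > 0)     // correct ordering
{code}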



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25114) RecordBinaryComparator may return wrong result when subtraction between two words is divisible by Integer.MAX_VALUE

2018-08-21 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25114.
-
   Resolution: Fixed
 Assignee: Jiang Xingbo
Fix Version/s: 2.4.0
   2.3.2

> RecordBinaryComparator may return wrong result when subtraction between two 
> words is divisible by Integer.MAX_VALUE
> ---
>
> Key: SPARK-25114
> URL: https://issues.apache.org/jira/browse/SPARK-25114
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.3.2, 2.4.0
>
>
> It is possible for two objects to be unequal and yet we consider them as 
> equal within RecordBinaryComparator, if the long values are separated by 
> Int.MaxValue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25095) Python support for BarrierTaskContext

2018-08-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-25095.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22085
[https://github.com/apache/spark/pull/22085]

> Python support for BarrierTaskContext
> -
>
> Key: SPARK-25095
> URL: https://issues.apache.org/jira/browse/SPARK-25095
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Major
> Fix For: 2.4.0
>
>
> Enable calling `BarrierTaskContext.barrier()` from the Python side.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25095) Python support for BarrierTaskContext

2018-08-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-25095:
-

Assignee: Jiang Xingbo

> Python support for BarrierTaskContext
> -
>
> Key: SPARK-25095
> URL: https://issues.apache.org/jira/browse/SPARK-25095
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Major
> Fix For: 2.4.0
>
>
> Enable calling `BarrierTaskContext.barrier()` from the Python side.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24564) Add test suite for RecordBinaryComparator

2018-08-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588114#comment-16588114
 ] 

Apache Spark commented on SPARK-24564:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/22079

> Add test suite for RecordBinaryComparator
> -
>
> Key: SPARK-24564
> URL: https://issues.apache.org/jira/browse/SPARK-24564
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Minor
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25114) RecordBinaryComparator may return wrong result when subtraction between two words is divisible by Integer.MAX_VALUE

2018-08-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588115#comment-16588115
 ] 

Apache Spark commented on SPARK-25114:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/22079

> RecordBinaryComparator may return wrong result when subtraction between two 
> words is divisible by Integer.MAX_VALUE
> ---
>
> Key: SPARK-25114
> URL: https://issues.apache.org/jira/browse/SPARK-25114
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>Priority: Blocker
>  Labels: correctness
>
> It is possible for two objects to be unequal and yet we consider them as 
> equal within RecordBinaryComparator, if the long values are separated by 
> Int.MaxValue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25168) PlanTest.comparePlans may make a supplied resolved plan unresolved.

2018-08-21 Thread Dilip Biswal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587956#comment-16587956
 ] 

Dilip Biswal edited comment on SPARK-25168 at 8/21/18 10:47 PM:


[~cloud_fan] OK. Let me close this then, since we are unable to do anything 
about this at present and nothing is seriously impacted. Perhaps we can 
consider making normalizedPlan private so callers are not handed unresolved 
plans?


was (Author: dkbiswal):
[~cloud_fan] OK.. Let me close this then since we are unable to do anything 
about this at present and since nothing is seriously impacted. Perhaps we can 
consider making normalizedPlan private so callers are not handled out 
unresolved plans ?

> PlanTest.comparePlans may make a supplied resolved plan unresolved.
> ---
>
> Key: SPARK-25168
> URL: https://issues.apache.org/jira/browse/SPARK-25168
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Dilip Biswal
>Priority: Minor
>
> Hello,
> I came across this behaviour while writing a unit test for one of my PRs.  We 
> use the utility class PlanTest for writing plan verification tests. 
> Currently, when we call comparePlans and the resolved input plan contains a 
> Join operator, the plan becomes unresolved after the call to 
> `normalizePlan(normalizeExprIds(plan1))`.  The reason for this is that, after 
> cleaning the expression ids, duplicateResolved returns false, making the Join 
> operator unresolved.
> {code}
>  def duplicateResolved: Boolean = 
> left.outputSet.intersect(right.outputSet).isEmpty
>   // Joins are only resolved if they don't introduce ambiguous expression ids.
>   // NaturalJoin should be ready for resolution only if everything else is 
> resolved here
>   lazy val resolvedExceptNatural: Boolean = {
> childrenResolved &&
>   expressions.forall(_.resolved) &&
>   duplicateResolved &&
>   condition.forall(_.dataType == BooleanType)
>   }
> {code}
> Please note that plan verification actually works fine. It just looked 
> awkward to compare two unresolved plans for equality.
> I am opening this ticket to discuss whether this is an okay behaviour. If it 
> is a tolerable behaviour, then we can close the ticket.
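A plain-Scala sketch of the reasoning above, with output attributes modeled as 
strings. After expression IDs are normalized, the same IDs appear on both sides 
of a self-join, the output sets intersect, and duplicateResolved becomes false, 
so the Join reports itself unresolved.

{code:scala}
// Hypothetical normalized outputs of the two sides of a self-join: after
// normalizeExprIds both sides carry identical attribute IDs.
val leftOutput  = Set("a#1", "b#2")
val rightOutput = Set("a#1", "b#2")

val duplicateResolved = leftOutput.intersect(rightOutput).isEmpty
assert(!duplicateResolved)  // the Join would be considered unresolved
{code}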



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24763) Remove redundant key data from value in streaming aggregation

2018-08-21 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588101#comment-16588101
 ] 

Jungtaek Lim commented on SPARK-24763:
--

[~tdas]

One question regarding fix versions: I guess we still haven't cut the Spark 2.4 
branch, so I expect everything on master will be included in Spark 2.4. Do I 
understand correctly? We have 4 issues with the fix version set to only 3.0.0, 
so I'm curious which version they will actually be included in.

https://issues.apache.org/jira/browse/SPARK-24763?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%203.0.0

 

> Remove redundant key data from value in streaming aggregation
> -
>
> Key: SPARK-24763
> URL: https://issues.apache.org/jira/browse/SPARK-24763
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 2.4.0, 3.0.0
>
>
> The key/value of state in streaming aggregation is formatted as below:
>  * key: UnsafeRow containing the group-by fields
>  * value: UnsafeRow containing the key fields plus the fields for the 
> aggregation results
> so the key data is stored in both the key and the value.
> This is done to avoid projecting the row down to the value while storing, and 
> joining key and value to restore the original row, in order to boost 
> performance. However, in a simple benchmark test I found it not much faster 
> than "project and join". (will paste test result in comment)
> So I would propose a new option: remove the redundant key data in stateful 
> aggregation. I'm avoiding modifying the default behavior of stateful 
> aggregation, because the state value format will not be compatible between 
> the current behavior and the option being enabled.
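A Spark-free sketch of the proposed layout, using plain tuples instead of 
UnsafeRow; the field names and types are illustrative only.

{code:scala}
case class Agg(count: Long, sum: Double)

val key = ("user-1", "2018-08-21")                   // group-by fields
val valueWithKey = (key._1, key._2, Agg(10L, 42.0))  // current layout: key repeated

// Proposed layout: store only the aggregation fields in the value...
val valueOnly = Agg(10L, 42.0)

// ...and rebuild the full row by joining key and value when reading state.
def restore(k: (String, String), v: Agg): (String, String, Agg) = (k._1, k._2, v)

assert(restore(key, valueOnly) == valueWithKey)
{code}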



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24441) Expose total estimated size of states in HDFSBackedStateStoreProvider

2018-08-21 Thread Tathagata Das (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das reassigned SPARK-24441:
-

Assignee: Jungtaek Lim

> Expose total estimated size of states in HDFSBackedStateStoreProvider
> -
>
> Key: SPARK-24441
> URL: https://issues.apache.org/jira/browse/SPARK-24441
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
>
> While Spark exposes state metrics for a single state, it still doesn't 
> expose the overall memory usage of state (loadedMaps) in 
> HDFSBackedStateStoreProvider. 
> The rationale for the patch is that state backed by 
> HDFSBackedStateStoreProvider will consume more memory than the number we can 
> get from the query status, because multiple versions of the state are cached. 
> The memory footprint can be much larger than what the query status reports in 
> situations where the state store is getting a lot of updates: shallow-copying 
> the map only adds a small amount of memory for the map entries and 
> references, and row objects are still shared across versions. But if there 
> are lots of updates between batches, fewer row objects are shared, more row 
> objects live in memory, and memory consumption is much higher than what we 
> expect.
> It would be better to expose this as well so that end users can determine the 
> actual memory usage of state.
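A rough Scala sketch of what a total estimate could look like, using Spark's 
SizeEstimator over stand-in maps; the real provider keys its cached maps by 
version and stores UnsafeRow pairs, and the metric the patch reports may be 
computed differently.

{code:scala}
import java.util.concurrent.ConcurrentHashMap
import org.apache.spark.util.SizeEstimator

// Stand-ins for two cached versions of the state map (loadedMaps).
val v1 = new ConcurrentHashMap[String, Array[Byte]]()
v1.put("key-1", Array.fill(1024)(0.toByte))
val v2 = new ConcurrentHashMap[String, Array[Byte]](v1)  // shallow copy of v1

// Naive estimate: sum the per-version sizes. Objects shared between versions
// are counted once per version here, so this over-states the footprint when
// little changes between batches.
val estimatedBytes = Seq(v1, v2).map(m => SizeEstimator.estimate(m)).sum
println(s"estimated state memory: $estimatedBytes bytes")
{code}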



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24441) Expose total estimated size of states in HDFSBackedStateStoreProvider

2018-08-21 Thread Tathagata Das (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-24441.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 21469
[https://github.com/apache/spark/pull/21469]

> Expose total estimated size of states in HDFSBackedStateStoreProvider
> -
>
> Key: SPARK-24441
> URL: https://issues.apache.org/jira/browse/SPARK-24441
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
>
> While Spark exposes state metrics for a single state, it still doesn't 
> expose the overall memory usage of state (loadedMaps) in 
> HDFSBackedStateStoreProvider. 
> The rationale for the patch is that state backed by 
> HDFSBackedStateStoreProvider will consume more memory than the number we can 
> get from the query status, because multiple versions of the state are cached. 
> The memory footprint can be much larger than what the query status reports in 
> situations where the state store is getting a lot of updates: shallow-copying 
> the map only adds a small amount of memory for the map entries and 
> references, and row objects are still shared across versions. But if there 
> are lots of updates between batches, fewer row objects are shared, more row 
> objects live in memory, and memory consumption is much higher than what we 
> expect.
> It would be better to expose this as well so that end users can determine the 
> actual memory usage of state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24763) Remove redundant key data from value in streaming aggregation

2018-08-21 Thread Tathagata Das (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das reassigned SPARK-24763:
-

Assignee: Jungtaek Lim  (was: Tathagata Das)

> Remove redundant key data from value in streaming aggregation
> -
>
> Key: SPARK-24763
> URL: https://issues.apache.org/jira/browse/SPARK-24763
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 2.4.0, 3.0.0
>
>
> The key/value of state in streaming aggregation is formatted as below:
>  * key: UnsafeRow containing the group-by fields
>  * value: UnsafeRow containing the key fields plus the fields for the 
> aggregation results
> so the key data is stored in both the key and the value.
> This is done to avoid projecting the row down to the value while storing, and 
> joining key and value to restore the original row, in order to boost 
> performance. However, in a simple benchmark test I found it not much faster 
> than "project and join". (will paste test result in comment)
> So I would propose a new option: remove the redundant key data in stateful 
> aggregation. I'm avoiding modifying the default behavior of stateful 
> aggregation, because the state value format will not be compatible between 
> the current behavior and the option being enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24763) Remove redundant key data from value in streaming aggregation

2018-08-21 Thread Tathagata Das (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das reassigned SPARK-24763:
-

Assignee: Tathagata Das

> Remove redundant key data from value in streaming aggregation
> -
>
> Key: SPARK-24763
> URL: https://issues.apache.org/jira/browse/SPARK-24763
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Assignee: Tathagata Das
>Priority: Major
> Fix For: 2.4.0, 3.0.0
>
>
> The key/value of state in streaming aggregation is formatted as below:
>  * key: UnsafeRow containing the group-by fields
>  * value: UnsafeRow containing the key fields plus the fields for the 
> aggregation results
> so the key data is stored in both the key and the value.
> This is done to avoid projecting the row down to the value while storing, and 
> joining key and value to restore the original row, in order to boost 
> performance. However, in a simple benchmark test I found it not much faster 
> than "project and join". (will paste test result in comment)
> So I would propose a new option: remove the redundant key data in stateful 
> aggregation. I'm avoiding modifying the default behavior of stateful 
> aggregation, because the state value format will not be compatible between 
> the current behavior and the option being enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24763) Remove redundant key data from value in streaming aggregation

2018-08-21 Thread Tathagata Das (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-24763.
---
   Resolution: Done
Fix Version/s: 3.0.0
   2.4.0

> Remove redundant key data from value in streaming aggregation
> -
>
> Key: SPARK-24763
> URL: https://issues.apache.org/jira/browse/SPARK-24763
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 2.4.0, 3.0.0
>
>
> The key/value of state in streaming aggregation is formatted as below:
>  * key: UnsafeRow containing the group-by fields
>  * value: UnsafeRow containing the key fields plus the fields for the 
> aggregation results
> so the key data is stored in both the key and the value.
> This is done to avoid projecting the row down to the value while storing, and 
> joining key and value to restore the original row, in order to boost 
> performance. However, in a simple benchmark test I found it not much faster 
> than "project and join". (will paste test result in comment)
> So I would propose a new option: remove the redundant key data in stateful 
> aggregation. I'm avoiding modifying the default behavior of stateful 
> aggregation, because the state value format will not be compatible between 
> the current behavior and the option being enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25149) Personalized PageRank raises an error if vertexIDs are > MaxInt

2018-08-21 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-25149.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22139
[https://github.com/apache/spark/pull/22139]

> Personalized PageRank raises an error if vertexIDs are > MaxInt
> ---
>
> Key: SPARK-25149
> URL: https://issues.apache.org/jira/browse/SPARK-25149
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.3.1
>Reporter: Bago Amirbekian
>Assignee: Bago Amirbekian
>Priority: Major
> Fix For: 2.4.0
>
>
> Looking at the implementation, I think we don't actually need this check. In 
> the current implementation, the sparse vectors used and returned by the 
> method are indexed by the _position_ of the vertexId in `sources`, not by the 
> vertex ID itself. We should remove this check and add a test to confirm the 
> implementation works with large vertex IDs.
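A small Scala sketch of the indexing scheme described above: vectors are sized 
and indexed by the position of a vertex ID within `sources`, so vertex IDs 
larger than Int.MaxValue never become array indices. The vertex IDs below are 
arbitrary.

{code:scala}
val sources: Array[Long] = Array(3000000000L, 7L, 1234567890123L)
val positionOf: Map[Long, Int] = sources.zipWithIndex.toMap

// The preference vector has length sources.length and is indexed by position.
val prefs = new Array[Double](sources.length)
prefs(positionOf(3000000000L)) = 1.0

assert(prefs.toSeq == Seq(1.0, 0.0, 0.0))
{code}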



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25149) Personalized PageRank raises an error if vertexIDs are > MaxInt

2018-08-21 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-25149:
-

Assignee: Bago Amirbekian

> Personalized PageRank raises an error if vertexIDs are > MaxInt
> ---
>
> Key: SPARK-25149
> URL: https://issues.apache.org/jira/browse/SPARK-25149
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.3.1
>Reporter: Bago Amirbekian
>Assignee: Bago Amirbekian
>Priority: Major
>
> Looking at the implementation, I think we don't actually need this check. In 
> the current implementation, the sparse vectors used and returned by the 
> method are indexed by the _position_ of the vertexId in `sources`, not by the 
> vertex ID itself. We should remove this check and add a test to confirm the 
> implementation works with large vertex IDs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25149) Personalized PageRank raises an error if vertexIDs are > MaxInt

2018-08-21 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-25149:
--
Summary: Personalized PageRank raises an error if vertexIDs are > MaxInt  
(was: Personalized Page Rank raises an error if vertexIDs are > MaxInt)

> Personalized PageRank raises an error if vertexIDs are > MaxInt
> ---
>
> Key: SPARK-25149
> URL: https://issues.apache.org/jira/browse/SPARK-25149
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.3.1
>Reporter: Bago Amirbekian
>Priority: Major
>
> Looking at the implementation, I think we don't actually need this check. In 
> the current implementation, the sparse vectors used and returned by the 
> method are indexed by the _position_ of the vertexId in `sources`, not by the 
> vertex ID itself. We should remove this check and add a test to confirm the 
> implementation works with large vertex IDs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24307) Support sending messages over 2GB from memory

2018-08-21 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24307:

Labels: release-notes  (was: releasenotes)

> Support sending messages over 2GB from memory
> -
>
> Key: SPARK-24307
> URL: https://issues.apache.org/jira/browse/SPARK-24307
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
>  Labels: release-notes
> Fix For: 2.4.0
>
>
> Spark's networking layer supports sending messages backed by a {{FileRegion}} 
> or a {{ByteBuf}}.  Sending large FileRegions works, as netty supports large 
> FileRegions.  However, {{ByteBuf}} is limited to 2GB.  This is particularly 
> a problem for sending large datasets that are already in memory, e.g. cached 
> RDD blocks.
> E.g. if you try to replicate a block stored in memory that is over 2 GB, you 
> will see an exception like:
> {noformat}
> 18/05/16 12:40:57 ERROR client.TransportClient: Failed to send RPC 
> 7420542363232096629 to xyz.com/172.31.113.213:44358: 
> io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
> readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
> writerIndex <= capacity(-1294617291))
> io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
> readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
> writerIndex <= capacity(-1294617291))
> at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:106)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
> at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:816)
> at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:723)
> at 
> io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:302)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
> at 
> io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
> at 
> io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
> at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IndexOutOfBoundsException: readerIndex: 0, writerIndex: 
> -1294617291 (expected: 0 <= readerIndex <= writerIndex <= 
> capacity(-1294617291))
> at io.netty.buffer.AbstractByteBuf.setIndex(AbstractByteBuf.java:129)
> at 
> io.netty.buffer.CompositeByteBuf.setIndex(CompositeByteBuf.java:1688)
> at io.netty.buffer.CompositeByteBuf.<init>(CompositeByteBuf.java:110)
> at io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:359)
> at 
> org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:87)
> at 
> org.apache.spark.storage.ByteBufferBlockData.toNetty(BlockManager.scala:95)
> at 
> org.apache.spark.storage.BlockManagerManagedBuffer.convertToNetty(BlockManagerManagedBuffer.scala:52)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:58)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33)
> at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88)
> ... 17 more
> {noformat}
> A simple solution to this is to create a "FileRegion" which is backed by a 
> {{ChunkedByteBuffer}} (Spark's existing data structure to support blocks > 2GB 
> in memory). 
>  A drawback to this approach is that 

[jira] [Updated] (SPARK-24307) Support sending messages over 2GB from memory

2018-08-21 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24307:

Labels: releasenotes  (was: )

> Support sending messages over 2GB from memory
> -
>
> Key: SPARK-24307
> URL: https://issues.apache.org/jira/browse/SPARK-24307
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
>  Labels: release-notes
> Fix For: 2.4.0
>
>
> Spark's networking layer supports sending messages backed by a {{FileRegion}} 
> or a {{ByteBuf}}.  Sending large FileRegions works, as netty supports large 
> FileRegions.  However, {{ByteBuf}} is limited to 2GB.  This is particularly 
> a problem for sending large datasets that are already in memory, e.g. cached 
> RDD blocks.
> E.g. if you try to replicate a block stored in memory that is over 2 GB, you 
> will see an exception like:
> {noformat}
> 18/05/16 12:40:57 ERROR client.TransportClient: Failed to send RPC 
> 7420542363232096629 to xyz.com/172.31.113.213:44358: 
> io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
> readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
> writerIndex <= capacity(-1294617291))
> io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
> readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
> writerIndex <= capacity(-1294617291))
> at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:106)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
> at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:816)
> at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:723)
> at 
> io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:302)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
> at 
> io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
> at 
> io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
> at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IndexOutOfBoundsException: readerIndex: 0, writerIndex: 
> -1294617291 (expected: 0 <= readerIndex <= writerIndex <= 
> capacity(-1294617291))
> at io.netty.buffer.AbstractByteBuf.setIndex(AbstractByteBuf.java:129)
> at 
> io.netty.buffer.CompositeByteBuf.setIndex(CompositeByteBuf.java:1688)
> at io.netty.buffer.CompositeByteBuf.<init>(CompositeByteBuf.java:110)
> at io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:359)
> at 
> org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:87)
> at 
> org.apache.spark.storage.ByteBufferBlockData.toNetty(BlockManager.scala:95)
> at 
> org.apache.spark.storage.BlockManagerManagedBuffer.convertToNetty(BlockManagerManagedBuffer.scala:52)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:58)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33)
> at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88)
> ... 17 more
> {noformat}
> A simple solution to this is to create a "FileRegion" which is backed by a 
> {{ChunkedByteBuffer}} (Spark's existing data structure to support blocks > 2GB 
> in memory). 
>  A drawback to this approach is that blocks that 

[jira] [Commented] (SPARK-25050) Handle more than two types in avro union types when writing avro files

2018-08-21 Thread DB Tsai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588068#comment-16588068
 ] 

DB Tsai commented on SPARK-25050:
-

We handle more than two types in Avro union types when reading into Spark, but 
not when writing out Avro files with a specified schema.
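For reference, a minimal Scala sketch of the kind of union in question, built 
with the Avro API; the branch types are arbitrary. As noted above, reading such 
unions into Spark is already handled; the open part is honouring a specified 
schema like this when writing.

{code:scala}
import org.apache.avro.Schema
import scala.collection.JavaConverters._

// A union with more than two branches: ["null", "int", "string"].
val union = Schema.createUnion(
  List(
    Schema.create(Schema.Type.NULL),
    Schema.create(Schema.Type.INT),
    Schema.create(Schema.Type.STRING)).asJava)

println(union.toString(true))
{code}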

> Handle more than two types in avro union types when writing avro files
> --
>
> Key: SPARK-25050
> URL: https://issues.apache.org/jira/browse/SPARK-25050
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25050) Handle more than two types in avro union types when writing avro files

2018-08-21 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-25050:

Summary: Handle more than two types in avro union types when writing avro 
files  (was: Handle more than two types in avro union types)

> Handle more than two types in avro union types when writing avro files
> --
>
> Key: SPARK-25050
> URL: https://issues.apache.org/jira/browse/SPARK-25050
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2018-08-21 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588053#comment-16588053
 ] 

Steve Loughran commented on SPARK-6305:
---

1. exclusion of log4j 1.x

you can only safely exclude it if your dependencies don't expect it on their 
CP. Otherwise: stack traces.

See HADOOP-12956 and HBASE-10092 for how these have moved/are slowly moving off 
commons logging to SLF4J for their APIs, so you can swap in log4j 2.x. Once 
they are all on SLF4J, the code-level switch should be straightforward, 
leaving only the configuration problem (#4).

2: use of low level log4j in tests.

Usually done to either look for log entries in output, or to turn back logging. 
If you can move that, then it's possible --but it will be a one-way trip

3. Spark logging use of log4j

because spark s"strings" already handle closure stuff, key bits of the log4j 
APIs in java 8 aren't so compelling, but there are nice things in the new 
runtime, which is why that move to SLF4J has been going on everywhere. It's 
just a slow job because it touches pretty much every single .java file in every 
single JAR which spark imports.

4.  "There are a few log4.properties file that would have to be converted to 
the log4j2 format."

This is actually one of the biggest barriers to switching to log4j 2. in 
production: all the log4j.properties files *everywhere* will need to move, and 
all cluster management tools which work with those files are going to need 
rework & re-releases, unless/until logj4 has that promised support for the old 
files.



> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25180) Spark standalone failure in Utils.doFetchFile() if nslookup of local hostname fails

2018-08-21 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588035#comment-16588035
 ] 

Steve Loughran commented on SPARK-25180:


FWIW, there was no in-progress data at the dest store, so either task commit 
process failed before the actual output committer's commitTask() method was 
called, or the abort operation cleaned up before that notification message 
failed to send. Either way: no in-flight data to clean up.

> Spark standalone failure in Utils.doFetchFile() if nslookup of local hostname 
> fails
> ---
>
> Key: SPARK-25180
> URL: https://issues.apache.org/jira/browse/SPARK-25180
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
> Environment: mac laptop running on a corporate guest wifi, presumably 
> a wifi with odd DNS settings.
>Reporter: Steve Loughran
>Priority: Minor
>
> Trying to save work on Spark standalone can fail if the netty RPC layer 
> cannot determine the hostname. While that's a valid failure on a real 
> cluster, in standalone mode falling back to localhost rather than the 
> inferred "hw13176.lan" value may be the better option.
> Note also that the abort* call failed with an NPE.
>  
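A minimal Scala sketch of the suggested fallback (not Spark's Utils code): if 
the local hostname cannot be resolved, fall back to the loopback address 
instead of failing the standalone job.

{code:scala}
import java.net.{InetAddress, UnknownHostException}

def localHostOrLoopback(): String =
  try {
    InetAddress.getLocalHost.getHostAddress
  } catch {
    case _: UnknownHostException => InetAddress.getLoopbackAddress.getHostAddress
  }

println(localHostOrLoopback())
{code}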



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25162) Kubernetes 'in-cluster' client mode and value of spark.driver.host

2018-08-21 Thread Yinan Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588014#comment-16588014
 ] 

Yinan Li commented on SPARK-25162:
--

We actually moved away from using the IP address of the driver pod to set 
{{spark.driver.host}}, to using a headless service to give the driver pod a 
FQDN name and set {{spark.driver.host}} to the FQDN name. Internally, we set 
{{spark.driver.bindAddress}} to the value of environment variable 
{{SPARK_DRIVER_BIND_ADDRESS}} which gets its value from the IP address of the 
pod using the downward API. We could do the same for 
{{spark.kubernetes.driver.pod.name}} as you suggested for in-cluster client 
mode.

> Kubernetes 'in-cluster' client mode and value of spark.driver.host
> --
>
> Key: SPARK-25162
> URL: https://issues.apache.org/jira/browse/SPARK-25162
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
> Environment: A java program, deployed to kubernetes, that establishes 
> a Spark Context in client mode. 
> Not using spark-submit.
> Kubernetes 1.10
> AWS EKS
>  
>  
>Reporter: James Carter
>Priority: Minor
>
> When creating Kubernetes scheduler 'in-cluster' using client mode, the value 
> for spark.driver.host can be derived from the IP address of the driver pod.
> I observed that the value of _spark.driver.host_ defaulted to the value of 
> _spark.kubernetes.driver.pod.name_, which is not a valid hostname.  This 
> caused the executors to fail to establish a connection back to the driver.
> As a workaround, in my configuration I pass the driver's pod name _and_ the 
> driver's ip address to ensure that executors can establish a connection with 
> the driver.
> _spark.kubernetes.driver.pod.name_ := env.valueFrom.fieldRef.fieldPath: 
> metadata.name
> _spark.driver.host_ := env.valueFrom.fieldRef.fieldPath: status.podIp
> e.g.
> Deployment:
> {noformat}
> env:
> - name: DRIVER_POD_NAME
>   valueFrom:
> fieldRef:
>   fieldPath: metadata.name
> - name: DRIVER_POD_IP
>   valueFrom:
> fieldRef:
>   fieldPath: status.podIP
> {noformat}
>  
> Application Properties:
> {noformat}
> config[spark.kubernetes.driver.pod.name]: ${DRIVER_POD_NAME}
> config[spark.driver.host]: ${DRIVER_POD_IP}
> {noformat}
>  
> BasicExecutorFeatureStep.scala:
> {code:java}
> private val driverUrl = RpcEndpointAddress(
> kubernetesConf.get("spark.driver.host"),
> kubernetesConf.sparkConf.getInt("spark.driver.port", DEFAULT_DRIVER_PORT),
> CoarseGrainedSchedulerBackend.ENDPOINT_NAME).toString
> {code}
>  
> Ideally only _spark.kubernetes.driver.pod.name_ would need be provided in 
> this deployment scenario.
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25180) Spark standalone failure in Utils.doFetchFile() if nslookup of local hostname fails

2018-08-21 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588012#comment-16588012
 ] 

Steve Loughran commented on SPARK-25180:


Netty converts the UnknownHostException into an IOE in the process of adding 
the hostname info, so it'd be hard for a caller to catch the exception and 
react meaningfully to it.


> Spark standalone failure in Utils.doFetchFile() if nslookup of local hostname 
> fails
> ---
>
> Key: SPARK-25180
> URL: https://issues.apache.org/jira/browse/SPARK-25180
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
> Environment: mac laptop running on a corporate guest wifi, presumably 
> a wifi with odd DNS settings.
>Reporter: Steve Loughran
>Priority: Minor
>
> Trying to save work on Spark standalone can fail if the netty RPC layer 
> cannot determine the hostname. While that's a valid failure on a real 
> cluster, in standalone mode falling back to localhost rather than the 
> inferred "hw13176.lan" value may be the better option.
> Note also that the abort* call failed with an NPE.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25149) Personalized Page Rank raises an error if vertexIDs are > MaxInt

2018-08-21 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-25149:
--
Shepherd: Joseph K. Bradley

> Personalized Page Rank raises an error if vertexIDs are > MaxInt
> 
>
> Key: SPARK-25149
> URL: https://issues.apache.org/jira/browse/SPARK-25149
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.3.1
>Reporter: Bago Amirbekian
>Priority: Major
>
> Looking at the implementation, I think we don't actually need this check. In 
> the current implementation, the sparse vectors used and returned by the 
> method are indexed by the _position_ of the vertexId in `sources`, not by the 
> vertex ID itself. We should remove this check and add a test to confirm the 
> implementation works with large vertex IDs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25180) Spark standalone failure in Utils.doFetchFile() if nslookup of local hostname fails

2018-08-21 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588005#comment-16588005
 ] 

Steve Loughran commented on SPARK-25180:


Stack
{code}
scala> text("hello all!")
res10: String =
"
  _  _   _ _   _   _
 | |__ ___  | | | |   ___   __ _  | | | | | |
 | '_ \   / _ \ | | | |  / _ \ / _` | | | | | | |
 | | | | |  __/ | | | | | (_) |   | (_| | | | | | |_|
 |_| |_|  \___| |_| |_|  \___/ \__,_| |_| |_| (_)

"

scala> val landsatCsvGZOnS3 = path("s3a://landsat-pds/scene_list.gz")
landsatCsvGZOnS3: org.apache.hadoop.fs.Path = s3a://landsat-pds/scene_list.gz

scala>

scala> val landsatCsvGZ = path("file:///tmp/scene_list.gz")
landsatCsvGZ: org.apache.hadoop.fs.Path = file:/tmp/scene_list.gz

scala> copyFile(landsatCsvGZOnS3, landsatCsvGZ, conf, false)
18/08/21 11:00:56 INFO FluentPropertyBeanIntrospector: Error when creating 
PropertyDescriptor for public final void 
org.apache.commons.configuration2.AbstractConfiguration.setProperty(java.lang.String,java.lang.Object)!
 Ignoring this property.
18/08/21 11:00:56 WARN MetricsConfig: Cannot locate configuration: tried 
hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
18/08/21 11:00:56 INFO MetricsSystemImpl: Scheduled Metric snapshot period at 
10 second(s).
18/08/21 11:00:56 INFO MetricsSystemImpl: s3a-file-system metrics system started
18/08/21 11:00:59 INFO deprecation: fs.s3a.server-side-encryption-key is 
deprecated. Instead, use fs.s3a.server-side-encryption.key
18/08/21 11:00:59 INFO ObjectStoreOperations: Copying 
s3a://landsat-pds/scene_list.gz to file:/tmp/scene_list.gz (44534 KB)
18/08/21 11:04:36 INFO ObjectStoreOperations: Copy Duration = 216.47322015 
seconds
18/08/21 11:04:36 INFO ObjectStoreOperations: Effective copy bandwidth = 
205.72521612207376 KiB/s
res11: Boolean = true

scala> val csvSchema = LandsatIO.buildCsvSchema()
csvSchema: org.apache.spark.sql.types.StructType = 
StructType(StructField(entityId,StringType,true), 
StructField(acquisitionDate,StringType,true), 
StructField(cloudCover,StringType,true), 
StructField(processingLevel,StringType,true), 
StructField(path,StringType,true), StructField(row,StringType,true), 
StructField(min_lat,StringType,true), StructField(min_lon,StringType,true), 
StructField(max_lat,StringType,true), StructField(max_lon,StringType,true), 
StructField(download_url,StringType,true))

scala> val csvDataFrame = LandsatIO.addLandsatColumns(
 |   spark.read.options(LandsatIO.CsvOptions).
 | schema(csvSchema).csv(landsatCsvGZ.toUri.toString))
18/08/21 11:34:24 INFO SharedState: Setting hive.metastore.warehouse.dir 
('null') to the value of spark.sql.warehouse.dir 
('file:/Users/stevel/Projects/sparkwork/spark/spark-warehouse').
18/08/21 11:34:24 INFO SharedState: Warehouse path is 
'file:/Users/stevel/Projects/sparkwork/spark/spark-warehouse'.
18/08/21 11:34:24 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
csvDataFrame: org.apache.spark.sql.DataFrame = [id: string, acquisitionDate: 
timestamp ... 12 more fields]

scala> val filteredCSV = csvDataFrame.
 |   sample(false, 0.01d).
 |   filter("cloudCover < 0").cache()
18/08/21 11:34:44 INFO FileSourceStrategy: Pruning directories with:
18/08/21 11:34:44 INFO FileSourceStrategy: Post-Scan Filters:
18/08/21 11:34:44 INFO FileSourceStrategy: Output Data Schema: struct
18/08/21 11:34:44 INFO FileSourceScanExec: Pushed Filters:
18/08/21 11:34:45 INFO CodeGenerator: Code generated in 239.359919 ms
18/08/21 11:34:45 INFO MemoryStore: Block broadcast_0 stored as values in 
memory (estimated size 347.6 KB, free 366.0 MB)
18/08/21 11:34:45 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in 
memory (estimated size 29.3 KB, free 365.9 MB)
18/08/21 11:34:45 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 
hw13176.lan:52731 (size: 29.3 KB, free: 366.3 MB)
18/08/21 11:34:45 INFO SparkContext: Created broadcast 0 from cache at 
:46
18/08/21 11:34:45 INFO FileSourceScanExec: Planning scan with bin packing, max 
size: 6224701 bytes, open cost is considered as scanning 4194304 bytes.
filteredCSV: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: 
string, acquisitionDate: timestamp ... 12 more fields]

scala> ireland
:43: error: not found: value ireland
   ireland
   ^

scala> val ireland = new Path("s3a://hwdev-steve-ireland-new/")
ireland: org.apache.hadoop.fs.Path = s3a://hwdev-steve-ireland-new/

scala> val orcData = new Path(ireland, "orcData")
orcData: org.apache.hadoop.fs.Path = s3a://hwdev-steve-ireland-new/orcData

scala> logDuration("write to ORC S3A") {
 |   filteredCSV.write.
 | partitionBy("year", "month").
 | mode(SaveMode.Append).format("orc").save(orcData.toString)
 | }
18/08/21 11:35:44 INFO ObjectStoreOperations: Starting write to ORC S3A
18/08/21 11:35:50 INFO 

[jira] [Commented] (SPARK-25180) Spark standalone failure in Utils.doFetchFile() if nslookup of local hostname fails

2018-08-21 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588001#comment-16588001
 ] 

Steve Loughran commented on SPARK-25180:


The code snippet was a trivial CSV => ORC conversion with both src and dest on s3a URLs:

{code}
  filteredCSV.write.
partitionBy("year", "month").
mode(SaveMode.Append).format("orc").save(orcData.toString)
{code}
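
Not a fix for what the ticket proposes, just a hedged sketch of the usual local workaround when reverse DNS misbehaves; the conf keys are real Spark settings, the values are illustrative:

{code}
// Alternatively: export SPARK_LOCAL_HOSTNAME=localhost before launching,
// which is what Utils.localHostName checks first.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("hostname-pin-sketch")
  .config("spark.driver.host", "localhost")        // name other endpoints use to reach the driver
  .config("spark.driver.bindAddress", "127.0.0.1") // address the driver's RPC env binds to
  .getOrCreate()
{code}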

> Spark standalone failure in Utils.doFetchFile() if nslookup of local hostname 
> fails
> ---
>
> Key: SPARK-25180
> URL: https://issues.apache.org/jira/browse/SPARK-25180
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
> Environment: mac laptop running on a corporate guest wifi, presumably 
> a wifi with odd DNS settings.
>Reporter: Steve Loughran
>Priority: Minor
>
> Trying to save work on Spark standalone can fail if Netty RPC cannot 
> determine the hostname. While that's a valid failure on a real cluster, in 
> standalone mode falling back to localhost rather than the inferred 
> "hw13176.lan" value may be the better option.
> Note also that the abort* call failed with an NPE.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25180) Spark standalone failure in Utils.doFetchFile() if nslookup of local hostname fails

2018-08-21 Thread Steve Loughran (JIRA)
Steve Loughran created SPARK-25180:
--

 Summary: Spark standalone failure in Utils.doFetchFile() if 
nslookup of local hostname fails
 Key: SPARK-25180
 URL: https://issues.apache.org/jira/browse/SPARK-25180
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
 Environment: mac laptop running on a corporate guest wifi, presumably 
a wifi with odd DNS settings.
Reporter: Steve Loughran


Trying to save work on Spark standalone can fail if Netty RPC cannot determine 
the hostname. While that's a valid failure on a real cluster, in standalone 
mode falling back to localhost rather than the inferred "hw13176.lan" value 
may be the better option.

Note also that the abort* call failed with an NPE.
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2018-08-21 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587962#comment-16587962
 ] 

Sean Owen commented on SPARK-6305:
--

It was a mess. Excluding dependencies is an ongoing issue because other 
dependencies change their dependencies. However, I think tools like the 
enforcer plugin can help detect this.

I think the code has to use some specific logging framework to do its work, but 
it can be shaded to keep it out of the user classpath. This could be a good 
goal for Spark 3.0. However, I am also not sure how much trouble the existing 
log4j 1.x causes users.

Yes, it does entail updating users' and tests' log configuration, which is why 
it might be something for 3.0. But sure, if there's no real downside and some 
upsides to doing it, I'm not against it.

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25168) PlanTest.comparePlans may make a supplied resolved plan unresolved.

2018-08-21 Thread Dilip Biswal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587956#comment-16587956
 ] 

Dilip Biswal commented on SPARK-25168:
--

[~cloud_fan] OK, let me close this then, since we are unable to do anything 
about this at present and nothing is seriously impacted. Perhaps we can 
consider making normalizedPlan private so callers are not handed unresolved 
plans?

> PlanTest.comparePlans may make a supplied resolved plan unresolved.
> ---
>
> Key: SPARK-25168
> URL: https://issues.apache.org/jira/browse/SPARK-25168
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Dilip Biswal
>Priority: Minor
>
> Hello,
> I came across this behaviour while writing unit tests for one of my PRs. We 
> use the utility class PlanTest for writing plan verification tests. 
> Currently, when we call comparePlans and the resolved input plan contains a 
> Join operator, the plan becomes unresolved after the call to 
> `normalizePlan(normalizeExprIds(plan1))`. The reason for this is that, after 
> cleaning the expression ids, duplicateResolved returns false, making the Join 
> operator unresolved.
> {code}
>  def duplicateResolved: Boolean = 
> left.outputSet.intersect(right.outputSet).isEmpty
>   // Joins are only resolved if they don't introduce ambiguous expression ids.
>   // NaturalJoin should be ready for resolution only if everything else is 
> resolved here
>   lazy val resolvedExceptNatural: Boolean = {
> childrenResolved &&
>   expressions.forall(_.resolved) &&
>   duplicateResolved &&
>   condition.forall(_.dataType == BooleanType)
>   }
> {code}
> Please note that plan verification actually works fine. It just looked 
> awkward to compare two unresolved plans for equality.
> I am opening this ticket to discuss whether this is acceptable behaviour. If 
> it is tolerable, then we can close the ticket.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public

2018-08-21 Thread Alexander (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587928#comment-16587928
 ] 

Alexander commented on SPARK-7768:
--

It's been a while since this had any activity. What is the difficulty level of 
getting UDTRegistration up to a level where it can become public? I noticed 
that you need to use Spark-internal types in org.apache.spark.unsafe, like 
UTF8String instead of String, but this is a minor inconvenience compared to 
the annoyance of not being able to use even things like custom enum types 
(think [Enumeratum|https://github.com/lloydmeta/enumeratum], for instance) 
inside of records.

This is probably the most misunderstood part of Spark in general. See, for 
instance, vaquarkhan's [gigantic 
post|https://github.com/vaquarkhan/vk-wiki-notes/wiki/Apache-Spark-custom-Encoder-example]
 about wrangling with this issue. Other people have written custom encoders to 
solve particular use cases, e.g. 
[here|https://typelevel.org/frameless/Injection.html] and 
[here|https://github.com/gennady-lebedev/spark-enum-encoder]. There is a 
general cacophony of frustration and cluelessness because you have to dig into 
ExpressionEncoder in order to understand why encoding fails.

Come on guys! Spark is probably the most awesome analytics engine in the world. 
Why can't we solve this problem together???
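
A minimal sketch of the stopgap many of those posts converge on while the UDT API stays private: fall back to a Kryo-backed encoder for types Catalyst cannot encode natively. MyEnum and Record below are illustrative placeholders, not types from this ticket.

{code}
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

sealed trait MyEnum
case object Red extends MyEnum
case object Blue extends MyEnum
case class Record(id: Long, colour: MyEnum)

val spark = SparkSession.builder().master("local[*]").appName("kryo-encoder-sketch").getOrCreate()

// The whole record is stored as a single opaque binary column, so columnar
// access and filter pushdown are lost -- which is exactly why a public UDT
// API would be nicer.
implicit val recordEnc: Encoder[Record] = Encoders.kryo[Record]
val ds = spark.createDataset(Seq(Record(1L, Red), Record(2L, Blue)))
ds.show()
{code}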

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25178) Use dummy name for xxxHashMapGenerator key/value schema field

2018-08-21 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587897#comment-16587897
 ] 

Kazuaki Ishizaki commented on SPARK-25178:
--

[~rednaxelafx] Thank you for opening a JIRA entry :)
[~smilegator] I can take this.

> Use dummy name for xxxHashMapGenerator key/value schema field
> -
>
> Key: SPARK-25178
> URL: https://issues.apache.org/jira/browse/SPARK-25178
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Priority: Minor
>
> Following SPARK-18952 and SPARK-22273, this ticket proposes to change the 
> generated field name of the keySchema / valueSchema to a dummy name instead 
> of using {{key.name}}.
> In previous discussion from SPARK-18952's PR [1], it was already suggested 
> that the field names were being used, so it's not worth capturing the strings 
> as reference objects here. Josh suggested merging the original fix as-is due 
> to backportability / pickability concerns. Now that we're coming up to a new 
> release, this can be revisited.
> [1]: https://github.com/apache/spark/pull/16361#issuecomment-270253719



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25179) Document the features that require Pyarrow 0.10

2018-08-21 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-25179:
---

Assignee: Bryan Cutler

> Document the features that require Pyarrow 0.10
> ---
>
> Key: SPARK-25179
> URL: https://issues.apache.org/jira/browse/SPARK-25179
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Document the features that require Pyarrow 0.10 . For 
> example, https://github.com/apache/spark/pull/20725
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
>
> binary type support requires pyarrow 0.10.0. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25179) Document the features that require Pyarrow 0.10

2018-08-21 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25179:

Description: binary type support requires pyarrow 0.10.0.   (was: binary 
type support requires pyarrow 0.10.0)

> Document the features that require Pyarrow 0.10
> ---
>
> Key: SPARK-25179
> URL: https://issues.apache.org/jira/browse/SPARK-25179
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Document the features that require Pyarrow 0.10 . For 
> example, https://github.com/apache/spark/pull/20725
>Reporter: Xiao Li
>Priority: Major
>
> binary type support requires pyarrow 0.10.0. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25179) Document the features that require Pyarrow 0.10

2018-08-21 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587884#comment-16587884
 ] 

Xiao Li commented on SPARK-25179:
-

Thanks! It is not urgent as long as it is documented before we publish the 
docs for Spark 2.4.

> Document the features that require Pyarrow 0.10
> ---
>
> Key: SPARK-25179
> URL: https://issues.apache.org/jira/browse/SPARK-25179
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Document the features that require Pyarrow 0.10 . For 
> example, https://github.com/apache/spark/pull/20725
>Reporter: Xiao Li
>Priority: Major
>
> binary type support requires pyarrow 0.10.0



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25179) Document the features that require Pyarrow 0.10

2018-08-21 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587882#comment-16587882
 ] 

Bryan Cutler commented on SPARK-25179:
--

I can work on this, probably can't get to it right away tho

> Document the features that require Pyarrow 0.10
> ---
>
> Key: SPARK-25179
> URL: https://issues.apache.org/jira/browse/SPARK-25179
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Document the features that require Pyarrow 0.10 . For 
> example, https://github.com/apache/spark/pull/20725
>Reporter: Xiao Li
>Priority: Major
>
> binary type support requires pyarrow 0.10.0



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25179) Document the features that require Pyarrow 0.10

2018-08-21 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-25179:
-
Description: binary type support requires pyarrow 0.10.0

> Document the features that require Pyarrow 0.10
> ---
>
> Key: SPARK-25179
> URL: https://issues.apache.org/jira/browse/SPARK-25179
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Document the features that require Pyarrow 0.10 . For 
> example, https://github.com/apache/spark/pull/20725
>Reporter: Xiao Li
>Priority: Major
>
> binary type support requires pyarrow 0.10.0



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22779) ConfigEntry's default value should actually be a value

2018-08-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587869#comment-16587869
 ] 

Apache Spark commented on SPARK-22779:
--

User 'GregOwen' has created a pull request for this issue:
https://github.com/apache/spark/pull/22174

> ConfigEntry's default value should actually be a value
> --
>
> Key: SPARK-22779
> URL: https://issues.apache.org/jira/browse/SPARK-22779
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Reynold Xin
>Assignee: Marcelo Vanzin
>Priority: Major
> Fix For: 2.3.0
>
>
> ConfigEntry's config value right now shows a human readable message. In some 
> places in SQL we actually rely on default value for real to be setting the 
> values. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6236) Support caching blocks larger than 2G

2018-08-21 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-6236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-6236.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

As mentioned above, this was covered elsewhere, parts of it even before Spark 
2.4. See the parent SPARK-6235 for more details.

> Support caching blocks larger than 2G
> -
>
> Key: SPARK-6236
> URL: https://issues.apache.org/jira/browse/SPARK-6236
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Priority: Major
> Fix For: 2.4.0
>
>
> Due to the use java.nio.ByteBuffer, BlockManager does not support blocks 
> larger than 2G. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24961) sort operation causes out of memory

2018-08-21 Thread Markus Breuer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587850#comment-16587850
 ] 

Markus Breuer commented on SPARK-24961:
---

Some notes on your last comment:

Reducing the number of partitions is not an option. When the number of 
partitions is reduced, the size per partition grows. A partition is the unit 
of work per task, so smaller partitions reduce a task's memory consumption. 
Spark defaults to 200 partitions, and it is a good idea to increase the number 
of partitions for very large data sets. Is this correct?

Any idea why the proposal to modify 
{{spark.sql.execution.rangeExchange.sampleSizePerPartition}} has no effect?
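
A minimal Scala sketch of the two workarounds this thread converges on (more, smaller partitions plus DISK_ONLY persistence before the global sort); paths, sizes and row counts are illustrative, not from the report:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .master("local[4]")
  .appName("sort-oom-sketch")
  .config("spark.sql.shuffle.partitions", "400")  // more, smaller partitions per sort task
  .getOrCreate()
import spark.implicits._

// Wide rows standing in for the ~2 MB payloads in the report.
val df = spark.range(0, 100000)
  .map(i => (i.toString, "x" * 1024))
  .toDF("c1", "c2")

// DISK_ONLY keeps the pre-sort rows off the heap, which is what the reporter
// observed to work; the MEMORY_AND_DISK variants still ran out of memory.
val sorted = df.persist(StorageLevel.DISK_ONLY).sort(col("c1").asc)
sorted.write.mode("overwrite").csv("/tmp/sort-oom-sketch")
{code}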

> sort operation causes out of memory 
> 
>
> Key: SPARK-24961
> URL: https://issues.apache.org/jira/browse/SPARK-24961
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.3.1
> Environment: Java 1.8u144+
> Windows 10
> Spark 2.3.1 in local mode
> -Xms4g -Xmx4g
> optional: -XX:+UseParallelOldGC 
>Reporter: Markus Breuer
>Priority: Major
>
> A sort operation on a large RDD - one which does not fit in memory - causes 
> an out of memory exception. I made the effect reproducible with a sample that 
> creates large objects of about 2 MB each; the OOM occurs when saving the 
> result. I tried several StorageLevels, but whenever memory is involved 
> (MEMORY_AND_DISK, MEMORY_AND_DISK_SER, none) the application runs out of 
> memory. Only DISK_ONLY seems to work.
> When replacing sort() with sortWithinPartitions(), no StorageLevel is 
> required and the application succeeds.
> {code:java}
> package de.bytefusion.examples;
> import breeze.storage.Storage;
> import de.bytefusion.utils.Options;
> import org.apache.hadoop.io.MapFile;
> import org.apache.hadoop.io.SequenceFile;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapred.SequenceFileOutputFormat;
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.spark.api.java.JavaSparkContext;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.RowFactory;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructType;
> import org.apache.spark.storage.StorageLevel;
> import scala.Tuple2;
> import static org.apache.spark.sql.functions.*;
> import java.util.ArrayList;
> import java.util.List;
> import java.util.UUID;
> import java.util.stream.Collectors;
> import java.util.stream.IntStream;
> public class Example3 {
> public static void main(String... args) {
> // create spark session
> SparkSession spark = SparkSession.builder()
> .appName("example1")
> .master("local[4]")
> .config("spark.driver.maxResultSize","1g")
> .config("spark.driver.memory","512m")
> .config("spark.executor.memory","512m")
> .config("spark.local.dir","d:/temp/spark-tmp")
> .getOrCreate();
> JavaSparkContext sc = 
> JavaSparkContext.fromSparkContext(spark.sparkContext());
> // base to generate huge data
> List<Integer> list = new ArrayList<>();
> for (int val = 1; val < 1; val++) {
> int valueOf = Integer.valueOf(val);
> list.add(valueOf);
> }
> // create simple rdd of int
> JavaRDD<Integer> rdd = sc.parallelize(list, 200);
> // use map to create large object per row
> JavaRDD<Row> rowRDD =
> rdd
> .map(value -> 
> RowFactory.create(String.valueOf(value), 
> createLongText(UUID.randomUUID().toString(), 2 * 1024 * 1024)))
> // no persist => out of memory exception on write()
> // persist MEMORY_AND_DISK => out of memory exception 
> on write()
> // persist MEMORY_AND_DISK_SER => out of memory 
> exception on write()
> // persist(StorageLevel.DISK_ONLY())
> ;
> StructType type = new StructType();
> type = type
> .add("c1", DataTypes.StringType)
> .add( "c2", DataTypes.StringType );
> Dataset<Row> df = spark.createDataFrame(rowRDD, type);
> // works
> df.show();
> df = df
> .sort(col("c1").asc() )
> ;
> df.explain();
> // takes a lot of time but works
> df.show();
> // OutOfMemoryError: java heap space
> df
> .write()
> .mode("overwrite")
> .csv("d:/temp/my.csv");
> // OutOfMemoryError: java heap space
> df
> .toJavaRDD()
> .mapToPair(row -> new Tuple2(new Text(row.getString(0)), new 
> Text( 

[jira] [Commented] (SPARK-25179) Document the features that require Pyarrow 0.10

2018-08-21 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587847#comment-16587847
 ] 

Xiao Li commented on SPARK-25179:
-

cc [~bryanc]

> Document the features that require Pyarrow 0.10
> ---
>
> Key: SPARK-25179
> URL: https://issues.apache.org/jira/browse/SPARK-25179
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Document the features that require Pyarrow 0.10 . For 
> example, https://github.com/apache/spark/pull/20725
>Reporter: Xiao Li
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25179) Document the features that require Pyarrow 0.10

2018-08-21 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25179:

Issue Type: Documentation  (was: Improvement)

> Document the features that require Pyarrow 0.10
> ---
>
> Key: SPARK-25179
> URL: https://issues.apache.org/jira/browse/SPARK-25179
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Document the features that require Pyarrow 0.10 . For 
> example, https://github.com/apache/spark/pull/20725
>Reporter: Xiao Li
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25179) Document the features that require Pyarrow 0.10

2018-08-21 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25179:
---

 Summary: Document the features that require Pyarrow 0.10
 Key: SPARK-25179
 URL: https://issues.apache.org/jira/browse/SPARK-25179
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.4.0
 Environment: Document the features that require Pyarrow 0.10 . For 
example, https://github.com/apache/spark/pull/20725
Reporter: Xiao Li






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25178) Use dummy name for xxxHashMapGenerator key/value schema field

2018-08-21 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587841#comment-16587841
 ] 

Xiao Li commented on SPARK-25178:
-

cc [~kiszk] Do you have bandwidth to take this?

> Use dummy name for xxxHashMapGenerator key/value schema field
> -
>
> Key: SPARK-25178
> URL: https://issues.apache.org/jira/browse/SPARK-25178
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Priority: Minor
>
> Following SPARK-18952 and SPARK-22273, this ticket proposes to change the 
> generated field name of the keySchema / valueSchema to a dummy name instead 
> of using {{key.name}}.
> In previous discussion from SPARK-18952's PR [1], it was already suggested 
> that the field names were being used, so it's not worth capturing the strings 
> as reference objects here. Josh suggested merging the original fix as-is due 
> to backportability / pickability concerns. Now that we're coming up to a new 
> release, this can be revisited.
> [1]: https://github.com/apache/spark/pull/16361#issuecomment-270253719



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24296) Support replicating blocks larger than 2 GB

2018-08-21 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-24296.

   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21451
[https://github.com/apache/spark/pull/21451]

> Support replicating blocks larger than 2 GB
> ---
>
> Key: SPARK-24296
> URL: https://issues.apache.org/jira/browse/SPARK-24296
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 2.4.0
>
>
> Replicating a block sends the entire block data in one frame. This results in 
> a failure on the receiving end for blocks larger than 2GB.
> We should change block replication to send the block data as a stream when 
> the block is large (building on the network changes in SPARK-6237). This can 
> use the conf spark.maxRemoteBlockSizeFetchToMem to decide when to replicate 
> as a stream, the same as we do for fetching shuffle blocks and fetching 
> remote RDD blocks.
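
A hedged illustration of the conf mentioned above; the threshold value is only an example:

{code}
import org.apache.spark.SparkConf

// Blocks larger than this threshold are fetched (and, with this change,
// replicated) to disk as a stream instead of being materialized in memory.
val conf = new SparkConf()
  .setAppName("stream-large-blocks-sketch")
  .set("spark.maxRemoteBlockSizeFetchToMem", "200m")
{code}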



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24296) Support replicating blocks larger than 2 GB

2018-08-21 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-24296:
--

Assignee: Imran Rashid

> Support replicating blocks larger than 2 GB
> ---
>
> Key: SPARK-24296
> URL: https://issues.apache.org/jira/browse/SPARK-24296
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
>
> Replicating a block sends the entire block data in one frame. This results in 
> a failure on the receiving end for blocks larger than 2GB.
> We should change block replication to send the block data as a stream when 
> the block is large (building on the network changes in SPARK-6237). This can 
> use the conf spark.maxRemoteBlockSizeFetchToMem to decide when to replicate 
> as a stream, the same as we do for fetching shuffle blocks and fetching 
> remote RDD blocks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25178) Use dummy name for xxxHashMapGenerator key/value schema field

2018-08-21 Thread Kris Mok (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587820#comment-16587820
 ] 

Kris Mok commented on SPARK-25178:
--

FYI: I've just come across this behavior and raised a ticket so that we don't 
forget about it, but I'm not working on this one (yet).

> Use dummy name for xxxHashMapGenerator key/value schema field
> -
>
> Key: SPARK-25178
> URL: https://issues.apache.org/jira/browse/SPARK-25178
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Priority: Minor
>
> Following SPARK-18952 and SPARK-22273, this ticket proposes to change the 
> generated field name of the keySchema / valueSchema to a dummy name instead 
> of using {{key.name}}.
> In previous discussion from SPARK-18952's PR [1], it was already suggested 
> that the field names were being used, so it's not worth capturing the strings 
> as reference objects here. Josh suggested merging the original fix as-is due 
> to backportability / pickability concerns. Now that we're coming up to a new 
> release, this can be revisited.
> [1]: https://github.com/apache/spark/pull/16361#issuecomment-270253719



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25178) Use dummy name for xxxHashMapGenerator key/value schema field

2018-08-21 Thread Kris Mok (JIRA)
Kris Mok created SPARK-25178:


 Summary: Use dummy name for xxxHashMapGenerator key/value schema 
field
 Key: SPARK-25178
 URL: https://issues.apache.org/jira/browse/SPARK-25178
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Kris Mok


Following SPARK-18952 and SPARK-22273, this ticket proposes to change the 
generated field name of the keySchema / valueSchema to a dummy name instead of 
using {{key.name}}.

In previous discussion from SPARK-18952's PR [1], it was already suggested that 
the field names were being used, so it's not worth capturing the strings as 
reference objects here. Josh suggested merging the original fix as-is due to 
backportability / pickability concerns. Now that we're coming up to a new 
release, this can be revisited.

[1]: https://github.com/apache/spark/pull/16361#issuecomment-270253719



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2018-08-21 Thread Chris Martin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587814#comment-16587814
 ] 

Chris Martin commented on SPARK-6305:
-

[~srowen]

I've taken a look at this and I'd like to sync up to see your perspective 
(given that you already looked at this).  From what I can see:

1) As you say there's a load of pom manipulation to do in order to exclude 
log4j1 and related components from dependencies. 

2) There are a bunch of unit tests that rely on hooking into log4j internals in 
order to capture log output and make assertions against it (a sketch of one way 
to do this with log4j 2 follows this comment).

3) The org.apache.spark.internal.Logging trait has some fairly low level log4j 
logic.

4) There are a few log4j.properties files that would have to be converted to 
the log4j2 format.

Of these, 1) is a pain but fairly mechanical, and I would hope we could write 
some sort of automated check to tell us if log4j1 is still being brought in. 
2) is fairly easy to solve; I have some code to do this. 3) worries me, as 
this class is doing some fairly hairy stuff and I'm not sure of the use cases; 
it would be good to have a chat about this. 4) is simple enough as far as 
Spark goes, but the fly in the ointment is that all existing user log 
configuration would need to be changed.

thoughts?

 

Chris
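
Re point 2), a rough sketch of the kind of in-memory appender tests could register instead of hooking log4j 1.x internals. It assumes log4j 2.x core is on the classpath; the class and logger names are illustrative:

{code}
import org.apache.logging.log4j.LogManager
import org.apache.logging.log4j.core.{LogEvent, LoggerContext}
import org.apache.logging.log4j.core.appender.AbstractAppender
import scala.collection.mutable.ArrayBuffer

// Collects formatted log messages so tests can assert on them.
class CapturingAppender extends AbstractAppender("capture", null, null, true) {
  val messages = ArrayBuffer.empty[String]
  override def append(event: LogEvent): Unit =
    messages += event.getMessage.getFormattedMessage
}

val appender = new CapturingAppender
appender.start()

// Attach the appender to the root logger of the current log4j 2 context.
val ctx = LogManager.getContext(false).asInstanceOf[LoggerContext]
ctx.getConfiguration.getRootLogger.addAppender(appender, null, null)
ctx.updateLoggers()

LogManager.getLogger("test").error("boom")
assert(appender.messages.exists(_.contains("boom")))
{code}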

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24335) Dataset.map schema not applied in some cases

2018-08-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587785#comment-16587785
 ] 

Apache Spark commented on SPARK-24335:
--

User 'redsanket' has created a pull request for this issue:
https://github.com/apache/spark/pull/22173

> Dataset.map schema not applied in some cases
> 
>
> Key: SPARK-24335
> URL: https://issues.apache.org/jira/browse/SPARK-24335
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.3.0
>Reporter: Robert Reid
>Priority: Major
>
> In the following code an {color:#808080}UnsupportedOperationException{color} 
> is thrown in the filter() call just after the Dataset.map() call unless 
> withWatermark() is added between them. The error reports 
> `{color:#808080}fieldIndex on a Row without schema is undefined{color}`.  I 
> expect the map() method to have applied the schema and for it to be 
> accessible in filter().  Without the extra withWatermark() call my debugger 
> reports that the `row` objects in the filter lambda are `GenericRow`.  With 
> the watermark call it reports that they are `GenericRowWithSchema`.
> I should add that I'm new to working with Structured Streaming.  So if I'm 
> overlooking some implied dependency please fill me in.
> I'm encountering this in new code for a new production job. The presented 
> code is distilled down to demonstrate the problem.  While the problem can be 
> worked around simply by adding withWatermark() I'm concerned that this will 
> leave the code in a fragile state.  With this simplified code if this error 
> occurs again it will be easy to identify what change led to the error.  But 
> in the code I'm writing, with this functionality delegated to other classes, 
> it is (and has been) very challenging to identify the cause.
>  
> {code:java}
> public static void main(String[] args) {
> SparkSession sparkSession = 
> SparkSession.builder().master("local").getOrCreate();
> sparkSession.conf().set(
> "spark.sql.streaming.checkpointLocation",
> "hdfs://localhost:9000/search_relevance/checkpoint" // for spark 
> 2.3
> // "spark.sql.streaming.checkpointLocation", "tmp/checkpoint" // 
> for spark 2.1
> );
> StructType inSchema = DataTypes.createStructType(
> new StructField[] {
> DataTypes.createStructField("id", DataTypes.StringType
>   , false),
> DataTypes.createStructField("ts", DataTypes.TimestampType 
>   , false),
> DataTypes.createStructField("f1", DataTypes.LongType  
>   , true)
> }
> );
> Dataset<Row> rawSet = sparkSession.sqlContext().readStream()
> .format("rate")
> .option("rowsPerSecond", 1)
> .load()
> .map(   (MapFunction<Row, Row>) raw -> {
> Object[] fields = new Object[3];
> fields[0] = "id1";
> fields[1] = raw.getAs("timestamp");
> fields[2] = raw.getAs("value");
> return RowFactory.create(fields);
> },
> RowEncoder.apply(inSchema)
> )
> // If withWatermark() is included above the filter() line then 
> this works.  Without it we get:
> //Caused by: java.lang.UnsupportedOperationException: 
> fieldIndex on a Row without schema is undefined.
> // at the row.getAs() call.
> // .withWatermark("ts", "10 seconds")  // <-- This is required 
> for row.getAs("f1") to work ???
> .filter((FilterFunction<Row>) row -> !row.getAs("f1").equals(0L))
> .withWatermark("ts", "10 seconds")
> ;
> StreamingQuery streamingQuery = rawSet
> .select("*")
> .writeStream()
> .format("console")
> .outputMode("append")
> .start();
> try {
> streamingQuery.awaitTermination(30_000);
> } catch (StreamingQueryException e) {
> System.out.println("Caught exception at 'awaitTermination':");
> e.printStackTrace();
> }
> }{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-08-21 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587770#comment-16587770
 ] 

Bryan Cutler commented on SPARK-23874:
--

Yes, I still would recommend users upgrade pyarrow if possible. There were a 
large number of fixes and improvements made since 0.8.0.

> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.4.0
>
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support ARROW-2141
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported in as input to pyarrow ARROW-2141
>  * Java has common interface for reset to cleanup complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25173) fail to pass kerberos authentification on executers

2018-08-21 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-25173.

Resolution: Duplicate

> fail to pass kerberos authentification on executers
> ---
>
> Key: SPARK-25173
> URL: https://issues.apache.org/jira/browse/SPARK-25173
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.0
>Reporter: sun zhiwei
>Priority: Major
>
> Hi All,
>    For some reasons, I must use JDBC to connect to HiveServer2 on 
> executors, but when I test my program, it produces the following log:
>   javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
>     at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
>     at 
> org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
>     at 
> org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
>     at 
> org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
>     at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
>     at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
>     at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
>     at 
> org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:203)
>     at org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:168)
>     at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
>     at java.sql.DriverManager.getConnection(DriverManager.java:571)
>     at java.sql.DriverManager.getConnection(DriverManager.java:187)
>     at 
> com.hypers.bigdata.util.SparkHandler$.handleTable2(SparkHandler.scala:204)
>     at 
> com.hypers.bigdata.spark.AutoSpark$$anonfun$setupSsc$1$$anonfun$apply$2.apply(AutoSpark.scala:196)
>     at 
> com.hypers.bigdata.spark.AutoSpark$$anonfun$setupSsc$1$$anonfun$apply$2.apply(AutoSpark.scala:107)
>     at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>     at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>     at 
> com.hypers.bigdata.spark.AutoSpark$$anonfun$setupSsc$1.apply(AutoSpark.scala:107)
>     at 
> com.hypers.bigdata.spark.AutoSpark$$anonfun$setupSsc$1.apply(AutoSpark.scala:106)
>     at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
>     at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
>     at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1888)
>     at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1888)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>     at org.apache.spark.scheduler.Task.run(Task.scala:89)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos tgt)
>     at 
> sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
>     at 
> sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:121)
>     at 
> sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
>     at 
> sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:223)
>     at 
> sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
>     at 
> sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
>     at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:193)
>     ... 31 more
>  
>  I think it might be because the executors cannot pass Kerberos 
> authentication, so here's my question:
>  how could I make the executors pass Kerberos authentication?
>  Any suggestions would be appreciated, thanks.
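
One common pattern, as a sketch only (not from this ticket): ship a keytab to the executors, e.g. via spark-submit --files, and log in explicitly inside foreachPartition before opening the JDBC connection. The principal, keytab name and JDBC URL below are illustrative, and the Hive JDBC driver is assumed to be on the executor classpath.

{code}
import java.security.PrivilegedExceptionAction
import java.sql.DriverManager
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kerberos-jdbc-sketch").getOrCreate()
val rdd = spark.sparkContext.parallelize(1 to 10, 2)

rdd.foreachPartition { part =>
  // The keytab is assumed to sit in the executor's working directory
  // because it was distributed with: spark-submit --files user.keytab
  val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
    "user@EXAMPLE.COM", "user.keytab")
  ugi.doAs(new PrivilegedExceptionAction[Unit] {
    override def run(): Unit = {
      val conn = DriverManager.getConnection(
        "jdbc:hive2://hs2-host:10000/default;principal=hive/_HOST@EXAMPLE.COM")
      try part.foreach(_ => ())   // ... issue statements over conn here ...
      finally conn.close()
    }
  })
}
{code}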



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, 

[jira] [Resolved] (SPARK-25172) kererbos issue when use jdbc to connect hive server2 on executors

2018-08-21 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-25172.

Resolution: Invalid

Please use the mailing list for questions:
http://spark.apache.org/community.html

> kererbos issue when use jdbc to connect hive server2 on executors
> -
>
> Key: SPARK-25172
> URL: https://issues.apache.org/jira/browse/SPARK-25172
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.0
>Reporter: sun zhiwei
>Priority: Major
>
> Hi All,
>    For some reasons, I must use JDBC to connect to HiveServer2 on 
> executors, but when I test my program, it produces the following logs:
>   javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
>     at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
>     at 
> org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
>     at 
> org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
>     at 
> org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
>     at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
>     at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
>     at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
>     at 
> org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:203)
>     at org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:168)
>     at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
>     at java.sql.DriverManager.getConnection(DriverManager.java:571)
>     at java.sql.DriverManager.getConnection(DriverManager.java:187)
>     at 
> com.hypers.bigdata.util.SparkHandler$.handleTable2(SparkHandler.scala:204)
>     at 
> com.hypers.bigdata.spark.AutoSpark$$anonfun$setupSsc$1$$anonfun$apply$2.apply(AutoSpark.scala:196)
>     at 
> com.hypers.bigdata.spark.AutoSpark$$anonfun$setupSsc$1$$anonfun$apply$2.apply(AutoSpark.scala:107)
>     at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>     at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>     at 
> com.hypers.bigdata.spark.AutoSpark$$anonfun$setupSsc$1.apply(AutoSpark.scala:107)
>     at 
> com.hypers.bigdata.spark.AutoSpark$$anonfun$setupSsc$1.apply(AutoSpark.scala:106)
>     at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
>     at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
>     at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1888)
>     at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1888)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>     at org.apache.spark.scheduler.Task.run(Task.scala:89)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos tgt)
>     at 
> sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
>     at 
> sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:121)
>     at 
> sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
>     at 
> sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:223)
>     at 
> sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
>     at 
> sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
>     at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:193)
>     ... 31 more
>  
>  I think it might be because the executors cannot pass Kerberos 
> authentication, so here's my question:
>  how could I make the executors pass Kerberos authentication?
>  Any suggestions would be appreciated, thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-08-21 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587702#comment-16587702
 ] 

Xiao Li commented on SPARK-23874:
-

Great! Then, users can keep using pyarrow 0.8 and get the fixes we need in 
Spark; otherwise, they have to upgrade numpy. 

> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.4.0
>
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support ARROW-2141
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported in as input to pyarrow ARROW-2141
>  * Java has common interface for reset to cleanup complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25161) Fix several bugs in failure handling of barrier execution mode

2018-08-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-25161:
-

Assignee: Jiang Xingbo

> Fix several bugs in failure handling of barrier execution mode
> --
>
> Key: SPARK-25161
> URL: https://issues.apache.org/jira/browse/SPARK-25161
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Major
> Fix For: 2.4.0
>
>
> Fix several bugs in failure handling of barrier execution mode:
> * Mark TaskSet for a barrier stage as zombie when a task attempt fails;
> * Multiple barrier task failures from a single barrier stage should not 
> trigger multiple stage retries;
> * Barrier task failure from a previous failed stage attempt should not 
> trigger stage retry;
> * Fail the job when a task from a barrier ResultStage failed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25161) Fix several bugs in failure handling of barrier execution mode

2018-08-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-25161.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22158
[https://github.com/apache/spark/pull/22158]

> Fix several bugs in failure handling of barrier execution mode
> --
>
> Key: SPARK-25161
> URL: https://issues.apache.org/jira/browse/SPARK-25161
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Major
> Fix For: 2.4.0
>
>
> Fix several bugs in failure handling of barrier execution mode:
> * Mark TaskSet for a barrier stage as zombie when a task attempt fails;
> * Multiple barrier task failures from a single barrier stage should not 
> trigger multiple stage retries;
> * Barrier task failure from a previous failed stage attempt should not 
> trigger stage retry;
> * Fail the job when a task from a barrier ResultStage failed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25177) When dataframe decimal type column having scale higher than 6, 0 values are shown in scientific notation

2018-08-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25177:


Assignee: Apache Spark

> When dataframe decimal type column having scale higher than 6, 0 values are 
> shown in scientific notation
> 
>
> Key: SPARK-25177
> URL: https://issues.apache.org/jira/browse/SPARK-25177
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Vinod KC
>Assignee: Apache Spark
>Priority: Minor
>
> If the scale of a decimal type is > 6, a 0 value will be shown in scientific 
> notation; hence, when the dataframe output is saved to an external database, 
> the save fails due to the scientific notation on "0" values.
> Eg: In Spark
>  --
>  spark.sql("create table test (a decimal(10,7), b decimal(10,6), c 
> decimal(10,8))")
>  spark.sql("insert into test values(0, 0,0)")
>  spark.sql("insert into test values(1, 1, 1)")
>  spark.table("test").show()
> |         a     |           b |               c |
> |       0E-7 |0.00|         0E-8 |//If scale > 6, zero is displayed in 
> scientific notation|
> |1.000|1.00|1.|
>  
>  Eg: In Postgres
>  --
>  CREATE TABLE Testdec (a DECIMAL(10,7), b DECIMAL(10,6), c DECIMAL(10,8));
>  INSERT INTO Testdec VALUES (0,0,0);
>  INSERT INTO Testdec VALUES (1,1,1);
>  select * from Testdec;
>  Result:
>            a |           b |        c
>  ---++---
>  0.000 | 0.00 | 0.
>  1.000 | 1.00 | 1.
> We can make the Spark SQL result consistent with other databases like PostgreSQL.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25177) When dataframe decimal type column having scale higher than 6, 0 values are shown in scientific notation

2018-08-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25177:


Assignee: (was: Apache Spark)

> When dataframe decimal type column having scale higher than 6, 0 values are 
> shown in scientific notation
> 
>
> Key: SPARK-25177
> URL: https://issues.apache.org/jira/browse/SPARK-25177
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Vinod KC
>Priority: Minor
>
> If the scale of a decimal type is > 6, a 0 value is shown in scientific 
> notation; as a result, when the dataframe output is saved to an external 
> database, the save fails because of the scientific notation on "0" values.
> Eg: In Spark
>  --
>  spark.sql("create table test (a decimal(10,7), b decimal(10,6), c decimal(10,8))")
>  spark.sql("insert into test values(0, 0, 0)")
>  spark.sql("insert into test values(1, 1, 1)")
>  spark.table("test").show()
>  +---------+--------+----------+
>  |        a|       b|         c|
>  +---------+--------+----------+
>  |     0E-7|0.000000|      0E-8|   // If scale > 6, zero is displayed in scientific notation
>  |1.0000000|1.000000|1.00000000|
>  +---------+--------+----------+
>  Eg: In PostgreSQL
>  --
>  CREATE TABLE Testdec (a DECIMAL(10,7), b DECIMAL(10,6), c DECIMAL(10,8));
>  INSERT INTO Testdec VALUES (0,0,0);
>  INSERT INTO Testdec VALUES (1,1,1);
>  select * from Testdec;
>  Result:
>           a |        b |          c
>  -----------+----------+------------
>   0.0000000 | 0.000000 | 0.00000000
>   1.0000000 | 1.000000 | 1.00000000
> We can make the Spark SQL result consistent with other databases like PostgreSQL.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25177) When dataframe decimal type column having scale higher than 6, 0 values are shown in scientific notation

2018-08-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587578#comment-16587578
 ] 

Apache Spark commented on SPARK-25177:
--

User 'vinodkc' has created a pull request for this issue:
https://github.com/apache/spark/pull/22171

> When dataframe decimal type column having scale higher than 6, 0 values are 
> shown in scientific notation
> 
>
> Key: SPARK-25177
> URL: https://issues.apache.org/jira/browse/SPARK-25177
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Vinod KC
>Priority: Minor
>
> If the scale of a decimal type is > 6, a 0 value is shown in scientific 
> notation; as a result, when the dataframe output is saved to an external 
> database, the save fails because of the scientific notation on "0" values.
> Eg: In Spark
>  --
>  spark.sql("create table test (a decimal(10,7), b decimal(10,6), c decimal(10,8))")
>  spark.sql("insert into test values(0, 0, 0)")
>  spark.sql("insert into test values(1, 1, 1)")
>  spark.table("test").show()
>  +---------+--------+----------+
>  |        a|       b|         c|
>  +---------+--------+----------+
>  |     0E-7|0.000000|      0E-8|   // If scale > 6, zero is displayed in scientific notation
>  |1.0000000|1.000000|1.00000000|
>  +---------+--------+----------+
>  Eg: In PostgreSQL
>  --
>  CREATE TABLE Testdec (a DECIMAL(10,7), b DECIMAL(10,6), c DECIMAL(10,8));
>  INSERT INTO Testdec VALUES (0,0,0);
>  INSERT INTO Testdec VALUES (1,1,1);
>  select * from Testdec;
>  Result:
>           a |        b |          c
>  -----------+----------+------------
>   0.0000000 | 0.000000 | 0.00000000
>   1.0000000 | 1.000000 | 1.00000000
> We can make the Spark SQL result consistent with other databases like PostgreSQL.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25177) When dataframe decimal type column having scale higher than 6, 0 values are shown in scientific notation

2018-08-21 Thread Vinod KC (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod KC updated SPARK-25177:
-
Description: 
If the scale of a decimal type is > 6, a 0 value is shown in scientific notation; 
as a result, when the dataframe output is saved to an external database, the save 
fails because of the scientific notation on "0" values.

Eg: In Spark
 --
 spark.sql("create table test (a decimal(10,7), b decimal(10,6), c decimal(10,8))")
 spark.sql("insert into test values(0, 0, 0)")
 spark.sql("insert into test values(1, 1, 1)")
 spark.table("test").show()
 +---------+--------+----------+
 |        a|       b|         c|
 +---------+--------+----------+
 |     0E-7|0.000000|      0E-8|   // If scale > 6, zero is displayed in scientific notation
 |1.0000000|1.000000|1.00000000|
 +---------+--------+----------+

 Eg: In PostgreSQL
 --
 CREATE TABLE Testdec (a DECIMAL(10,7), b DECIMAL(10,6), c DECIMAL(10,8));
 INSERT INTO Testdec VALUES (0,0,0);
 INSERT INTO Testdec VALUES (1,1,1);
 select * from Testdec;
 Result:
          a |        b |          c
 -----------+----------+------------
  0.0000000 | 0.000000 | 0.00000000
  1.0000000 | 1.000000 | 1.00000000

We can make the Spark SQL result consistent with other databases like PostgreSQL.

 

  was:
If scale of decimal type is > 6 , 0 value will be shown in scientific notation 
and hence, when the dataframe output is saved to external database, it fails 
due to scientific notation on "0" values.

Eg: In Spark
 --
 spark.sql("create table test (a decimal(10,7), b decimal(10,6), c 
decimal(10,8))")
 spark.sql("insert into test values(0, 0,0)")
 spark.sql("insert into test values(1, 1, 1)")
 spark.table("test").show()
|         a     |           b |               c |
|       0E-7 |0.00|         0E-8 |//If scale > 6, zero is displayed in 
scientific notation|
|1.000|1.00|1.|

 

 Eg: In Postgress
 --
 CREATE TABLE Testdec (a DECIMAL(10,7), b DECIMAL(10,6), c DECIMAL(10,8));
 INSERT INTO Testdec VALUES (0,0,0);
 INSERT INTO Testdec VALUES (1,1,1);
 select * from Testdec;
 Result:
           a | .          b |        c
 ---+--+-
 0.000 | 0.00 | 0.
 1.000 | 1.00 | 1.

We can make spark SQL result consistent with other Databases like Postgresql

 


> When dataframe decimal type column having scale higher than 6, 0 values are 
> shown in scientific notation
> 
>
> Key: SPARK-25177
> URL: https://issues.apache.org/jira/browse/SPARK-25177
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Vinod KC
>Priority: Minor
>
> If the scale of a decimal type is > 6, a 0 value is shown in scientific 
> notation; as a result, when the dataframe output is saved to an external 
> database, the save fails because of the scientific notation on "0" values.
> Eg: In Spark
>  --
>  spark.sql("create table test (a decimal(10,7), b decimal(10,6), c decimal(10,8))")
>  spark.sql("insert into test values(0, 0, 0)")
>  spark.sql("insert into test values(1, 1, 1)")
>  spark.table("test").show()
>  +---------+--------+----------+
>  |        a|       b|         c|
>  +---------+--------+----------+
>  |     0E-7|0.000000|      0E-8|   // If scale > 6, zero is displayed in scientific notation
>  |1.0000000|1.000000|1.00000000|
>  +---------+--------+----------+
>  Eg: In PostgreSQL
>  --
>  CREATE TABLE Testdec (a DECIMAL(10,7), b DECIMAL(10,6), c DECIMAL(10,8));
>  INSERT INTO Testdec VALUES (0,0,0);
>  INSERT INTO Testdec VALUES (1,1,1);
>  select * from Testdec;
>  Result:
>           a |        b |          c
>  -----------+----------+------------
>   0.0000000 | 0.000000 | 0.00000000
>   1.0000000 | 1.000000 | 1.00000000
> We can make the Spark SQL result consistent with other databases like PostgreSQL.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25177) When dataframe decimal type column having scale higher than 6, 0 values are shown in scientific notation

2018-08-21 Thread Vinod KC (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod KC updated SPARK-25177:
-
Description: 
If the scale of a decimal type is > 6, a 0 value is shown in scientific notation; 
as a result, when the dataframe output is saved to an external database, the save 
fails because of the scientific notation on "0" values.

Eg: In Spark
 --
 spark.sql("create table test (a decimal(10,7), b decimal(10,6), c decimal(10,8))")
 spark.sql("insert into test values(0, 0, 0)")
 spark.sql("insert into test values(1, 1, 1)")
 spark.table("test").show()
 +---------+--------+----------+
 |        a|       b|         c|
 +---------+--------+----------+
 |     0E-7|0.000000|      0E-8|   // If scale > 6, zero is displayed in scientific notation
 |1.0000000|1.000000|1.00000000|
 +---------+--------+----------+

 Eg: In PostgreSQL
 --
 CREATE TABLE Testdec (a DECIMAL(10,7), b DECIMAL(10,6), c DECIMAL(10,8));
 INSERT INTO Testdec VALUES (0,0,0);
 INSERT INTO Testdec VALUES (1,1,1);
 select * from Testdec;
 Result:
          a |        b |          c
 -----------+----------+------------
  0.0000000 | 0.000000 | 0.00000000
  1.0000000 | 1.000000 | 1.00000000

We can make the Spark SQL result consistent with other databases like PostgreSQL.

 

  was:
If scale of decimal type is > 6 , 0 value will be shown in scientific notation 
and hence, when the dataframe output is saved to external database, it fails 
due to scientific notation on "0" values.

Eg: In Spark
 --
spark.sql("create table test (a decimal(10,7), b decimal(10,6), c 
decimal(10,8))")
spark.sql("insert into test values(0, 0,0)")
spark.sql("insert into test values(1, 1, 1)")
spark.table("test").show()
+-+--++
|         a      |           b  |               c  |
+-+---+---+
|       0E-7  |0.00 |         0E-8  | //If scale > 6, zero is displayed in 
scientific notation
|1.000|1.00|1.|
+-++--+

 Eg: In Postgress
--
CREATE TABLE Testdec (a DECIMAL(10,7), b DECIMAL(10,6), c DECIMAL(10,8));
INSERT INTO Testdec VALUES (0,0,0);
INSERT INTO Testdec VALUES (1,1,1);
select * from Testdec;
Result:
          a | .          b |        c
---+---+--
 0.000 | 0.00 | 0.
 1.000 | 1.00 | 1.

We can make spark SQL result consistent with other Databases like Postgresql

 


> When dataframe decimal type column having scale higher than 6, 0 values are 
> shown in scientific notation
> 
>
> Key: SPARK-25177
> URL: https://issues.apache.org/jira/browse/SPARK-25177
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Vinod KC
>Priority: Minor
>
> If the scale of a decimal type is > 6, a 0 value is shown in scientific 
> notation; as a result, when the dataframe output is saved to an external 
> database, the save fails because of the scientific notation on "0" values.
> Eg: In Spark
>  --
>  spark.sql("create table test (a decimal(10,7), b decimal(10,6), c decimal(10,8))")
>  spark.sql("insert into test values(0, 0, 0)")
>  spark.sql("insert into test values(1, 1, 1)")
>  spark.table("test").show()
>  +---------+--------+----------+
>  |        a|       b|         c|
>  +---------+--------+----------+
>  |     0E-7|0.000000|      0E-8|   // If scale > 6, zero is displayed in scientific notation
>  |1.0000000|1.000000|1.00000000|
>  +---------+--------+----------+
>  Eg: In PostgreSQL
>  --
>  CREATE TABLE Testdec (a DECIMAL(10,7), b DECIMAL(10,6), c DECIMAL(10,8));
>  INSERT INTO Testdec VALUES (0,0,0);
>  INSERT INTO Testdec VALUES (1,1,1);
>  select * from Testdec;
>  Result:
>           a |        b |          c
>  -----------+----------+------------
>   0.0000000 | 0.000000 | 0.00000000
>   1.0000000 | 1.000000 | 1.00000000
> We can make the Spark SQL result consistent with other databases like PostgreSQL.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25177) When dataframe decimal type column having scale higher than 6, 0 values are shown in scientific notation

2018-08-21 Thread Vinod KC (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod KC updated SPARK-25177:
-
Description: 
If the scale of a decimal type is > 6, a 0 value is shown in scientific notation; 
as a result, when the dataframe output is saved to an external database, the save 
fails because of the scientific notation on "0" values.

Eg: In Spark
 --
 spark.sql("create table test (a decimal(10,7), b decimal(10,6), c decimal(10,8))")
 spark.sql("insert into test values(0, 0, 0)")
 spark.sql("insert into test values(1, 1, 1)")
 spark.table("test").show()
 +---------+--------+----------+
 |        a|       b|         c|
 +---------+--------+----------+
 |     0E-7|0.000000|      0E-8|   // If scale > 6, zero is displayed in scientific notation
 |1.0000000|1.000000|1.00000000|
 +---------+--------+----------+

 Eg: In PostgreSQL
 --
 CREATE TABLE Testdec (a DECIMAL(10,7), b DECIMAL(10,6), c DECIMAL(10,8));
 INSERT INTO Testdec VALUES (0,0,0);
 INSERT INTO Testdec VALUES (1,1,1);
 select * from Testdec;
 Result:
          a |        b |          c
 -----------+----------+------------
  0.0000000 | 0.000000 | 0.00000000
  1.0000000 | 1.000000 | 1.00000000

We can make the Spark SQL result consistent with other databases like PostgreSQL.

 

  was:
If scale of decimal type is > 6 , 0 value will be shown in scientific notation 
and hence, when the dataframe output is saved to external database, it fails 
due to scientific notation on "0" values.

Eg: In Spark
 --
 spark.sql("create table test (a decimal(10,7), b decimal(10,6), c 
decimal(10,8))")
 spark.sql("insert into test values(0, 0,0)")
 spark.sql("insert into test values(1, 1, 1)")
 spark.table("test").show()


|         a     |           b |               c |

 
|       0E-7 |0.00|         0E-8 |//If scale > 6, zero is displayed in 
scientific notation|
|1.000|1.00|1.|

 

 Eg: In Postgress
 --
 CREATE TABLE Testdec (a DECIMAL(10,7), b DECIMAL(10,6), c DECIMAL(10,8));
 INSERT INTO Testdec VALUES (0,0,0);
 INSERT INTO Testdec VALUES (1,1,1);
 select * from Testdec;
 Result:
           a | .          b |        c
 ---++-
 0.000 | 0.00 | 0.
 1.000 | 1.00 | 1.

We can make spark SQL result consistent with other Databases like Postgresql

 


> When dataframe decimal type column having scale higher than 6, 0 values are 
> shown in scientific notation
> 
>
> Key: SPARK-25177
> URL: https://issues.apache.org/jira/browse/SPARK-25177
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Vinod KC
>Priority: Minor
>
> If the scale of a decimal type is > 6, a 0 value is shown in scientific 
> notation; as a result, when the dataframe output is saved to an external 
> database, the save fails because of the scientific notation on "0" values.
> Eg: In Spark
>  --
>  spark.sql("create table test (a decimal(10,7), b decimal(10,6), c decimal(10,8))")
>  spark.sql("insert into test values(0, 0, 0)")
>  spark.sql("insert into test values(1, 1, 1)")
>  spark.table("test").show()
>  +---------+--------+----------+
>  |        a|       b|         c|
>  +---------+--------+----------+
>  |     0E-7|0.000000|      0E-8|   // If scale > 6, zero is displayed in scientific notation
>  |1.0000000|1.000000|1.00000000|
>  +---------+--------+----------+
>  Eg: In PostgreSQL
>  --
>  CREATE TABLE Testdec (a DECIMAL(10,7), b DECIMAL(10,6), c DECIMAL(10,8));
>  INSERT INTO Testdec VALUES (0,0,0);
>  INSERT INTO Testdec VALUES (1,1,1);
>  select * from Testdec;
>  Result:
>           a |        b |          c
>  -----------+----------+------------
>   0.0000000 | 0.000000 | 0.00000000
>   1.0000000 | 1.000000 | 1.00000000
> We can make the Spark SQL result consistent with other databases like PostgreSQL.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25177) When dataframe decimal type column having scale higher than 6, 0 values are shown in scientific notation

2018-08-21 Thread Vinod KC (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod KC updated SPARK-25177:
-
Summary: When dataframe decimal type column having scale higher than 6, 0 
values are shown in scientific notation  (was: When dataframe decimal type 
column having a scale higher than 6, 0 values are shown in scientific notation)

> When dataframe decimal type column having scale higher than 6, 0 values are 
> shown in scientific notation
> 
>
> Key: SPARK-25177
> URL: https://issues.apache.org/jira/browse/SPARK-25177
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Vinod KC
>Priority: Minor
>
> If the scale of a decimal type is > 6, a 0 value is shown in scientific 
> notation; as a result, when the dataframe output is saved to an external 
> database, the save fails because of the scientific notation on "0" values.
> Eg: In Spark
>  --
>  spark.sql("create table test (a decimal(10,7), b decimal(10,6), c decimal(10,8))")
>  spark.sql("insert into test values(0, 0, 0)")
>  spark.sql("insert into test values(1, 1, 1)")
>  spark.table("test").show()
>  +---------+--------+----------+
>  |        a|       b|         c|
>  +---------+--------+----------+
>  |     0E-7|0.000000|      0E-8|   // If scale > 6, zero is displayed in scientific notation
>  |1.0000000|1.000000|1.00000000|
>  +---------+--------+----------+
>  Eg: In PostgreSQL
>  --
>  CREATE TABLE Testdec (a DECIMAL(10,7), b DECIMAL(10,6), c DECIMAL(10,8));
>  INSERT INTO Testdec VALUES (0,0,0);
>  INSERT INTO Testdec VALUES (1,1,1);
>  select * from Testdec;
>  Result:
>           a |        b |          c
>  -----------+----------+------------
>   0.0000000 | 0.000000 | 0.00000000
>   1.0000000 | 1.000000 | 1.00000000
> We can make the Spark SQL result consistent with other databases like PostgreSQL.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25177) When dataframe decimal type column having a scale higher than 6, 0 values are shown in scientific notation

2018-08-21 Thread Vinod KC (JIRA)
Vinod KC created SPARK-25177:


 Summary: When dataframe decimal type column having a scale higher 
than 6, 0 values are shown in scientific notation
 Key: SPARK-25177
 URL: https://issues.apache.org/jira/browse/SPARK-25177
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Vinod KC


If the scale of a decimal type is > 6, a 0 value is shown in scientific notation; 
as a result, when the dataframe output is saved to an external database, the save 
fails because of the scientific notation on "0" values.

Eg: In Spark
 --
 spark.sql("create table test (a decimal(10,7), b decimal(10,6), c decimal(10,8))")
 spark.sql("insert into test values(0, 0, 0)")
 spark.sql("insert into test values(1, 1, 1)")
 spark.table("test").show()
 +---------+--------+----------+
 |        a|       b|         c|
 +---------+--------+----------+
 |     0E-7|0.000000|      0E-8|   // If scale > 6, zero is displayed in scientific notation
 |1.0000000|1.000000|1.00000000|
 +---------+--------+----------+

 Eg: In PostgreSQL
 --
 CREATE TABLE Testdec (a DECIMAL(10,7), b DECIMAL(10,6), c DECIMAL(10,8));
 INSERT INTO Testdec VALUES (0,0,0);
 INSERT INTO Testdec VALUES (1,1,1);
 select * from Testdec;
 Result:
          a |        b |          c
 -----------+----------+------------
  0.0000000 | 0.000000 | 0.00000000
  1.0000000 | 1.000000 | 1.00000000

We can make the Spark SQL result consistent with other databases like PostgreSQL.
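As background, here is a minimal JVM-level sketch of where the notation comes from, assuming the rendering ultimately goes through java.math.BigDecimal.toString (this illustrates the symptom only; it is not the change made by the eventual fix). BigDecimal.toString switches to scientific notation once the adjusted exponent drops below -6, which is exactly what happens for a zero with scale 7 or 8, while toPlainString never does:

import java.math.BigDecimal

object DecimalNotationSketch {
  def main(args: Array[String]): Unit = {
    // Zero with scale 7: unscaled value 0, adjusted exponent -7 (< -6),
    // so toString falls back to scientific notation.
    val zeroScale7 = BigDecimal.valueOf(0L, 7)
    println(zeroScale7.toString)       // prints 0E-7
    println(zeroScale7.toPlainString)  // prints 0.0000000

    // One with scale 7 has adjusted exponent 0, so it prints plainly.
    val oneScale7 = BigDecimal.valueOf(10000000L, 7)
    println(oneScale7.toString)        // prints 1.0000000
  }
}

Rendering through toPlainString, so that the declared scale alone decides the number of trailing zeros, would match the PostgreSQL output shown above; whether that is the approach taken is up to the linked pull request.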

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25176) Kryo fails to serialize a parametrised type hierarchy

2018-08-21 Thread Mikhail Pryakhin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Pryakhin updated SPARK-25176:
-
Description: 
I'm using the latest Spark version, spark-core_2.11:2.3.1, which 
transitively depends on com.esotericsoftware:kryo-shaded:3.0.3 via the 
com.twitter:chill_2.11:0.8.0 dependency. This exact version of the Kryo 
serializer contains an issue [1,2] which results in ClassCastExceptions being 
thrown when serialising a parameterised type hierarchy.
This issue has been fixed in Kryo version 4.0.0 [3]. It would be great to have 
this update in Spark as well. Could you please upgrade the 
com.twitter:chill_2.11 dependency from 0.8.0 to 0.9.2?
You can find a simple test to reproduce the issue [4].

[1] https://github.com/EsotericSoftware/kryo/issues/384
[2] https://github.com/EsotericSoftware/kryo/issues/377
[3] https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0
[4] https://github.com/mpryahin/kryo-parametrized-type-inheritance

  was:
I'm using the latest spark version spark-core_2.11:2.3.1 which 
transitively depends on com.esotericsoftware:kryo-shaded:3.0.3 via the
com.twitter:chill_2.11:0.8.0 dependency. This exact version of kryo serializer 
contains an issue [1,2] which results in throwing ClassCastExceptions when 
serialising parameterised type hierarchy.
This issue was fixed in kryo version 4.0.0 [3]. It would be great to have this 
update in Spark as well. Could you please upgrade the version of 
com.twitter:chill_2.11 dependency from 0.8.0 up to 0.9.2?
You can find a simple test to reproduce the issue [4].

[1] https://github.com/EsotericSoftware/kryo/issues/384
[2] https://github.com/EsotericSoftware/kryo/issues/377
[3] https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0
[4] https://github.com/mpryahin/kryo-parametrized-type-inheritance


> Kryo fails to serialize a parametrised type hierarchy
> -
>
> Key: SPARK-25176
> URL: https://issues.apache.org/jira/browse/SPARK-25176
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2, 2.3.1
>Reporter: Mikhail Pryakhin
>Priority: Critical
>
> I'm using the latest Spark version, spark-core_2.11:2.3.1, which 
> transitively depends on com.esotericsoftware:kryo-shaded:3.0.3 via the 
> com.twitter:chill_2.11:0.8.0 dependency. This exact version of the Kryo 
> serializer contains an issue [1,2] which results in ClassCastExceptions being 
> thrown when serialising a parameterised type hierarchy.
> This issue has been fixed in Kryo version 4.0.0 [3]. It would be great to 
> have this update in Spark as well. Could you please upgrade the 
> com.twitter:chill_2.11 dependency from 0.8.0 to 0.9.2?
> You can find a simple test to reproduce the issue [4].
> [1] https://github.com/EsotericSoftware/kryo/issues/384
> [2] https://github.com/EsotericSoftware/kryo/issues/377
> [3] https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0
> [4] https://github.com/mpryahin/kryo-parametrized-type-inheritance
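Independent of the requested chill upgrade, a minimal sketch of how Kryo is wired into a Spark job may help frame the report. The Node/IntNode classes below are hypothetical stand-ins for a parameterised hierarchy (they are not taken from the linked reproducer [4]), and the config keys are the standard Spark serializer settings:

import org.apache.spark.SparkConf

// Hypothetical parameterised hierarchy, a stand-in for the shapes that
// trip the ClassCastException under kryo-shaded 3.0.3.
abstract class Node[T](val value: T) extends Serializable
class IntNode(v: Int) extends Node[Int](v)

object KryoConfigSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kryo-config-sketch")
      .setMaster("local[2]")
      // Use Kryo for Spark's data serialization (shuffle, serialized caching).
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Fail fast if a class slips through unregistered.
      .set("spark.kryo.registrationRequired", "true")
      .registerKryoClasses(Array(classOf[Node[_]], classOf[IntNode]))

    println(conf.get("spark.serializer"))
  }
}

As the reporter notes, bumping com.twitter:chill_2.11 to the 0.9.x line is what would pull in the Kryo 4.x release carrying the fix for [1,2].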



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25176) Kryo fails to serialize a parametrised type hierarchy

2018-08-21 Thread Mikhail Pryakhin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Pryakhin updated SPARK-25176:
-
Description: 
I'm using the latest Spark version, spark-core_2.11:2.3.1, which 
transitively depends on com.esotericsoftware:kryo-shaded:3.0.3 via the 
com.twitter:chill_2.11:0.8.0 dependency. This exact version of the Kryo 
serializer contains an issue [1,2] which results in ClassCastExceptions being 
thrown when serialising a parameterised type hierarchy.
This issue was fixed in Kryo version 4.0.0 [3]. It would be great to have this 
update in Spark as well. Could you please upgrade the 
com.twitter:chill_2.11 dependency from 0.8.0 to 0.9.2?
You can find a simple test to reproduce the issue [4].

[1] https://github.com/EsotericSoftware/kryo/issues/384
[2] https://github.com/EsotericSoftware/kryo/issues/377
[3] https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0
[4] https://github.com/mpryahin/kryo-parametrized-type-inheritance

  was:
I'm using the latest spark version spark-core_2.11:2.3.1 which 
transitively depends on com.esotericsoftware:kryo-shaded:3.0.3 via the
com.twitter:chill_2.11:0.8.0 dependency. This exact version of kryo serializer 
contains an issue [1,2] which results in throwing ClassCastExceptions when 
serialising parameterised type hierarchy.
This issue was fixed in kryo version 4.0.0 [3]. It would be great to have this 
update in Spark as well. Could you please upgrade the version of 
com.twitter:chill_2.11 dependency from 0.8.0 up to 0.9.2?
You can find a simple test to reproduce the issue [4].

the following command runs the test against kryo version 3.0.3
./gradlew run -PkryoVersion=3.0.3
the following command runs the test against kryo version 4.0.0
./gradlew run -PkryoVersion=4.0.0

[1] https://github.com/EsotericSoftware/kryo/issues/384
[2] https://github.com/EsotericSoftware/kryo/issues/377
[3] https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0
[4] https://github.com/mpryahin/kryo-parametrized-type-inheritance


> Kryo fails to serialize a parametrised type hierarchy
> -
>
> Key: SPARK-25176
> URL: https://issues.apache.org/jira/browse/SPARK-25176
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2, 2.3.1
>Reporter: Mikhail Pryakhin
>Priority: Critical
>
> I'm using the latest Spark version, spark-core_2.11:2.3.1, which 
> transitively depends on com.esotericsoftware:kryo-shaded:3.0.3 via the 
> com.twitter:chill_2.11:0.8.0 dependency. This exact version of the Kryo 
> serializer contains an issue [1,2] which results in ClassCastExceptions being 
> thrown when serialising a parameterised type hierarchy.
> This issue was fixed in Kryo version 4.0.0 [3]. It would be great to have 
> this update in Spark as well. Could you please upgrade the 
> com.twitter:chill_2.11 dependency from 0.8.0 to 0.9.2?
> You can find a simple test to reproduce the issue [4].
> [1] https://github.com/EsotericSoftware/kryo/issues/384
> [2] https://github.com/EsotericSoftware/kryo/issues/377
> [3] https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0
> [4] https://github.com/mpryahin/kryo-parametrized-type-inheritance



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


