[jira] [Commented] (SPARK-26792) Apply custom log URL to Spark UI

2019-01-31 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758037#comment-16758037
 ] 

Jungtaek Lim commented on SPARK-26792:
--

Hi [~Thatboix45], looks like you voted for this issue: do you have an actual use 
case for this? If your case is different from pointing the YARN log URL to the log 
directory, we may want to apply this instead of changing the default value of the 
YARN executor log URL. 
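
For context, a custom executor log URL pattern of the kind SPARK-23155 adds for the 
history server might look like the sketch below; the property name and pattern tokens 
are taken from that change, and pointing the pattern at the NodeManager's container 
log overview page (rather than a single stdout/stderr file) is only an illustration 
of the idea discussed here.

{code}
# Sketch: link executor logs to the YARN container log overview page.
spark.history.custom.executor.log.url={{HTTP_SCHEME}}{{NM_HOST}}:{{NM_HTTP_PORT}}/node/containerlogs/{{CONTAINER_ID}}/{{USER}}
{code}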


> Apply custom log URL to Spark UI
> 
>
> Key: SPARK-26792
> URL: https://issues.apache.org/jira/browse/SPARK-26792
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> SPARK-23155 enables SHS to set up custom log URLs for incompleted / completed 
> apps.
> While getting reviews on SPARK-23155, I received two comments suggesting that 
> applying custom log URLs to the UI would help address them. Quoting these comments here:
> https://github.com/apache/spark/pull/23260#issuecomment-456827963
> {quote}
> Sorry I haven't had time to look through all the code so this might be a 
> separate jira, but one thing I thought of here is it would be really nice not 
> to have specifically stderr/stdout. Users can specify any log4j.properties, 
> and some tools like Oozie by default end up using the Hadoop log4j rather than the 
> Spark log4j, so the files aren't necessarily the same. Also, users can put in 
> other log files, so it would be nice to have links to those from the UI. It 
> seems simpler if we just had a link to the directory and it read the files 
> within there. Other things in Hadoop do it this way, but I'm not sure if that 
> works well for other resource managers, any thoughts on that? As long as this 
> doesn't prevent the above I can file a separate jira for it.
> {quote}
> https://github.com/apache/spark/pull/23260#issuecomment-456904716
> {quote}
> Hi Tom, +1: singling out stdout and stderr is definitely an annoyance. We
> typically configure Spark jobs to write the GC log and dump the heap on OOM
> using JVM options, and/or we use the rolling file appender to deal with
> large logs during debugging. So linking the YARN container log overview
> page would make much more sense for us. We work around it with a custom
> submit process that logs all important URLs in the submit-side log.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-24404) Increase currentEpoch when meet a EpochMarker in ContinuousQueuedDataReader.next() in CP mode based on PR #21353 #21332 #21293 and the latest master

2019-01-31 Thread Liangchang Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liangchang Zhu closed SPARK-24404.
--

> Increase currentEpoch when meet a EpochMarker in 
> ContinuousQueuedDataReader.next()  in CP mode based  on PR #21353 #21332 
> #21293 and the latest master
> --
>
> Key: SPARK-24404
> URL: https://issues.apache.org/jira/browse/SPARK-24404
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Liangchang Zhu
>Priority: Major
>
> In CP mode, based on PR #21353 #21332 #21293 and the latest master, 
> ContinuousQueuedDataReader.next() will be invoked by 
> ContinuousDataSourceRDD.compute to return an UnsafeRow. When the currentEntry polled 
> from the ArrayBlockingQueue is an EpochMarker, ContinuousQueuedDataReader will 
> send a `ReportPartitionOffset` message to the epochCoordinator with the currentEpoch of 
> EpochTracker. The currentEpoch is a ThreadLocal variable, but currently nothing 
> invokes `incrementCurrentEpoch` to increase currentEpoch in its thread, so 
> `getCurrentEpoch` will return `None` all the time (because currentEpoch is 
> -1). This causes an exception when `None.get` is invoked. At the same time, in 
> order to give `ReportPartitionOffset` correct semantics, we need to 
> increase currentEpoch before sending this message.
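
For illustration, the ThreadLocal pattern described above looks roughly like the 
minimal sketch below (names modeled on EpochTracker; this is not the actual Spark 
code, only a demonstration of why a thread that never increments keeps the unset value):

{code:scala}
// Minimal sketch of a per-thread epoch counter, modeled on EpochTracker.
object EpochCounterSketch {
  private val currentEpoch = new ThreadLocal[Long] {
    override def initialValue(): Long = -1L  // -1 means "no epoch seen yet"
  }

  // Returns None until incrementCurrentEpoch() has been called on this thread.
  def getCurrentEpoch: Option[Long] = {
    val epoch = currentEpoch.get()
    if (epoch < 0) None else Some(epoch)
  }

  def incrementCurrentEpoch(): Unit = currentEpoch.set(currentEpoch.get() + 1)
}
{code}

Calling {{EpochCounterSketch.getCurrentEpoch.get}} before any increment reproduces 
the `None.get` failure described above.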



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24404) Increase currentEpoch when meet a EpochMarker in ContinuousQueuedDataReader.next() in CP mode based on PR #21353 #21332 #21293 and the latest master

2019-01-31 Thread Liangchang Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liangchang Zhu resolved SPARK-24404.

Resolution: Won't Fix

> Increase currentEpoch when meet a EpochMarker in 
> ContinuousQueuedDataReader.next()  in CP mode based  on PR #21353 #21332 
> #21293 and the latest master
> --
>
> Key: SPARK-24404
> URL: https://issues.apache.org/jira/browse/SPARK-24404
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Liangchang Zhu
>Priority: Major
>
> In CP mode, based on PR #21353 #21332 #21293 and the latest master, 
> ContinuousQueuedDataReader.next() will be invoked by 
> ContinuousDataSourceRDD.compute to return an UnsafeRow. When the currentEntry polled 
> from the ArrayBlockingQueue is an EpochMarker, ContinuousQueuedDataReader will 
> send a `ReportPartitionOffset` message to the epochCoordinator with the currentEpoch of 
> EpochTracker. The currentEpoch is a ThreadLocal variable, but currently nothing 
> invokes `incrementCurrentEpoch` to increase currentEpoch in its thread, so 
> `getCurrentEpoch` will return `None` all the time (because currentEpoch is 
> -1). This causes an exception when `None.get` is invoked. At the same time, in 
> order to give `ReportPartitionOffset` correct semantics, we need to 
> increase currentEpoch before sending this message.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26525) Fast release memory of ShuffleBlockFetcherIterator

2019-01-31 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-26525:
---

Assignee: liupengcheng

> Fast release memory of ShuffleBlockFetcherIterator
> --
>
> Key: SPARK-26525
> URL: https://issues.apache.org/jira/browse/SPARK-26525
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.2
>Reporter: liupengcheng
>Assignee: liupengcheng
>Priority: Major
>
> Currently, Spark does not release the ShuffleBlockFetcherIterator until the 
> whole task has finished.
> In some conditions, this incurs a memory leak.
> An example is Shuffle -> map -> Coalesce(shuffle = false). Each 
> ShuffleBlockFetcherIterator contains some metadata about 
> MapStatus (blocksByAddress), and each ShuffleMapTask may keep up to n (the number of 
> shuffle partitions) ShuffleBlockFetcherIterators, because they are referred to by the 
> onCompleteCallbacks of TaskContext. In some cases this may take a huge amount of 
> memory, and the memory is not released until the task finishes.
> Actually, we can release a ShuffleBlockFetcherIterator as soon as it is consumed.
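
As an illustration of the idea (generic code, not the actual patch), an iterator 
can run its cleanup as soon as it is exhausted instead of waiting for the 
task-completion callback; a minimal sketch, assuming a caller-supplied cleanup 
function:

{code:scala}
// Minimal sketch: wrap an iterator so that a cleanup function runs as soon as
// the underlying iterator is exhausted, rather than only at task completion.
class EagerCleanupIterator[A](underlying: Iterator[A], cleanup: () => Unit)
  extends Iterator[A] {

  private var cleaned = false

  override def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more && !cleaned) {
      cleaned = true
      cleanup()  // release buffers/metadata once the data has been fully consumed
    }
    more
  }

  override def next(): A = underlying.next()
}
{code}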



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24541) TCP based shuffle

2019-01-31 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757985#comment-16757985
 ] 

Jungtaek Lim commented on SPARK-24541:
--

Continuous processing requires a single stage so that all tasks stay running all 
the time and processing never stops. It works for queries which only have 
map-like transformations, but stateful operations normally require 
hash-partitioning, so Spark needs some mechanism to route rows directly 
(as other streaming frameworks do).

Some of this work was done more than half a year ago, so I may not remember it 
exactly, but the POC-like work was done with RPC, and this issue tracks the 
effort of changing it to TCP.
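
As a toy illustration of the routing need mentioned above (generic code, not part 
of Spark): a stateful operator has to send each row to the task that owns the hash 
partition of its key, e.g.:

{code:scala}
// Toy illustration: pick the partition that owns a key, the way a hash shuffle
// would, so rows can be routed to the corresponding task directly.
def targetPartition(key: Any, numPartitions: Int): Int = {
  val mod = key.hashCode % numPartitions
  if (mod < 0) mod + numPartitions else mod  // keep the result non-negative
}
{code}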

> TCP based shuffle
> -
>
> Key: SPARK-24541
> URL: https://issues.apache.org/jira/browse/SPARK-24541
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26744) Support schema validation in File Source V2

2019-01-31 Thread Gengliang Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-26744:
---
Description: 
The internal API supportDataType in FileFormat validates the output/input 
schema before task execution starts, so that we can avoid launching read/write 
tasks which would fail, and users see clean error messages.

This PR implements the same internal API in the FileDataSourceV2 
framework. Compared to FileFormat, FileDataSourceV2 has multiple layers. The 
API is added in two places:

1. Read path: the table schema is determined in TableProvider.getTable. The 
actual read schema can be a subset of the table schema. This PR proposes to 
validate the actual read schema in FileScan.
2. Write path: validate the actual output schema in FileWriteBuilder.
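
For illustration, the kind of validation hook described above might look like the 
minimal sketch below (generic code with a hypothetical per-format predicate, not the 
actual FileDataSourceV2 API):

{code:scala}
import org.apache.spark.sql.types.{DataType, StructType}

// Minimal sketch: fail fast with a clean error message if the schema contains a
// type the format cannot handle, before any read/write task is launched.
// `supportsDataType` stands in for the concrete format's capability check.
def verifySchema(
    formatName: String,
    schema: StructType,
    supportsDataType: DataType => Boolean): Unit = {
  schema.fields.foreach { field =>
    if (!supportsDataType(field.dataType)) {
      throw new UnsupportedOperationException(
        s"$formatName data source does not support ${field.dataType.catalogString} " +
          s"data type for column ${field.name}.")
    }
  }
}
{code}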

  was:
The method supportDataType in FileFormat helps validate the output/input 
schema before execution starts, so that we can avoid some invalid data source 
IO, and users can see clean error messages.

This PR implements the same method in the FileDataSourceV2 framework. 
Compared to FileFormat, FileDataSourceV2 has multiple layers. The API is added 
in two places:

1. FileWriteBuilder: this is where we can get the actual write schema.
2. FileScan: this is where we can get the actual read schema.


> Support schema validation in File Source V2
> ---
>
> Key: SPARK-26744
> URL: https://issues.apache.org/jira/browse/SPARK-26744
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> The internal API supportDataType in FileFormat validates the output/input 
> schema before task execution starts, so that we can avoid launching 
> read/write tasks which would fail, and users see clean error messages.
> This PR implements the same internal API in the FileDataSourceV2 
> framework. Compared to FileFormat, FileDataSourceV2 has multiple layers. The 
> API is added in two places:
> 1. Read path: the table schema is determined in TableProvider.getTable. The 
> actual read schema can be a subset of the table schema. This PR proposes to 
> validate the actual read schema in FileScan.
> 2. Write path: validate the actual output schema in FileWriteBuilder.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24541) TCP based shuffle

2019-01-31 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757936#comment-16757936
 ] 

Imran Rashid commented on SPARK-24541:
--

Can you explain what this means at all?

Regular Spark shuffles are already done with TCP, but I'm sure you are 
referring to something entirely different here.

> TCP based shuffle
> -
>
> Key: SPARK-24541
> URL: https://issues.apache.org/jira/browse/SPARK-24541
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26744) Support schema validation in File Source V2

2019-01-31 Thread Gengliang Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-26744:
---
Description: 
The method supportDataType in FileFormat helps validate the output/input 
schema before execution starts, so that we can avoid some invalid data source 
IO, and users can see clean error messages.

This PR implements the same method in the FileDataSourceV2 framework. 
Compared to FileFormat, FileDataSourceV2 has multiple layers. The API is added 
in two places:

1. FileWriteBuilder: this is where we can get the actual write schema.
2. FileScan: this is where we can get the actual read schema.

> Support schema validation in File Source V2
> ---
>
> Key: SPARK-26744
> URL: https://issues.apache.org/jira/browse/SPARK-26744
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> The method supportDataType in FileFormat helps validate the output/input 
> schema before execution starts, so that we can avoid some invalid data source 
> IO, and users can see clean error messages.
> This PR implements the same method in the FileDataSourceV2 framework. 
> Compared to FileFormat, FileDataSourceV2 has multiple layers. The API is 
> added in two places:
> 1. FileWriteBuilder: this is where we can get the actual write schema.
> 2. FileScan: this is where we can get the actual read schema.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26730) Strip redundant AssertNotNull expression for ExpressionEncoder's serializer

2019-01-31 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26730.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23651
[https://github.com/apache/spark/pull/23651]

> Strip redundant AssertNotNull expression for ExpressionEncoder's serializer
> ---
>
> Key: SPARK-26730
> URL: https://issues.apache.org/jira/browse/SPARK-26730
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: wuyi
>Priority: Major
> Fix For: 3.0.0
>
>
> For types like Product, we've already added AssertNotNull when we construct the 
> serializer (please see the code below), so we can strip the redundant AssertNotNull 
> for those types. Please see the related PR for details.
>  
> {code:java}
> val fieldValue = Invoke(
>   AssertNotNull(inputObject, walkedTypePath), fieldName, dataTypeFor(fieldType),
>   returnNullable = !fieldType.typeSymbol.asClass.isPrimitive)
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26730) Strip redundant AssertNotNull expression for ExpressionEncoder's serializer

2019-01-31 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-26730:
---

Assignee: wuyi

> Strip redundant AssertNotNull expression for ExpressionEncoder's serializer
> ---
>
> Key: SPARK-26730
> URL: https://issues.apache.org/jira/browse/SPARK-26730
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.0
>
>
> For types like Product, we've already added AssertNotNull when we construct the 
> serializer (please see the code below), so we can strip the redundant AssertNotNull 
> for those types. Please see the related PR for details.
>  
> {code:java}
> val fieldValue = Invoke(
>   AssertNotNull(inputObject, walkedTypePath), fieldName, dataTypeFor(fieldType),
>   returnNullable = !fieldType.typeSymbol.asClass.isPrimitive)
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26786) Handle to treat escaped newline characters('\r','\n') in spark csv

2019-01-31 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757898#comment-16757898
 ] 

Hyukjin Kwon commented on SPARK-26786:
--

This behaviour is inherited from the Univocity parser, if I am not mistaken. Could 
you raise this issue there?

> Handle to treat escaped newline characters('\r','\n') in spark csv
> --
>
> Key: SPARK-26786
> URL: https://issues.apache.org/jira/browse/SPARK-26786
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: vishnuram selvaraj
>Priority: Major
>
> There are some systems, like AWS Redshift, which write CSV files by escaping 
> newline characters ('\r', '\n') in addition to escaping the quote characters, 
> if they come as part of the data.
> Redshift documentation 
> link ([https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html]); below 
> is their description of the escaping requirements from that link:
> ESCAPE
> For CHAR and VARCHAR columns in delimited unload files, an escape character 
> ({{\}}) is placed before every occurrence of the following characters:
>  * Linefeed: {{\n}}
>  * Carriage return: {{\r}}
>  * The delimiter character specified for the unloaded data.
>  * The escape character: {{\}}
>  * A quote character: {{"}} or {{'}} (if both ESCAPE and ADDQUOTES are 
> specified in the UNLOAD command).
>  
> *Problem statement:* 
> But the Spark CSV reader doesn't have an option to treat/remove the escape 
> characters in front of the newline characters in the data.
> It would really help if we could add a feature to handle the escaped newline 
> characters through another parameter like (escapeNewline = 'true/false').
> *Example:*
> Below are the details of my test data set up in a file.
>  * The first record in that file has an escaped Windows newline character (\r\n)
>  * The third record in that file has an escaped Unix newline character (\n)
>  * The fifth record in that file has the escaped quote character (")
> the file looks like below in the vi editor:
>  
> {code:java}
> "1","this is \^M\
> line1"^M
> "2","this is line2"^M
> "3","this is \
> line3"^M
> "4","this is \" line4"^M
> "5","this is line5"^M{code}
>  
> When I read the file with Python's csv module with an escape character set, it is 
> able to remove the added escape characters, as you can see below:
>  
> {code:java}
> >>> import csv
> >>> with open('/tmp/test3.csv','r') as readCsv:
> ... readFile = 
> csv.reader(readCsv,dialect='excel',escapechar='\\',quotechar='"',delimiter=',',doublequote=False)
> ... for row in readFile:
> ... print(row)
> ...
> ['1', 'this is \r\n line1']
> ['2', 'this is line2']
> ['3', 'this is \n line3']
> ['4', 'this is " line4']
> ['5', 'this is line5']
> {code}
> But if I read the same file with the Spark CSV reader, the escape characters 
> in front of the newline characters are not removed. However, the escape before the 
> (") is removed.
> {code:java}
> >>> redDf=spark.read.csv(path='file:///tmp/test3.csv',header='false',sep=',',quote='"',escape='\\',multiLine='true',ignoreLeadingWhiteSpace='true',ignoreTrailingWhiteSpace='true',mode='FAILFAST',inferSchema='false')
> >>> redDf.show()
> +---+--+
> |_c0| _c1|
> +---+--+
> | 1|this is \
> line1|
> | 2| this is line2|
> | 3| this is \
> line3|
> | 4| this is " line4|
> | 5| this is line5|
> +---+--+
> {code}
>  *Expected result:*
> {code:java}
> +---+--+
> |_c0| _c1|
> +---+--+
> | 1|this is 
> line1|
> | 2| this is line2|
> | 3| this is 
> line3|
> | 4| this is " line4|
> | 5| this is line5|
> +---+--+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26787) Fix standardization error message in WeightedLeastSquares

2019-01-31 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26787.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23705
[https://github.com/apache/spark/pull/23705]

> Fix standardization error message in WeightedLeastSquares
> -
>
> Key: SPARK-26787
> URL: https://issues.apache.org/jira/browse/SPARK-26787
> Project: Spark
>  Issue Type: Documentation
>  Components: MLlib
>Affects Versions: 2.3.0, 2.3.1, 2.4.0
> Environment: Tested in Spark 2.4.0 on DataBricks running in 5.1 ML 
> Beta.
>  
>Reporter: Brian Scannell
>Assignee: Brian Scannell
>Priority: Trivial
> Fix For: 3.0.0
>
>
> There is an error message in WeightedLeastSquares.scala that is incorrect and 
> thus not very helpful for diagnosing an issue. The problem arises when doing 
> regularized LinearRegression on a constant label. Even when the parameter 
> standardization=False, the error will falsely state that standardization was 
> set to True:
> {{The standard deviation of the label is zero. Model cannot be regularized 
> with standardization=true}}
> This is because under the hood, LinearRegression automatically sets a 
> parameter standardizeLabel=True. This was chosen for consistency with GLMNet, 
> although WeightedLeastSquares is written to allow standardizeLabel to be set 
> either way and still work (though the public LinearRegression API does not 
> expose it).
>  
> I will submit a pull request with my suggested wording.
>  
> Relevant:
> [https://github.com/apache/spark/pull/10702]
> [https://github.com/apache/spark/pull/10274/commits/d591989f7383b713110750f80b2720bcf24814b5]
>  
>  
> The following Python code will replicate the error. 
> {code:java}
> import pandas as pd
> from pyspark.ml.feature import VectorAssembler
> from pyspark.ml.regression import LinearRegression
> df = pd.DataFrame({'foo': [1,2,3], 'bar':[4,5,6],'label':[1,1,1]})
> spark_df = spark.createDataFrame(df)
> vectorAssembler = VectorAssembler(inputCols = ['foo', 'bar'], outputCol = 
> 'features')
> train_sdf = vectorAssembler.transform(spark_df).select(['features', 'label'])
> lr = LinearRegression(featuresCol='features', labelCol='label', 
> fitIntercept=False, standardization=False, regParam=1e-4)
> lr_model = lr.fit(train_sdf)
> {code}
>  
> For context, the reason someone might want to do this is if they are trying 
> to fit a model to estimate components of a fixed total. The label indicates 
> the total is always 100%, but the components vary. For example, trying to 
> estimate the unknown weights of different quantities of substances in a 
> series of full bins. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24959) Do not invoke the CSV/JSON parser for empty schema

2019-01-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24959:
-
Fix Version/s: (was: 2.4.0)

> Do not invoke the CSV/JSON parser for empty schema
> --
>
> Key: SPARK-24959
> URL: https://issues.apache.org/jira/browse/SPARK-24959
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently the JSON and CSV parsers are called even if the required schema is empty. 
> Invoking the parser for each line has some non-zero overhead. This work can 
> be skipped. Such an optimization should speed up count(), for example.
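
For illustration, the kind of guard described above might look like the minimal 
sketch below (hypothetical names, not the actual Spark code path):

{code:scala}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.StructType

// Minimal sketch: when no columns are required (e.g. a plain count()), emit an
// empty row per input line instead of running the full parser on every line.
// `parseLine` stands in for the real CSV/JSON parsing function.
def parsedRows(
    lines: Iterator[String],
    requiredSchema: StructType,
    parseLine: String => InternalRow): Iterator[InternalRow] = {
  if (requiredSchema.isEmpty) {
    lines.map(_ => InternalRow.empty)  // skip parsing entirely
  } else {
    lines.map(parseLine)
  }
}
{code}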



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26745) Non-parsing Dataset.count() optimization causes inconsistent results for JSON inputs with empty lines

2019-01-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26745:
-
Fix Version/s: 2.4.1

> Non-parsing Dataset.count() optimization causes inconsistent results for JSON 
> inputs with empty lines
> -
>
> Key: SPARK-26745
> URL: https://issues.apache.org/jira/browse/SPARK-26745
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Branden Smith
>Assignee: Hyukjin Kwon
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.4.1, 3.0.0
>
>
> The optimization introduced by 
> [SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959] (improving 
> performance of {{count()}} for DataFrames read from 
> non-multiline JSON in {{PERMISSIVE}} mode) appears to 
> cause {{count()}} to erroneously include empty lines in 
> its result total if run prior to JSON parsing taking place.
> For the following input:
> {code:json}
> { "a" : 1 , "b" : 2 , "c" : 3 }
> { "a" : 4 , "b" : 5 , "c" : 6 }
>  
> { "a" : 7 , "b" : 8 , "c" : 9 }
> {code}
> *+Spark 2.3:+*
> {code:scala}
> scala> val df = 
> spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json")
> df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field]
> scala> df.count
> res0: Long = 3
> scala> df.cache.count
> res3: Long = 3
> {code}
> *+Spark 2.4:+*
> {code:scala}
> scala> val df = 
> spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json")
> df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field]
> scala> df.count
> res0: Long = 7
> scala> df.cache.count
> res1: Long = 3
> {code}
> Since the count is apparently updated and cached when the Jackson parser 
> runs, the optimization also causes the count to appear to be unstable upon 
> cache/persist operations, as shown above.
> CSV inputs, also optimized via 
> [SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959], do not 
> appear to be impacted by this effect.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7721) Generate test coverage report from Python

2019-01-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-7721:
---

Assignee: Hyukjin Kwon

> Generate test coverage report from Python
> -
>
> Key: SPARK-7721
> URL: https://issues.apache.org/jira/browse/SPARK-7721
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Reporter: Reynold Xin
>Assignee: Hyukjin Kwon
>Priority: Major
>
> It would be great to have a test coverage report for Python. Compared with Scala, 
> it is trickier to understand the coverage without coverage reports in Python, 
> because we employ both docstring tests and unit tests in test files. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7721) Generate test coverage report from Python

2019-01-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-7721.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23117
[https://github.com/apache/spark/pull/23117]

> Generate test coverage report from Python
> -
>
> Key: SPARK-7721
> URL: https://issues.apache.org/jira/browse/SPARK-7721
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Reporter: Reynold Xin
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> It would be great to have a test coverage report for Python. Compared with Scala, 
> it is trickier to understand the coverage without coverage reports in Python, 
> because we employ both docstring tests and unit tests in test files. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26807) Confusing documentation regarding installation from PyPi

2019-01-31 Thread Emmanuel Arias (JIRA)
Emmanuel Arias created SPARK-26807:
--

 Summary: Confusing documentation regarding installation from PyPi
 Key: SPARK-26807
 URL: https://issues.apache.org/jira/browse/SPARK-26807
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 2.4.0
Reporter: Emmanuel Arias


Hello!

I am new to Spark. Reading the documentation, I think the Downloading section is a 
little confusing.

[https://spark.apache.org/docs/latest/#downloading|https://spark.apache.org/docs/latest/#downloading]
 says: "Scala and Java users can include Spark in their projects using its 
Maven coordinates and in the future Python users can also install Spark from 
PyPI.", which I read as saying that Spark is not on PyPI yet. But 
[https://spark.apache.org/downloads.html] says: 
"[PySpark|https://pypi.python.org/pypi/pyspark] is now available in pypi. To 
install just run {{pip install pyspark}}."



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26808) Pruned schema should not change nullability

2019-01-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26808:


Assignee: (was: Apache Spark)

> Pruned schema should not change nullability
> ---
>
> Key: SPARK-26808
> URL: https://issues.apache.org/jira/browse/SPARK-26808
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>
> We prune unnecessary nested fields from the requested schema when reading 
> Parquet. Currently it seems we don't keep the original nullability in the pruned 
> schema. We should keep the original nullability.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26808) Pruned schema should not change nullability

2019-01-31 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757880#comment-16757880
 ] 

Apache Spark commented on SPARK-26808:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/23719

> Pruned schema should not change nullability
> ---
>
> Key: SPARK-26808
> URL: https://issues.apache.org/jira/browse/SPARK-26808
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>
> We prune unnecessary nested fields from the requested schema when reading 
> Parquet. Currently it seems we don't keep the original nullability in the pruned 
> schema. We should keep the original nullability.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26808) Pruned schema should not change nullability

2019-01-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26808:


Assignee: Apache Spark

> Pruned schema should not change nullability
> ---
>
> Key: SPARK-26808
> URL: https://issues.apache.org/jira/browse/SPARK-26808
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>Priority: Minor
>
> We prune unnecessary nested fields from the requested schema when reading 
> Parquet. Currently it seems we don't keep the original nullability in the pruned 
> schema. We should keep the original nullability.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26808) Pruned schema should not change nullability

2019-01-31 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-26808:
---

 Summary: Pruned schema should not change nullability
 Key: SPARK-26808
 URL: https://issues.apache.org/jira/browse/SPARK-26808
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


We prune unnecessary nested fields from the requested schema when reading Parquet. 
Currently it seems we don't keep the original nullability in the pruned schema. We 
should keep the original nullability.
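
For illustration only, preserving nullability while pruning could look like the 
minimal sketch below (a hypothetical helper over Spark's StructType, not the actual 
fix in the linked PR):

{code:scala}
import org.apache.spark.sql.types.{StructField, StructType}

// Minimal sketch: copy the original nullability flags back onto a pruned schema,
// recursing into nested struct fields.
def keepOriginalNullability(original: StructType, pruned: StructType): StructType = {
  StructType(pruned.fields.map { prunedField =>
    original.fields.find(_.name == prunedField.name) match {
      case Some(origField) =>
        val dataType = (origField.dataType, prunedField.dataType) match {
          case (o: StructType, p: StructType) => keepOriginalNullability(o, p)
          case _ => prunedField.dataType
        }
        prunedField.copy(dataType = dataType, nullable = origField.nullable)
      case None => prunedField  // field not present in the original; leave as-is
    }
  })
}
{code}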



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26795) Retry remote fileSegmentManagedBuffer when creating inputStream failed during shuffle read phase

2019-01-31 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26795:
--
   Flags:   (was: Patch)
Target Version/s:   (was: 2.3.2, 2.4.0)
   Fix Version/s: (was: 2.3.2)
  (was: 2.3.1)
  (was: 2.4.0)
  (was: 2.3.0)

(Don't set Fix, Target versions)

> Retry remote fileSegmentManagedBuffer when creating inputStream failed during 
> shuffle read phase
> 
>
> Key: SPARK-26795
> URL: https://issues.apache.org/jira/browse/SPARK-26795
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: feiwang
>Priority: Major
>
> There is a parameter, spark.maxRemoteBlockSizeFetchToMem, which means a 
> remote block will be fetched to disk when the size of the block is above this 
> threshold in bytes.
> So during the shuffle read phase, a managedBuffer that throws an IOException may 
> be a remotely downloaded FileSegment and should be retried instead of 
> throwing a fetch failure directly.
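
As a sketch of the retry idea only (generic code, not the proposed patch):

{code:scala}
import java.io.{IOException, InputStream}

object RetrySketch {
  // Minimal sketch: retry opening a stream a few times on IOException before
  // giving up, instead of failing the whole fetch on the first error.
  def openWithRetries(open: () => InputStream, attemptsLeft: Int = 3): InputStream = {
    try {
      open()
    } catch {
      case _: IOException if attemptsLeft > 1 =>
        Thread.sleep(1000L)  // back off briefly before trying again
        openWithRetries(open, attemptsLeft - 1)
    }
  }
}
{code}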



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26787) Fix standardization error message in WeightedLeastSquares

2019-01-31 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26787:
-

Assignee: Brian Scannell

> Fix standardization error message in WeightedLeastSquares
> -
>
> Key: SPARK-26787
> URL: https://issues.apache.org/jira/browse/SPARK-26787
> Project: Spark
>  Issue Type: Documentation
>  Components: MLlib
>Affects Versions: 2.3.0, 2.3.1, 2.4.0
> Environment: Tested in Spark 2.4.0 on DataBricks running in 5.1 ML 
> Beta.
>  
>Reporter: Brian Scannell
>Assignee: Brian Scannell
>Priority: Trivial
>
> There is an error message in WeightedLeastSquares.scala that is incorrect and 
> thus not very helpful for diagnosing an issue. The problem arises when doing 
> regularized LinearRegression on a constant label. Even when the parameter 
> standardization=False, the error will falsely state that standardization was 
> set to True:
> {{The standard deviation of the label is zero. Model cannot be regularized 
> with standardization=true}}
> This is because under the hood, LinearRegression automatically sets a 
> parameter standardizeLabel=True. This was chosen for consistency with GLMNet, 
> although WeightedLeastSquares is written to allow standardizeLabel to be set 
> either way and still work (though the public LinearRegression API does not 
> expose it).
>  
> I will submit a pull request with my suggested wording.
>  
> Relevant:
> [https://github.com/apache/spark/pull/10702]
> [https://github.com/apache/spark/pull/10274/commits/d591989f7383b713110750f80b2720bcf24814b5]
>  
>  
> The following Python code will replicate the error. 
> {code:java}
> import pandas as pd
> from pyspark.ml.feature import VectorAssembler
> from pyspark.ml.regression import LinearRegression
> df = pd.DataFrame({'foo': [1,2,3], 'bar':[4,5,6],'label':[1,1,1]})
> spark_df = spark.createDataFrame(df)
> vectorAssembler = VectorAssembler(inputCols = ['foo', 'bar'], outputCol = 
> 'features')
> train_sdf = vectorAssembler.transform(spark_df).select(['features', 'label'])
> lr = LinearRegression(featuresCol='features', labelCol='label', 
> fitIntercept=False, standardization=False, regParam=1e-4)
> lr_model = lr.fit(train_sdf)
> {code}
>  
> For context, the reason someone might want to do this is if they are trying 
> to fit a model to estimate components of a fixed total. The label indicates 
> the total is always 100%, but the components vary. For example, trying to 
> estimate the unknown weights of different quantities of substances in a 
> series of full bins. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25997) Python example code for Power Iteration Clustering in spark.ml

2019-01-31 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25997.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22996
[https://github.com/apache/spark/pull/22996]

> Python example code for Power Iteration Clustering in spark.ml
> --
>
> Key: SPARK-25997
> URL: https://issues.apache.org/jira/browse/SPARK-25997
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.0.0
>
>
> Add a python example code for Power iteration clustering in spark.ml examples



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25997) Python example code for Power Iteration Clustering in spark.ml

2019-01-31 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-25997:
-

Assignee: Huaxin Gao

> Python example code for Power Iteration Clustering in spark.ml
> --
>
> Key: SPARK-25997
> URL: https://issues.apache.org/jira/browse/SPARK-25997
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
>
> Add a python example code for Power iteration clustering in spark.ml examples



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26726) Synchronize the amount of memory used by the broadcast variable to the UI display

2019-01-31 Thread hantiantian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hantiantian updated SPARK-26726:

Description: 
The amount of memory used by the broadcast variable is not synchronized to the 
UI display.

spark-sql>  select /*+ broadcast(a)*/ a.id,b.id from a join b on a.id = b.id;

View the app's driver log:

2019-01-25 16:45:23,726 INFO org.apache.spark.storage.BlockManagerInfo: Added 
broadcast_4_piece0 in memory on 10.43.xx.xx:33907 (size: 6.6 KB, free: 2.5 GB)
 2019-01-25 16:45:23,727 INFO org.apache.spark.storage.BlockManagerInfo: Added 
broadcast_4_piece0 in memory on 10.43.xx.xx:38399 (size: 6.6 KB, free: 2.5 GB)
 2019-01-25 16:45:23,745 INFO org.apache.spark.storage.BlockManagerInfo: Added 
broadcast_3_piece0 in memory on 10.43.xx.xx:33907 (size: 32.1 KB, free: 2.5 GB)
 2019-01-25 16:45:23,749 INFO org.apache.spark.storage.BlockManagerInfo: Added 
broadcast_3_piece0 in memory on 10.43.xx.xx:38399 (size: 32.1 KB, free: 2.5 GB)
 2019-01-25 16:45:23,838 INFO org.apache.spark.storage.BlockManagerInfo: Added 
broadcast_2_piece0 in memory on 10.43.xx.xx:38399 (size: 147.0 B, free: 2.5 GB)
 2019-01-25 16:45:23,840 INFO org.apache.spark.storage.BlockManagerInfo: Added 
broadcast_2_piece0 in memory on 10.43.xx.xx:33907 (size: 147.0 B, free: 2.5 GB)

 

The Web UI does not show the memory usage:
||Executor ID||Address||Status||RDD Blocks||Storage Memory||Disk 
Used||Cores||Active Tasks||Failed Tasks||Complete Tasks||Total Tasks||Task Time 
(GC Time)||Input||Shuffle Read||Shuffle Write||Logs||Thread Dump||
|0|xxx:38399|Active|0|0.0 B / 2.7 GB|0.0 B|1|0|0|2|2|4 s (0.4 s)|8 B|0.0 B|0.0 
B| | |
|driver|xxx:47936|Active|0|0.0 B / 384.1 MB|0.0 B|0|0|0|0|0|0.0 ms (0.0 ms)|0.0 
B|0.0 B|0.0 B| | |
|1|xxx:47414|Active|0|0.0 B / 2.7 GB|0.0 B|1|0|0|0|0|0.0 ms (0.0 ms)|0.0 B|0.0 
B|0.0 B| | |
|2|xxx:33907|Active|0|0.0 B / 2.7 GB|0.0 B|1|0|0|2|2|4 s (0.2 s)|4 B|0.0 B|0.0 
B| | |

 

 

  was:
The amount of memory used by the broadcast variable is not synchronized to the 
UI display.

spark-sql>  select /*+ broadcast(a)*/ a.id,b.id from a join b on a.id = b.id;

View the app's driver log:

2019-01-25 16:45:23,726 INFO org.apache.spark.storage.BlockManagerInfo: Added 
broadcast_4_piece0 in memory on 10.43.xx.xx:33907 (size: 6.6 KB, free: 2.5 GB)
 2019-01-25 16:45:23,727 INFO org.apache.spark.storage.BlockManagerInfo: Added 
broadcast_4_piece0 in memory on 10.43.xx.xx:38399 (size: 6.6 KB, free: 2.5 GB)
 2019-01-25 16:45:23,745 INFO org.apache.spark.storage.BlockManagerInfo: Added 
broadcast_3_piece0 in memory on 10.43.xx.xx:33907 (size: 32.1 KB, free: 2.5 GB)
 2019-01-25 16:45:23,749 INFO org.apache.spark.storage.BlockManagerInfo: Added 
broadcast_3_piece0 in memory on 10.43.xx.xx:38399 (size: 32.1 KB, free: 2.5 GB)
 2019-01-25 16:45:23,838 INFO org.apache.spark.storage.BlockManagerInfo: Added 
broadcast_2_piece0 in memory on 10.43.xx.xx:38399 (size: 147.0 B, free: 2.5 GB)
 2019-01-25 16:45:23,840 INFO org.apache.spark.storage.BlockManagerInfo: Added 
broadcast_2_piece0 in memory on 10.43.xx.xx:33907 (size: 147.0 B, free: 2.5 GB)

 

The Web UI does not show the memory usage:
||Executor ID||Address||Status||RDD Blocks||Storage Memory||Disk 
Used||Cores||Active Tasks||Failed Tasks||Complete Tasks||Total Tasks||Task Time 
(GC Time)||Input||Shuffle Read||Shuffle Write||Logs||Thread Dump||
|0|xxx:38399|Active|0|0.0 B / 2.7 GB|0.0 B|1|0|0|2|2|4 s (0.4 s)|8 B|0.0 B|0.0 
B|[stdout|http://10.43.183.120:18085/logPage/?appId=app-20190125164426-0003=0=stdout]
 
[stderr|http://10.43.183.120:18085/logPage/?appId=app-20190125164426-0003=0=stderr]|[Thread
 Dump|http://10.43.183.121:4040/executors/threadDump/?executorId=0]|
|driver|xxx:47936|Active|0|0.0 B / 384.1 MB|0.0 B|0|0|0|0|0|0.0 ms (0.0 ms)|0.0 
B|0.0 B|0.0 B| |[Thread 
Dump|http://10.43.183.121:4040/executors/threadDump/?executorId=driver]|
|1|xxx:47414|Active|0|0.0 B / 2.7 GB|0.0 B|1|0|0|0|0|0.0 ms (0.0 ms)|0.0 B|0.0 
B|0.0 
B|[stdout|http://10.43.183.121:18085/logPage/?appId=app-20190125164426-0003=1=stdout]
 
[stderr|http://10.43.183.121:18085/logPage/?appId=app-20190125164426-0003=1=stderr]|[Thread
 Dump|http://10.43.183.121:4040/executors/threadDump/?executorId=1]|
|2|xxx:33907|Active|0|0.0 B / 2.7 GB|0.0 B|1|0|0|2|2|4 s (0.2 s)|4 B|0.0 B|0.0 
B|[stdout|http://10.43.183.122:18085/logPage/?appId=app-20190125164426-0003=2=stdout]
 
[stderr|http://10.43.183.122:18085/logPage/?appId=app-20190125164426-0003=2=stderr]|[Thread
 Dump|http://10.43.183.121:4040/executors/threadDump/?executorId=2]|

 

 


>   Synchronize the amount of memory used by the broadcast variable to the UI 
> display
> ---
>
> Key: SPARK-26726
> URL: https://issues.apache.org/jira/browse/SPARK-26726
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects 

[jira] [Comment Edited] (SPARK-26783) Kafka parameter documentation doesn't match with the reality (upper/lowercase)

2019-01-31 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757842#comment-16757842
 ] 

Jungtaek Lim edited comment on SPARK-26783 at 2/1/19 1:01 AM:
--

I'm not sure about what you're referring to here from SPARK-23685:

{quote}
Originally this pr was created as "failOnDataLoss" doesn't have any impact when 
set in structured streaming. But found out that ,the variable that needs to be 
used is "failondataloss" (all in lower case).
{quote}

I just played with the test "failOnDataLoss=false should not return duplicated 
records: v1" in KafkaDontFailOnDataLossSuite, and it looks like it works as 
intended (in a case-insensitive manner).

"failOnDataLoss" -> "false" // passed
"failondataloss" -> "false" // passed
"FAILONDATALOSS" -> "false" // passed
 // failed
"failOnDataLoss" -> "true" // failed


was (Author: kabhwan):
I'm not sure about the comment [~sindiri] left on the PR:

{quote}
Originally this pr was created as "failOnDataLoss" doesn't have any impact when 
set in structured streaming. But found out that ,the variable that needs to be 
used is "failondataloss" (all in lower case).
{quote}

I just played with the test "failOnDataLoss=false should not return duplicated 
records: v1" in KafkaDontFailOnDataLossSuite, and it looks like it works as 
intended (in a case-insensitive manner).

"failOnDataLoss" -> "false" // passed
"failondataloss" -> "false" // passed
"FAILONDATALOSS" -> "false" // passed
 // failed
"failOnDataLoss" -> "true" // failed

> Kafka parameter documentation doesn't match with the reality (upper/lowercase)
> --
>
> Key: SPARK-26783
> URL: https://issues.apache.org/jira/browse/SPARK-26783
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Minor
>
> A good example for this is "failOnDataLoss" which is reported in SPARK-23685. 
> I've just checked and there are several other parameters which suffer from 
> the same issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26783) Kafka parameter documentation doesn't match with the reality (upper/lowercase)

2019-01-31 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757842#comment-16757842
 ] 

Jungtaek Lim commented on SPARK-26783:
--

I'm not sure about the comment [~sindiri] left on the PR:

{quote}
Originally this pr was created as "failOnDataLoss" doesn't have any impact when 
set in structured streaming. But found out that ,the variable that needs to be 
used is "failondataloss" (all in lower case).
{quote}

I just played with the test "failOnDataLoss=false should not return duplicated 
records: v1" in KafkaDontFailOnDataLossSuite, and it looks like it works as 
intended (in a case-insensitive manner).

"failOnDataLoss" -> "false" // passed
"failondataloss" -> "false" // passed
"FAILONDATALOSS" -> "false" // passed
 // failed
"failOnDataLoss" -> "true" // failed

> Kafka parameter documentation doesn't match with the reality (upper/lowercase)
> --
>
> Key: SPARK-26783
> URL: https://issues.apache.org/jira/browse/SPARK-26783
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Minor
>
> A good example for this is "failOnDataLoss" which is reported in SPARK-23685. 
> I've just checked and there are several other parameters which suffer from 
> the same issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26793) Remove spark.shuffle.manager

2019-01-31 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian resolved SPARK-26793.
-
Resolution: Invalid

> Remove spark.shuffle.manager
> 
>
> Key: SPARK-26793
> URL: https://issues.apache.org/jira/browse/SPARK-26793
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Minor
>
> Currently, `ShuffleManager` always uses `SortShuffleManager`,  I think this 
> configuration can be removed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25136) unable to use HDFS checkpoint directories after driver restart

2019-01-31 Thread Robert Reid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757828#comment-16757828
 ] 

Robert Reid commented on SPARK-25136:
-

[~gsomogyi] I haven't had a chance to retry it. Our build system has changed 
since I last worked on this, so I'll need to rework some things to get 
that environment runnable.

> unable to use HDFS checkpoint directories after driver restart
> --
>
> Key: SPARK-25136
> URL: https://issues.apache.org/jira/browse/SPARK-25136
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Robert Reid
>Priority: Major
>
> I have been attempting to work around the issue first discussed at the end of 
> SPARK-20894. The problem that we are encountering is the inability of the 
> Spark Driver to process additional data after restart. Upon restart it 
> reprocesses batches from the initial run. But when processing the first batch 
> it crashes because of missing checkpoint files.
> We have a structured streaming job running on a Spark Standalone cluster. We 
> restart the Spark Driver to renew the Kerberos token in use to access HDFS. 
> Excerpts of log lines from a run are included below. In short, when the new 
> driver starts it reports that it is going to use a new checkpoint directory 
> in HDFS. Workers report that they have loaded a StateStoreProviders that 
> matches the directory. But then the worker reports that it cannot read the 
> delta files. This causes the driver to crash.
> The HDFS directories are present but empty. Further, the directory 
> permissions for the original and new checkpoint directories are the same. The 
> worker never crashes.
> As mentioned in SPARK-20894, deleting the _spark_metadata directory makes 
> subsequent restarts succeed.
> Here is a timeline of log records from a recent run.
> A new run began at 00:29:21. These entries from a worker log look good.
> {{18/08/16 00:30:21 INFO HDFSBackedStateStoreProvider: Retrieved version 0 of 
> HDFSStateStoreProvider[id = (op=0,part=0),dir = 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0]
>  for update}}
> {{18/08/16 00:30:23 INFO HDFSBackedStateStoreProvider: Committed version 1 
> for 
> HDFSStateStore[id=(op=0,part=0),dir=hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0]
>  to file 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0/1.delta}}
> As the shutdown is occurring the worker reports
> {{18/08/16 00:39:11 INFO HDFSBackedStateStoreProvider: Aborted version 29 for 
> HDFSStateStore[id=(op=0,part=0),dir=hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0]}}
> The restart began at 00:39:38.
> Driver log entries
> {{18/08/16 00:39:51 INFO MicroBatchExecution: Starting [id = 
> e188d15f-e26a-48fd-9ce6-8c57ce53c2c1, runId = 
> b7ee0163-47db-4392-ab66-94d36ce63074]. Use 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040
>  to store the query checkpoint.}}
> {{18/08/16 00:40:26 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 10, 
> 10.251.104.164, executor 3): java.lang.IllegalStateException: Error reading 
> delta file 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta
>  of HDFSStateStoreProvider[id = (op=0,part=0),dir = 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0]:
>  
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta
>  does not exist}}
> {{Caused by: 
> org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File 
> does not exist: 
> /projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta}}
>  
> Worker log entries
> {{18/08/16 00:40:26 INFO StateStore: Reported that the loaded instance 
> 

[jira] [Commented] (SPARK-26783) Kafka parameter documentation doesn't match with the reality (upper/lowercase)

2019-01-31 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757824#comment-16757824
 ] 

Shixiong Zhu commented on SPARK-26783:
--

[~gsomogyi] This seems just an API document issue. Right? If the user passes 
"failOnDataLoss", the Kafka source will pick it up correctly.

> Kafka parameter documentation doesn't match with the reality (upper/lowercase)
> --
>
> Key: SPARK-26783
> URL: https://issues.apache.org/jira/browse/SPARK-26783
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Minor
>
> A good example for this is "failOnDataLoss" which is reported in SPARK-23685. 
> I've just checked and there are several other parameters which suffer from 
> the same issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly

2019-01-31 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26806:
-
Description: 
Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will 
make "avg" become "NaN". And whatever gets merged with the result of 
"zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong will 
return "0" and the user will see the following incorrect report:

{code}
"eventTime" : {
"avg" : "1970-01-01T00:00:00.000Z",
"max" : "2019-01-31T12:57:00.000Z",
"min" : "2019-01-30T18:44:04.000Z",
"watermark" : "1970-01-01T00:00:00.000Z"
  }
{code}

This issue was reported by [~liancheng]

  was:
Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will 
make "avg" become "NaN". And whatever gets merged with the result of 
"zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong will 
return "0" and the user will see the following incorrect report:

{code}
"eventTime" : {
"avg" : "1970-01-01T00:00:00.000Z",
"max" : "2019-01-31T12:57:00.000Z",
"min" : "2019-01-30T18:44:04.000Z",
"watermark" : "1970-01-01T00:00:00.000Z"
  }
{code}


> EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
> 
>
> Key: SPARK-26806
> URL: https://issues.apache.org/jira/browse/SPARK-26806
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.1, 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will 
> make "avg" become "NaN". And whatever gets merged with the result of 
> "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong 
> will return "0" and the user will see the following incorrect report:
> {code}
> "eventTime" : {
> "avg" : "1970-01-01T00:00:00.000Z",
> "max" : "2019-01-31T12:57:00.000Z",
> "min" : "2019-01-30T18:44:04.000Z",
> "watermark" : "1970-01-01T00:00:00.000Z"
>   }
> {code}
> This issue was reported by [~liancheng]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly

2019-01-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26806:


Assignee: Apache Spark  (was: Shixiong Zhu)

> EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
> 
>
> Key: SPARK-26806
> URL: https://issues.apache.org/jira/browse/SPARK-26806
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.1, 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Major
>
> Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will 
> make "avg" become "NaN". And whatever gets merged with the result of 
> "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong 
> will return "0" and the user will see the following incorrect report:
> {code}
> "eventTime" : {
> "avg" : "1970-01-01T00:00:00.000Z",
> "max" : "2019-01-31T12:57:00.000Z",
> "min" : "2019-01-30T18:44:04.000Z",
> "watermark" : "1970-01-01T00:00:00.000Z"
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly

2019-01-31 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26806:
-
Reporter: liancheng  (was: Shixiong Zhu)

> EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
> 
>
> Key: SPARK-26806
> URL: https://issues.apache.org/jira/browse/SPARK-26806
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.1, 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: liancheng
>Assignee: Shixiong Zhu
>Priority: Major
>
> Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will 
> make "avg" become "NaN". And whatever gets merged with the result of 
> "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong 
> will return "0" and the user will see the following incorrect report:
> {code}
> "eventTime" : {
> "avg" : "1970-01-01T00:00:00.000Z",
> "max" : "2019-01-31T12:57:00.000Z",
> "min" : "2019-01-30T18:44:04.000Z",
> "watermark" : "1970-01-01T00:00:00.000Z"
>   }
> {code}
> This issue was reported by [~liancheng]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly

2019-01-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26806:


Assignee: Shixiong Zhu  (was: Apache Spark)

> EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
> 
>
> Key: SPARK-26806
> URL: https://issues.apache.org/jira/browse/SPARK-26806
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.1, 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will 
> make "avg" become "NaN". And whatever gets merged with the result of 
> "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong 
> will return "0" and the user will see the following incorrect report:
> {code}
> "eventTime" : {
> "avg" : "1970-01-01T00:00:00.000Z",
> "max" : "2019-01-31T12:57:00.000Z",
> "min" : "2019-01-30T18:44:04.000Z",
> "watermark" : "1970-01-01T00:00:00.000Z"
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly

2019-01-31 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26806:
-
Affects Version/s: 2.2.1
   2.3.0
   2.3.1
   2.3.2

> EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
> 
>
> Key: SPARK-26806
> URL: https://issues.apache.org/jira/browse/SPARK-26806
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.1, 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will 
> make "avg" become "NaN". And whatever gets merged with the result of 
> "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong 
> will return "0" and the user will see the following incorrect report:
> {code}
> "eventTime" : {
> "avg" : "1970-01-01T00:00:00.000Z",
> "max" : "2019-01-31T12:57:00.000Z",
> "min" : "2019-01-30T18:44:04.000Z",
> "watermark" : "1970-01-01T00:00:00.000Z"
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly

2019-01-31 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-26806:


 Summary: EventTimeStats.merge doesn't handle "zero.merge(zero)" 
correctly
 Key: SPARK-26806
 URL: https://issues.apache.org/jira/browse/SPARK-26806
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will 
make "avg" become "NaN". And whatever gets merged with the result of 
"zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong will 
return "0" and the user will see the following incorrect report:

{code}
"eventTime" : {
"avg" : "1970-01-01T00:00:00.000Z",
"max" : "2019-01-31T12:57:00.000Z",
"min" : "2019-01-30T18:44:04.000Z",
"watermark" : "1970-01-01T00:00:00.000Z"
  }
{code}
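
To make the failure mode concrete, here is a self-contained sketch in the spirit of 
the report; Stats below is a simplified stand-in, not the actual EventTimeStats 
implementation, and guardedMerge is just one illustration of a zero-count guard, not 
necessarily the fix that gets merged:

{code:scala}
// Simplified stand-in for EventTimeStats (not the actual Spark class),
// pasteable into a Scala REPL.
case class Stats(var max: Long, var min: Long, var avg: Double, var count: Long) {
  def merge(that: Stats): Unit = {
    max = math.max(max, that.max)
    min = math.min(min, that.min)
    count += that.count
    // When both sides are empty this is 0.0 * 0 / 0 => NaN, and NaN then
    // sticks through every later merge.
    avg += (that.avg - avg) * that.count / count
  }
  // One possible guard: skip the update entirely when the other side is empty.
  def guardedMerge(that: Stats): Unit = if (that.count > 0) merge(that)
}
def zero: Stats = Stats(Long.MinValue, Long.MaxValue, 0.0, 0L)

val bad = zero
bad.merge(zero)
bad.avg          // NaN
bad.avg.toLong   // 0, which renders as 1970-01-01T00:00:00.000Z

val ok = zero
ok.guardedMerge(zero)
ok.avg           // 0.0
{code}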



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26654) Use Timestamp/DateFormatter in CatalogColumnStat

2019-01-31 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757814#comment-16757814
 ] 

Apache Spark commented on SPARK-26654:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/23662

> Use Timestamp/DateFormatter in CatalogColumnStat
> 
>
> Key: SPARK-26654
> URL: https://issues.apache.org/jira/browse/SPARK-26654
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Need to switch fromExternalString on Timestamp/DateFormatters, in particular:
> https://github.com/apache/spark/blob/3b7395fe025a4c9a591835e53ac6ca05be6868f1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L481-L482



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26654) Use Timestamp/DateFormatter in CatalogColumnStat

2019-01-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26654:


Assignee: (was: Apache Spark)

> Use Timestamp/DateFormatter in CatalogColumnStat
> 
>
> Key: SPARK-26654
> URL: https://issues.apache.org/jira/browse/SPARK-26654
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Need to switch fromExternalString on Timestamp/DateFormatters, in particular:
> https://github.com/apache/spark/blob/3b7395fe025a4c9a591835e53ac6ca05be6868f1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L481-L482



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26654) Use Timestamp/DateFormatter in CatalogColumnStat

2019-01-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26654:


Assignee: Apache Spark

> Use Timestamp/DateFormatter in CatalogColumnStat
> 
>
> Key: SPARK-26654
> URL: https://issues.apache.org/jira/browse/SPARK-26654
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Need to switch fromExternalString on Timestamp/DateFormatters, in particular:
> https://github.com/apache/spark/blob/3b7395fe025a4c9a591835e53ac6ca05be6868f1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L481-L482



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26805) Eliminate double checking of stringToDate and stringToTimestamp inputs

2019-01-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26805:


Assignee: Apache Spark

> Eliminate double checking of stringToDate and stringToTimestamp inputs
> --
>
> Key: SPARK-26805
> URL: https://issues.apache.org/jira/browse/SPARK-26805
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> The LocalDate and LocalTime classes, as well as the other java.time classes 
> used inside stringToTimestamp and stringToDate, already validate their input 
> parameters. There is no need to do that twice. The ticket aims to remove 
> isInvalidDate() from DateTimeUtils and eliminate such double checking. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26757) GraphX EdgeRDDImpl and VertexRDDImpl `count` method cannot handle empty RDDs

2019-01-31 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26757.
---
   Resolution: Fixed
Fix Version/s: 2.3.4
   2.4.1
   3.0.0

Issue resolved by pull request 23681
[https://github.com/apache/spark/pull/23681]

> GraphX EdgeRDDImpl and VertexRDDImpl `count` method cannot handle empty RDDs
> 
>
> Key: SPARK-26757
> URL: https://issues.apache.org/jira/browse/SPARK-26757
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.3.1, 2.3.2, 2.4.0
>Reporter: Huon Wilson
>Assignee: Huon Wilson
>Priority: Minor
> Fix For: 3.0.0, 2.4.1, 2.3.4
>
>
> The {{EdgeRDDImpl}} and {{VertexRDDImpl}} types provided by {{GraphX}} throw 
> a {{java.lang.UnsupportedOperationException: empty collection}} exception if 
> {{count}} is called on an empty instance, when they should return 0.
> {code:scala}
> import org.apache.spark.graphx.{Graph, Edge}
> val graph = Graph.fromEdges(sc.emptyRDD[Edge[Unit]], 0)
> graph.vertices.count
> graph.edges.count
> {code}
> Running that code in a spark-shell:
> {code:none}
> scala> import org.apache.spark.graphx.{Graph, Edge}
> import org.apache.spark.graphx.{Graph, Edge}
> scala> val graph = Graph.fromEdges(sc.emptyRDD[Edge[Unit]], 0)
> graph: org.apache.spark.graphx.Graph[Int,Unit] = 
> org.apache.spark.graphx.impl.GraphImpl@6879e983
> scala> graph.vertices.count
> java.lang.UnsupportedOperationException: empty collection
>   at 
> org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$36.apply(RDD.scala:1031)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$36.apply(RDD.scala:1031)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1031)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>   at org.apache.spark.rdd.RDD.reduce(RDD.scala:1011)
>   at org.apache.spark.graphx.impl.VertexRDDImpl.count(VertexRDDImpl.scala:90)
>   ... 49 elided
> scala> graph.edges.count
> java.lang.UnsupportedOperationException: empty collection
>   at 
> org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$36.apply(RDD.scala:1031)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$36.apply(RDD.scala:1031)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1031)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>   at org.apache.spark.rdd.RDD.reduce(RDD.scala:1011)
>   at org.apache.spark.graphx.impl.EdgeRDDImpl.count(EdgeRDDImpl.scala:90)
>   ... 49 elided
> {code}
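
A sketch of the general technique for making count tolerant of empty RDDs (not 
necessarily the exact change merged in pull request 23681): aggregate with fold over 
a zero element instead of reduce, since reduce is what throws on an empty collection.

{code:scala}
// In spark-shell: reduce throws on an empty RDD, while fold over a zero element
// returns the zero, which is the behaviour count should have here.
val empty = sc.emptyRDD[Long]
// empty.reduce(_ + _)      // java.lang.UnsupportedOperationException: empty collection
empty.fold(0L)(_ + _)       // Long = 0
{code}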



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26805) Eliminate double checking of stringToDate and stringToTimestamp inputs

2019-01-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26805:


Assignee: (was: Apache Spark)

> Eliminate double checking of stringToDate and stringToTimestamp inputs
> --
>
> Key: SPARK-26805
> URL: https://issues.apache.org/jira/browse/SPARK-26805
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> The LocalDate and LocalTime classes, as well as the other java.time classes 
> used inside stringToTimestamp and stringToDate, already validate their input 
> parameters. There is no need to do that twice. The ticket aims to remove 
> isInvalidDate() from DateTimeUtils and eliminate such double checking. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26757) GraphX EdgeRDDImpl and VertexRDDImpl `count` method cannot handle empty RDDs

2019-01-31 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26757:
-

Assignee: Huon Wilson

> GraphX EdgeRDDImpl and VertexRDDImpl `count` method cannot handle empty RDDs
> 
>
> Key: SPARK-26757
> URL: https://issues.apache.org/jira/browse/SPARK-26757
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.3.1, 2.3.2, 2.4.0
>Reporter: Huon Wilson
>Assignee: Huon Wilson
>Priority: Minor
>
> The {{EdgeRDDImpl}} and {{VertexRDDImpl}} types provided by {{GraphX}} throw 
> a {{java.lang.UnsupportedOperationException: empty collection}} exception if 
> {{count}} is called on an empty instance, when they should return 0.
> {code:scala}
> import org.apache.spark.graphx.{Graph, Edge}
> val graph = Graph.fromEdges(sc.emptyRDD[Edge[Unit]], 0)
> graph.vertices.count
> graph.edges.count
> {code}
> Running that code in a spark-shell:
> {code:none}
> scala> import org.apache.spark.graphx.{Graph, Edge}
> import org.apache.spark.graphx.{Graph, Edge}
> scala> val graph = Graph.fromEdges(sc.emptyRDD[Edge[Unit]], 0)
> graph: org.apache.spark.graphx.Graph[Int,Unit] = 
> org.apache.spark.graphx.impl.GraphImpl@6879e983
> scala> graph.vertices.count
> java.lang.UnsupportedOperationException: empty collection
>   at 
> org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$36.apply(RDD.scala:1031)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$36.apply(RDD.scala:1031)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1031)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>   at org.apache.spark.rdd.RDD.reduce(RDD.scala:1011)
>   at org.apache.spark.graphx.impl.VertexRDDImpl.count(VertexRDDImpl.scala:90)
>   ... 49 elided
> scala> graph.edges.count
> java.lang.UnsupportedOperationException: empty collection
>   at 
> org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$36.apply(RDD.scala:1031)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$36.apply(RDD.scala:1031)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1031)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>   at org.apache.spark.rdd.RDD.reduce(RDD.scala:1011)
>   at org.apache.spark.graphx.impl.EdgeRDDImpl.count(EdgeRDDImpl.scala:90)
>   ... 49 elided
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26805) Eliminate double checking of stringToDate and stringToTimestamp inputs

2019-01-31 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-26805:
--

 Summary: Eliminate double checking of stringToDate and 
stringToTimestamp inputs
 Key: SPARK-26805
 URL: https://issues.apache.org/jira/browse/SPARK-26805
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


The LocalDate and LocalTime classes, as well as the other java.time classes used 
inside stringToTimestamp and stringToDate, already validate their input 
parameters. There is no need to do that twice. The ticket aims to remove 
isInvalidDate() from DateTimeUtils and eliminate such double checking. 
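
A small, non-Spark illustration of the point: the java.time factory methods already 
reject out-of-range components, so an extra pre-validation pass adds nothing.

{code:scala}
import java.time.{LocalDate, LocalTime}
import scala.util.Try

// The factory methods throw java.time.DateTimeException on out-of-range input,
// so there is nothing left for a separate isInvalidDate-style check to catch.
Try(LocalDate.of(2019, 2, 29))  // Failure: 2019 is not a leap year
Try(LocalTime.of(25, 0))        // Failure: hour 25 is out of range
Try(LocalDate.of(2020, 2, 29))  // Success(2020-02-29)
{code}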



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26799) Make ANTLR v4 version consistent between Maven and SBT

2019-01-31 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-26799.
-
   Resolution: Fixed
 Assignee: Chenxiao Mao
Fix Version/s: 3.0.0

> Make ANTLR v4 version consistent between Maven and SBT
> --
>
> Key: SPARK-26799
> URL: https://issues.apache.org/jira/browse/SPARK-26799
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Assignee: Chenxiao Mao
>Priority: Minor
> Fix For: 3.0.0
>
>
> Currently ANTLR v4 versions used by Maven and SBT are slightly different. 
> Maven uses 4.7.1 while SBT uses 4.7.
>  * Maven(pom.xml): 4.7.1
>  * SBT(project/SparkBuild): antlr4Version in Antlr4 := "4.7"
> We should make Maven and SBT use a single version.
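
Presumably the SBT side can be aligned with a one-line change to the setting quoted 
above (a sketch; the actual patch may differ):

{code:scala}
// project/SparkBuild.scala -- align the SBT-managed ANTLR version with pom.xml
antlr4Version in Antlr4 := "4.7.1"
{code}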



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26734) StackOverflowError on WAL serialization caused by large receivedBlockQueue

2019-01-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26734:


Assignee: (was: Apache Spark)

> StackOverflowError on WAL serialization caused by large receivedBlockQueue
> --
>
> Key: SPARK-26734
> URL: https://issues.apache.org/jira/browse/SPARK-26734
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 2.3.1, 2.3.2, 2.4.0
> Environment: spark 2.4.0 streaming job
> java 1.8
> scala 2.11.12
>Reporter: Ross M. Lodge
>Priority: Major
>
> We encountered an intermittent StackOverflowError with a stack trace similar 
> to:
>  
> {noformat}
> Exception in thread "JobGenerator" java.lang.StackOverflowError
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509){noformat}
> The name of the thread has been seen to be either "JobGenerator" or 
> "streaming-start", depending on when in the lifecycle of the job the problem 
> occurs.  It appears to only occur in streaming jobs with checkpointing and 
> WAL enabled; this has prevented us from upgrading to v2.4.0.
>  
> Via debugging, we tracked this down to allocateBlocksToBatch in 
> ReceivedBlockTracker:
> {code:java}
> /**
>  * Allocate all unallocated blocks to the given batch.
>  * This event will get written to the write ahead log (if enabled).
>  */
> def allocateBlocksToBatch(batchTime: Time): Unit = synchronized {
>   if (lastAllocatedBatchTime == null || batchTime > lastAllocatedBatchTime) {
> val streamIdToBlocks = streamIds.map { streamId =>
>   (streamId, getReceivedBlockQueue(streamId).clone())
> }.toMap
> val allocatedBlocks = AllocatedBlocks(streamIdToBlocks)
> if (writeToLog(BatchAllocationEvent(batchTime, allocatedBlocks))) {
>   streamIds.foreach(getReceivedBlockQueue(_).clear())
>   timeToAllocatedBlocks.put(batchTime, allocatedBlocks)
>   lastAllocatedBatchTime = batchTime
> } else {
>   logInfo(s"Possibly processed batch $batchTime needs to be processed 
> again in WAL recovery")
> }
>   } else {
> // This situation occurs when:
> // 1. WAL is ended with BatchAllocationEvent, but without 
> BatchCleanupEvent,
> // possibly processed batch job or half-processed batch job need to be 
> processed again,
> // so the batchTime will be equal to lastAllocatedBatchTime.
> // 2. Slow checkpointing makes recovered batch time older than WAL 
> recovered
> // lastAllocatedBatchTime.
> // This situation will only occurs in recovery time.
> logInfo(s"Possibly processed batch $batchTime needs to be processed again 
> in WAL recovery")
>   }
> }
> {code}
> Prior to 2.3.1, this code did
> {code:java}
> getReceivedBlockQueue(streamId).dequeueAll(x => true){code}
> but it was changed as part of SPARK-23991 to
> {code:java}
> getReceivedBlockQueue(streamId).clone(){code}
> We've not been able to reproduce this in a test of the actual above method, 
> but we've been able to produce a test that reproduces it by putting a lot of 
> values into the queue:
>  
> {code:java}
> class SerializationFailureTest extends FunSpec {
>   private val logger = LoggerFactory.getLogger(getClass)
>   private type ReceivedBlockQueue = mutable.Queue[ReceivedBlockInfo]
>   describe("Queue") {
> it("should be serializable") {
>   runTest(1062)
> }
> it("should not be serializable") {
>   runTest(1063)
> }
> it("should DEFINITELY not be serializable") {
>   runTest(199952)
> }
>   }
>   private def runTest(mx: Int): Array[Byte] = {
> try {
>   val random = new scala.util.Random()
>   val queue = new ReceivedBlockQueue()
>   for (_ <- 0 until mx) {
> queue += ReceivedBlockInfo(
>   streamId = 0,
>   numRecords = Some(random.nextInt(5)),
>   metadataOption = None,
>   blockStoreResult = WriteAheadLogBasedStoreResult(
> blockId = StreamBlockId(0, random.nextInt()),
> numRecords = Some(random.nextInt(5)),
> walRecordHandle = FileBasedWriteAheadLogSegment(
>   path = 
> s"""hdfs://foo.bar.com:8080/spark/streaming/BAZ/7/receivedData/0/log-${random.nextInt()}-${random.nextInt()}""",
>   offset = random.nextLong(),
>   length = random.nextInt()
> )
>   )
> )
>   }
>   val record = BatchAllocationEvent(
> Time(154832040L), 
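
A standalone sketch of why a large linked mutable.Queue is fragile to serialize and 
how a flat snapshot of the same elements sidesteps the deep recursion; this only 
illustrates the failure mode and one possible mitigation, not the fix that was 
eventually merged, and the element count is illustrative.

{code:scala}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import scala.collection.mutable

def serializedSize(o: AnyRef): Int = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(o)
  out.close()
  bytes.size()
}

// mutable.Queue is backed by a linked structure, so default Java serialization
// can recurse once per element; with enough elements that exhausts the stack.
val queue = mutable.Queue((1 to 200000).map(i => s"block-$i"): _*)

// serializedSize(queue)       // may throw java.lang.StackOverflowError
serializedSize(queue.toArray)  // a flat array of the same elements serializes fine
{code}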

[jira] [Assigned] (SPARK-26734) StackOverflowError on WAL serialization caused by large receivedBlockQueue

2019-01-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26734:


Assignee: Apache Spark

> StackOverflowError on WAL serialization caused by large receivedBlockQueue
> --
>
> Key: SPARK-26734
> URL: https://issues.apache.org/jira/browse/SPARK-26734
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 2.3.1, 2.3.2, 2.4.0
> Environment: spark 2.4.0 streaming job
> java 1.8
> scala 2.11.12
>Reporter: Ross M. Lodge
>Assignee: Apache Spark
>Priority: Major
>
> We encountered an intermittent StackOverflowError with a stack trace similar 
> to:
>  
> {noformat}
> Exception in thread "JobGenerator" java.lang.StackOverflowError
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509){noformat}
> The name of the thread has been seen to be either "JobGenerator" or 
> "streaming-start", depending on when in the lifecycle of the job the problem 
> occurs.  It appears to only occur in streaming jobs with checkpointing and 
> WAL enabled; this has prevented us from upgrading to v2.4.0.
>  
> Via debugging, we tracked this down to allocateBlocksToBatch in 
> ReceivedBlockTracker:
> {code:java}
> /**
>  * Allocate all unallocated blocks to the given batch.
>  * This event will get written to the write ahead log (if enabled).
>  */
> def allocateBlocksToBatch(batchTime: Time): Unit = synchronized {
>   if (lastAllocatedBatchTime == null || batchTime > lastAllocatedBatchTime) {
> val streamIdToBlocks = streamIds.map { streamId =>
>   (streamId, getReceivedBlockQueue(streamId).clone())
> }.toMap
> val allocatedBlocks = AllocatedBlocks(streamIdToBlocks)
> if (writeToLog(BatchAllocationEvent(batchTime, allocatedBlocks))) {
>   streamIds.foreach(getReceivedBlockQueue(_).clear())
>   timeToAllocatedBlocks.put(batchTime, allocatedBlocks)
>   lastAllocatedBatchTime = batchTime
> } else {
>   logInfo(s"Possibly processed batch $batchTime needs to be processed 
> again in WAL recovery")
> }
>   } else {
> // This situation occurs when:
> // 1. WAL is ended with BatchAllocationEvent, but without 
> BatchCleanupEvent,
> // possibly processed batch job or half-processed batch job need to be 
> processed again,
> // so the batchTime will be equal to lastAllocatedBatchTime.
> // 2. Slow checkpointing makes recovered batch time older than WAL 
> recovered
> // lastAllocatedBatchTime.
> // This situation will only occurs in recovery time.
> logInfo(s"Possibly processed batch $batchTime needs to be processed again 
> in WAL recovery")
>   }
> }
> {code}
> Prior to 2.3.1, this code did
> {code:java}
> getReceivedBlockQueue(streamId).dequeueAll(x => true){code}
> but it was changed as part of SPARK-23991 to
> {code:java}
> getReceivedBlockQueue(streamId).clone(){code}
> We've not been able to reproduce this in a test of the actual above method, 
> but we've been able to produce a test that reproduces it by putting a lot of 
> values into the queue:
>  
> {code:java}
> class SerializationFailureTest extends FunSpec {
>   private val logger = LoggerFactory.getLogger(getClass)
>   private type ReceivedBlockQueue = mutable.Queue[ReceivedBlockInfo]
>   describe("Queue") {
> it("should be serializable") {
>   runTest(1062)
> }
> it("should not be serializable") {
>   runTest(1063)
> }
> it("should DEFINITELY not be serializable") {
>   runTest(199952)
> }
>   }
>   private def runTest(mx: Int): Array[Byte] = {
> try {
>   val random = new scala.util.Random()
>   val queue = new ReceivedBlockQueue()
>   for (_ <- 0 until mx) {
> queue += ReceivedBlockInfo(
>   streamId = 0,
>   numRecords = Some(random.nextInt(5)),
>   metadataOption = None,
>   blockStoreResult = WriteAheadLogBasedStoreResult(
> blockId = StreamBlockId(0, random.nextInt()),
> numRecords = Some(random.nextInt(5)),
> walRecordHandle = FileBasedWriteAheadLogSegment(
>   path = 
> s"""hdfs://foo.bar.com:8080/spark/streaming/BAZ/7/receivedData/0/log-${random.nextInt()}-${random.nextInt()}""",
>   offset = random.nextLong(),
>   length = random.nextInt()
> )
>   )
> )
>   }
>   val record = BatchAllocationEvent(
> 

[jira] [Issue Comment Deleted] (SPARK-24432) Add support for dynamic resource allocation

2019-01-31 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-24432:
---
Comment: was deleted

(was: vanzin closed pull request #22722: [SPARK-24432][k8s] Add support for 
dynamic resource allocation on Kubernetes 
URL: https://github.com/apache/spark/pull/22722
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git 
a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/kubernetes/KubernetesExternalShuffleClient.java
 
b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/kubernetes/KubernetesExternalShuffleClient.java
new file mode 100644
index 0..7135d1af5facd
--- /dev/null
+++ 
b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/kubernetes/KubernetesExternalShuffleClient.java
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.network.shuffle.kubernetes;
+
+import com.google.common.util.concurrent.ThreadFactoryBuilder;
+import org.apache.spark.network.client.RpcResponseCallback;
+import org.apache.spark.network.client.TransportClient;
+import org.apache.spark.network.sasl.SecretKeyHolder;
+import org.apache.spark.network.shuffle.ExternalShuffleClient;
+import org.apache.spark.network.shuffle.protocol.RegisterDriver;
+import org.apache.spark.network.shuffle.protocol.ShuffleServiceHeartbeat;
+import org.apache.spark.network.util.TransportConf;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.concurrent.Executors;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.concurrent.TimeUnit;
+
+/**
+ * A client for talking to the external shuffle service in Kubernetes cluster 
mode.
+ *
+ * This is used by the each Spark executor to register with a corresponding 
external
+ * shuffle service on the cluster. The purpose is for cleaning up shuffle files
+ * reliably if the application exits unexpectedly.
+ */
+public class KubernetesExternalShuffleClient extends ExternalShuffleClient {
+  private static final Logger logger = LoggerFactory
+.getLogger(KubernetesExternalShuffleClient.class);
+
+  private final ScheduledExecutorService heartbeaterThread =
+Executors.newSingleThreadScheduledExecutor(
+  new ThreadFactoryBuilder()
+.setDaemon(true)
+.setNameFormat("kubernetes-external-shuffle-client-heartbeater")
+.build());
+
+  /**
+   * Creates a Kubernetes external shuffle client that wraps the {@link 
ExternalShuffleClient}.
+   * Please refer to docs on {@link ExternalShuffleClient} for more 
information.
+   */
+  public KubernetesExternalShuffleClient(
+  TransportConf conf,
+  SecretKeyHolder secretKeyHolder,
+  boolean saslEnabled,
+  long registrationTimeoutMs) {
+super(conf, secretKeyHolder, saslEnabled, registrationTimeoutMs);
+  }
+
+  public void registerDriverWithShuffleService(
+  String host,
+  int port,
+  long heartbeatTimeoutMs,
+  long heartbeatIntervalMs) throws IOException, InterruptedException {
+checkInit();
+ByteBuffer registerDriver = new RegisterDriver(appId, 
heartbeatTimeoutMs).toByteBuffer();
+TransportClient client = clientFactory.createClient(host, port);
+client.sendRpc(registerDriver, new RegisterDriverCallback(client, 
heartbeatIntervalMs));
+  }
+
+  @Override
+  public void close() {
+heartbeaterThread.shutdownNow();
+super.close();
+  }
+
+  private class RegisterDriverCallback implements RpcResponseCallback {
+private final TransportClient client;
+private final long heartbeatIntervalMs;
+
+private RegisterDriverCallback(TransportClient client, long 
heartbeatIntervalMs) {
+  this.client = client;
+  this.heartbeatIntervalMs = heartbeatIntervalMs;
+}
+
+@Override
+public void onSuccess(ByteBuffer response) {
+  

[jira] [Assigned] (SPARK-24432) Add support for dynamic resource allocation

2019-01-31 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-24432:
--

Assignee: Marcelo Vanzin

> Add support for dynamic resource allocation
> ---
>
> Key: SPARK-24432
> URL: https://issues.apache.org/jira/browse/SPARK-24432
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Yinan Li
>Assignee: Marcelo Vanzin
>Priority: Major
>
> This is an umbrella ticket for work on adding support for dynamic resource 
> allocation in Kubernetes mode. This requires a Kubernetes-specific 
> external shuffle service. The feature is available in our fork at 
> github.com/apache-spark-on-k8s/spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24432) Add support for dynamic resource allocation

2019-01-31 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-24432:
--

Assignee: (was: Marcelo Vanzin)

> Add support for dynamic resource allocation
> ---
>
> Key: SPARK-24432
> URL: https://issues.apache.org/jira/browse/SPARK-24432
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Yinan Li
>Priority: Major
>
> This is an umbrella ticket for work on adding support for dynamic resource 
> allocation in Kubernetes mode. This requires a Kubernetes-specific 
> external shuffle service. The feature is available in our fork at 
> github.com/apache-spark-on-k8s/spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26804) Spark sql carries newline char from last csv column when imported

2019-01-31 Thread Raj (JIRA)
Raj created SPARK-26804:
---

 Summary: Spark sql carries newline char from last csv column when 
imported
 Key: SPARK-26804
 URL: https://issues.apache.org/jira/browse/SPARK-26804
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Raj


I am trying to generate external SQL tables in Databricks using a Spark SQL 
query. Below is my query. The query reads a CSV file and creates an external 
table, but it carries the newline char into the last column. Is there a way 
to resolve this issue? 

 

%sql
create table if not exists <>
using CSV
options ("header"="true", "inferschema"="true","multiLine"="true", "escape"='"')
location 
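
If the stray character turns out to be the carriage return from Windows-style \r\n 
line endings (a common cause when multiLine is enabled), one workaround is to strip 
it after reading and before creating the table; the path and column name below are 
placeholders, not part of the original report.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("csv-cr-workaround").getOrCreate()

// Path and column name are placeholders.
val raw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("multiLine", "true")
  .option("escape", "\"")
  .csv("/path/to/input.csv")

// Strip a trailing carriage return from the affected (last) column, then
// register or write `cleaned` as the external table.
val cleaned = raw.withColumn("last_col", regexp_replace(col("last_col"), "\\r$", ""))
{code}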



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26803) include sbin subdirectory in pyspark

2019-01-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26803:


Assignee: (was: Apache Spark)

> include sbin subdirectory in pyspark
> 
>
> Key: SPARK-26803
> URL: https://issues.apache.org/jira/browse/SPARK-26803
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Oliver Urs Lenz
>Priority: Minor
>  Labels: build, easyfix, newbie, ready-to-commit, usability
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> At present, the `sbin` subdirectory is not included in Pyspark. This seems to 
> be unintentional, and it makes e.g. the discussion at 
> [https://spark.apache.org/docs/latest/monitoring.html|https://spark.apache.org/docs/latest/monitoring.html]
>  confusing. In fact, the sbin directory can be downloaded manually into the 
> pyspark directory, but it would be nice if it were included automatically.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26803) include sbin subdirectory in pyspark

2019-01-31 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26803:
--
Shepherd:   (was: Sean Owen)

> include sbin subdirectory in pyspark
> 
>
> Key: SPARK-26803
> URL: https://issues.apache.org/jira/browse/SPARK-26803
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Oliver Urs Lenz
>Priority: Minor
>  Labels: build, easyfix, newbie, ready-to-commit, usability
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> At present, the `sbin` subdirectory is not included in Pyspark. This seems to 
> be unintentional, and it makes e.g. the discussion at 
> [https://spark.apache.org/docs/latest/monitoring.html|https://spark.apache.org/docs/latest/monitoring.html]
>  confusing. In fact, the sbin directory can be downloaded manually into the 
> pyspark directory, but it would be nice if it were included automatically.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26803) include sbin subdirectory in pyspark

2019-01-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26803:


Assignee: Apache Spark

> include sbin subdirectory in pyspark
> 
>
> Key: SPARK-26803
> URL: https://issues.apache.org/jira/browse/SPARK-26803
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Oliver Urs Lenz
>Assignee: Apache Spark
>Priority: Minor
>  Labels: build, easyfix, newbie, ready-to-commit, usability
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> At present, the `sbin` subdirectory is not included in Pyspark. This seems to 
> be unintentional, and it makes e.g. the discussion at 
> [https://spark.apache.org/docs/latest/monitoring.html|https://spark.apache.org/docs/latest/monitoring.html]
>  confusing. In fact, the sbin directory can be downloaded manually into the 
> pyspark directory, but it would be nice if it were included automatically.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26803) include sbin subdirectory in pyspark

2019-01-31 Thread Oliver Urs Lenz (JIRA)
Oliver Urs Lenz created SPARK-26803:
---

 Summary: include sbin subdirectory in pyspark
 Key: SPARK-26803
 URL: https://issues.apache.org/jira/browse/SPARK-26803
 Project: Spark
  Issue Type: Improvement
  Components: Build, PySpark
Affects Versions: 2.4.0, 3.0.0
Reporter: Oliver Urs Lenz


At present, the `sbin` subdirectory is not included in Pyspark. This seems to 
be unintentional, and it makes e.g. the discussion at 
[https://spark.apache.org/docs/latest/monitoring.html|https://spark.apache.org/docs/latest/monitoring.html]
 confusing. In fact, the sbin directory can be downloaded manually into the 
pyspark directory, but it would be nice if it were included automatically.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24736) --py-files not functional for non local URLs. It appears to pass non-local URL's into PYTHONPATH directly.

2019-01-31 Thread Oleg Frenkel (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757663#comment-16757663
 ] 

Oleg Frenkel commented on SPARK-24736:
--

Supporting local files with --py-files would be great. By "local files" I mean 
files that come from the submitting machine, outside the driver pod, rather 
than files that are already available in the driver docker container.

> --py-files not functional for non local URLs. It appears to pass non-local 
> URL's into PYTHONPATH directly.
> --
>
> Key: SPARK-24736
> URL: https://issues.apache.org/jira/browse/SPARK-24736
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 2.4.0
> Environment: Recent 2.4.0 from master branch, submitted on Linux to a 
> KOPS Kubernetes cluster created on AWS.
>  
>Reporter: Jonathan A Weaver
>Priority: Minor
>
> My spark-submit
> bin/spark-submit \
>         --master 
> k8s://[https://internal-api-test-k8s-local-7afed8-796273878.us-east-1.elb.amazonaws.com|https://internal-api-test-k8s-local-7afed8-796273878.us-east-1.elb.amazonaws.com/]
>  \
>         --deploy-mode cluster \
>         --name pytest \
>         --conf 
> spark.kubernetes.container.image=[412834075398.dkr.ecr.us-east-1.amazonaws.com/fids/pyspark-k8s:latest|http://412834075398.dkr.ecr.us-east-1.amazonaws.com/fids/pyspark-k8s:latest]
>  \
>         --conf 
> [spark.kubernetes.driver.pod.name|http://spark.kubernetes.driver.pod.name/]=spark-pi-driver
>  \
>         --conf 
> spark.kubernetes.authenticate.submission.caCertFile=[cluster.ca|http://cluster.ca/]
>  \
>         --conf spark.kubernetes.authenticate.submission.oauthToken=$TOK \
>         --conf spark.kubernetes.authenticate.driver.oauthToken=$TOK \
> --py-files "[https://s3.amazonaws.com/maxar-ids-fids/screw.zip]; \
> [https://s3.amazonaws.com/maxar-ids-fids/it.py]
>  
> *screw.zip is successfully downloaded and placed in SparkFIles.getRootPath()*
> 2018-07-01 07:33:43 INFO  SparkContext:54 - Added file 
> [https://s3.amazonaws.com/maxar-ids-fids/screw.zip] at 
> [https://s3.amazonaws.com/maxar-ids-fids/screw.zip] with timestamp 
> 1530430423297
> 2018-07-01 07:33:43 INFO  Utils:54 - Fetching 
> [https://s3.amazonaws.com/maxar-ids-fids/screw.zip] to 
> /var/data/spark-7aba748d-2bba-4015-b388-c2ba9adba81e/spark-0ed5a100-6efa-45ca-ad4c-d1e57af76ffd/userFiles-a053206e-33d9-4245-b587-f8ac26d4c240/fetchFileTemp1549645948768432992.tmp
> *I print out the  PYTHONPATH and PYSPARK_FILES environment variables from the 
> driver script:*
>      PYTHONPATH 
> /opt/spark/python/lib/pyspark.zip:/opt/spark/python/lib/py4j-0.10.7-src.zip:/opt/spark/jars/spark-core_2.11-2.4.0-SNAPSHOT.jar:/opt/spark/python/lib/pyspark.zip:/opt/spark/python/lib/py4j-*.zip:*[https://s3.amazonaws.com/maxar-ids-fids/screw.zip]*
>     PYSPARK_FILES [https://s3.amazonaws.com/maxar-ids-fids/screw.zip]
>  
> *I print out sys.path*
> ['/tmp/spark-fec3684b-8b63-4f43-91a4-2f2fa41a1914', 
> u'/var/data/spark-7aba748d-2bba-4015-b388-c2ba9adba81e/spark-0ed5a100-6efa-45ca-ad4c-d1e57af76ffd/userFiles-a053206e-33d9-4245-b587-f8ac26d4c240',
>  '/opt/spark/python/lib/pyspark.zip', 
> '/opt/spark/python/lib/py4j-0.10.7-src.zip', 
> '/opt/spark/jars/spark-core_2.11-2.4.0-SNAPSHOT.jar', 
> '/opt/spark/python/lib/py4j-*.zip', *'/opt/spark/work-dir/https', 
> '//[s3.amazonaws.com/maxar-ids-fids/screw.zip|http://s3.amazonaws.com/maxar-ids-fids/screw.zip]',*
>  '/usr/lib/python27.zip', '/usr/lib/python2.7', 
> '/usr/lib/python2.7/plat-linux2', '/usr/lib/python2.7/lib-tk', 
> '/usr/lib/python2.7/lib-old', '/usr/lib/python2.7/lib-dynload', 
> '/usr/lib/python2.7/site-packages']
>  
> *URL from PYTHONFILES gets placed in sys.path verbatim with obvious results.*
>  
> *Dump of spark config from container.*
> Spark config dumped:
> [(u'spark.master', 
> u'k8s://[https://internal-api-test-k8s-local-7afed8-796273878.us-east-1.elb.amazonaws.com|https://internal-api-test-k8s-local-7afed8-796273878.us-east-1.elb.amazonaws.com/]'),
>  (u'spark.kubernetes.authenticate.submission.oauthToken', 
> u''), 
> (u'spark.kubernetes.authenticate.driver.oauthToken', 
> u''), (u'spark.kubernetes.executor.podNamePrefix', 
> u'pytest-1530430411996'), (u'spark.kubernetes.memoryOverheadFactor', u'0.4'), 
> (u'spark.driver.blockManager.port', u'7079'), 
> (u'[spark.app.id|http://spark.app.id/]', u'spark-application-1530430424433'), 
> (u'[spark.app.name|http://spark.app.name/]', u'pytest'), 
> (u'[spark.executor.id|http://spark.executor.id/]', u'driver'), 
> (u'spark.driver.host', u'pytest-1530430411996-driver-svc.default.svc'), 
> (u'spark.kubernetes.container.image', 
> 

[jira] [Updated] (SPARK-26726) Synchronize the amount of memory used by the broadcast variable to the UI display

2019-01-31 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-26726:
---
Fix Version/s: 2.3.3

>   Synchronize the amount of memory used by the broadcast variable to the UI 
> display
> ---
>
> Key: SPARK-26726
> URL: https://issues.apache.org/jira/browse/SPARK-26726
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: hantiantian
>Assignee: hantiantian
>Priority: Major
> Fix For: 2.3.3, 2.4.1, 3.0.0
>
>
> The amount of memory used by the broadcast variable is not reflected in the 
> UI display. For example:
> spark-sql>  select /*+ broadcast(a)*/ a.id,b.id from a join b on a.id = b.id;
> View the app's driver log:
> 2019-01-25 16:45:23,726 INFO org.apache.spark.storage.BlockManagerInfo: Added 
> broadcast_4_piece0 in memory on 10.43.xx.xx:33907 (size: 6.6 KB, free: 2.5 GB)
>  2019-01-25 16:45:23,727 INFO org.apache.spark.storage.BlockManagerInfo: 
> Added broadcast_4_piece0 in memory on 10.43.xx.xx:38399 (size: 6.6 KB, free: 
> 2.5 GB)
>  2019-01-25 16:45:23,745 INFO org.apache.spark.storage.BlockManagerInfo: 
> Added broadcast_3_piece0 in memory on 10.43.xx.xx:33907 (size: 32.1 KB, free: 
> 2.5 GB)
>  2019-01-25 16:45:23,749 INFO org.apache.spark.storage.BlockManagerInfo: 
> Added broadcast_3_piece0 in memory on 10.43.xx.xx:38399 (size: 32.1 KB, free: 
> 2.5 GB)
>  2019-01-25 16:45:23,838 INFO org.apache.spark.storage.BlockManagerInfo: 
> Added broadcast_2_piece0 in memory on 10.43.xx.xx:38399 (size: 147.0 B, free: 
> 2.5 GB)
>  2019-01-25 16:45:23,840 INFO org.apache.spark.storage.BlockManagerInfo: 
> Added broadcast_2_piece0 in memory on 10.43.xx.xx:33907 (size: 147.0 B, free: 
> 2.5 GB)
>  
> The Web UI does not show the memory usage:
> ||Executor ID||Address||Status||RDD Blocks||Storage Memory||Disk 
> Used||Cores||Active Tasks||Failed Tasks||Complete Tasks||Total Tasks||Task 
> Time (GC Time)||Input||Shuffle Read||Shuffle Write||Logs||Thread Dump||
> |0|xxx:38399|Active|0|0.0 B / 2.7 GB|0.0 B|1|0|0|2|2|4 s (0.4 s)|8 B|0.0 
> B|0.0 
> B|[stdout|http://10.43.183.120:18085/logPage/?appId=app-20190125164426-0003=0=stdout]
>  
> [stderr|http://10.43.183.120:18085/logPage/?appId=app-20190125164426-0003=0=stderr]|[Thread
>  Dump|http://10.43.183.121:4040/executors/threadDump/?executorId=0]|
> |driver|xxx:47936|Active|0|0.0 B / 384.1 MB|0.0 B|0|0|0|0|0|0.0 ms (0.0 
> ms)|0.0 B|0.0 B|0.0 B| |[Thread 
> Dump|http://10.43.183.121:4040/executors/threadDump/?executorId=driver]|
> |1|xxx:47414|Active|0|0.0 B / 2.7 GB|0.0 B|1|0|0|0|0|0.0 ms (0.0 ms)|0.0 
> B|0.0 B|0.0 
> B|[stdout|http://10.43.183.121:18085/logPage/?appId=app-20190125164426-0003=1=stdout]
>  
> [stderr|http://10.43.183.121:18085/logPage/?appId=app-20190125164426-0003=1=stderr]|[Thread
>  Dump|http://10.43.183.121:4040/executors/threadDump/?executorId=1]|
> |2|xxx:33907|Active|0|0.0 B / 2.7 GB|0.0 B|1|0|0|2|2|4 s (0.2 s)|4 B|0.0 
> B|0.0 
> B|[stdout|http://10.43.183.122:18085/logPage/?appId=app-20190125164426-0003=2=stdout]
>  
> [stderr|http://10.43.183.122:18085/logPage/?appId=app-20190125164426-0003=2=stderr]|[Thread
>  Dump|http://10.43.183.121:4040/executors/threadDump/?executorId=2]|
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26802) CVE-2018-11760: Apache Spark local privilege escalation vulnerability

2019-01-31 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-26802:
-
Description: 
Severity: Important

Vendor: The Apache Software Foundation

Versions affected:
All Spark 1.x, Spark 2.0.x, and Spark 2.1.x versions
Spark 2.2.0 to 2.2.2
Spark 2.3.0 to 2.3.1

Description:
When using PySpark, it's possible for a different local user to connect to the 
Spark application and impersonate the user running the Spark application.  This 
affects versions 1.x, 2.0.x, 2.1.x, 2.2.0 to 2.2.2, and 2.3.0 to 2.3.1.

Mitigation:
1.x, 2.0.x, 2.1.x, and 2.2.x users should upgrade to 2.2.3 or newer
2.3.x users should upgrade to 2.3.2 or newer
Otherwise, affected users should avoid using PySpark in multi-user environments.

Credit:
This issue was reported by Luca Canali and Jose Carlos Luna Duran from CERN.

References:
https://spark.apache.org/security.html

This was fixed by
https://github.com/apache/spark/commit/15fc2372269159ea2556b028d4eb8860c4108650 
(master / 2.4)
https://github.com/apache/spark/commit/8080c937d3752aee2fd36f0045a057f7130f6fe4 
(branch-2.3)
https://github.com/apache/spark/commit/a5624c7ae29d6d49117dd78642879bf978212d30 
(branch-2.2)

  was:
Severity: Important

Vendor: The Apache Software Foundation

Versions affected:
All Spark 1.x, Spark 2.0.x, and Spark 2.1.x versions
Spark 2.2.0 to 2.2.2
Spark 2.3.0 to 2.3.1

Description:
When using PySpark, it's possible for a different local user to connect to the 
Spark application and impersonate the user running the Spark application.  This 
affects versions 1.x, 2.0.x, 2.1.x, 2.2.0 to 2.2.2, and 2.3.0 to 2.3.1.

Mitigation:
1.x, 2.0.x, 2.1.x, and 2.2.x users should upgrade to 2.2.3 or newer
2.3.x users should upgrade to 2.3.2 or newer
Otherwise, affected users should avoid using PySpark in multi-user environments.

Credit:
This issue was reported by Luca Canali and Jose Carlos Luna Duran from CERN.

References:
https://spark.apache.org/security.html

This was fixed by
https://github.com/apache/spark/commit/15fc2372269159ea2556b028d4eb8860c4108650
https://github.com/apache/spark/commit/8080c937d3752aee2fd36f0045a057f7130f6fe4
https://github.com/apache/spark/commit/a5624c7ae29d6d49117dd78642879bf978212d30


> CVE-2018-11760: Apache Spark local privilege escalation vulnerability
> -
>
> Key: SPARK-26802
> URL: https://issues.apache.org/jira/browse/SPARK-26802
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Security
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.2
>Reporter: Imran Rashid
>Assignee: Luca Canali
>Priority: Blocker
> Fix For: 2.2.3, 2.3.2, 2.4.0
>
>
> Severity: Important
> Vendor: The Apache Software Foundation
> Versions affected:
> All Spark 1.x, Spark 2.0.x, and Spark 2.1.x versions
> Spark 2.2.0 to 2.2.2
> Spark 2.3.0 to 2.3.1
> Description:
> When using PySpark, it's possible for a different local user to connect to 
> the Spark application and impersonate the user running the Spark application. 
>  This affects versions 1.x, 2.0.x, 2.1.x, 2.2.0 to 2.2.2, and 2.3.0 to 2.3.1.
> Mitigation:
> 1.x, 2.0.x, 2.1.x, and 2.2.x users should upgrade to 2.2.3 or newer
> 2.3.x users should upgrade to 2.3.2 or newer
> Otherwise, affected users should avoid using PySpark in multi-user 
> environments.
> Credit:
> This issue was reported by Luca Canali and Jose Carlos Luna Duran from CERN.
> References:
> https://spark.apache.org/security.html
> This was fixed by
> https://github.com/apache/spark/commit/15fc2372269159ea2556b028d4eb8860c4108650
>  (master / 2.4)
> https://github.com/apache/spark/commit/8080c937d3752aee2fd36f0045a057f7130f6fe4
>  (branch-2.3)
> https://github.com/apache/spark/commit/a5624c7ae29d6d49117dd78642879bf978212d30
>  (branch-2.2)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26802) CVE-2018-11760: Apache Spark local privilege escalation vulnerability

2019-01-31 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-26802:


 Summary: CVE-2018-11760: Apache Spark local privilege escalation 
vulnerability
 Key: SPARK-26802
 URL: https://issues.apache.org/jira/browse/SPARK-26802
 Project: Spark
  Issue Type: Bug
  Components: Security, PySpark
Affects Versions: 2.2.2, 2.1.3, 2.0.2, 1.6.3
Reporter: Imran Rashid
Assignee: Luca Canali
 Fix For: 2.4.0, 2.3.2, 2.2.3


Severity: Important

Vendor: The Apache Software Foundation

Versions affected:
All Spark 1.x, Spark 2.0.x, and Spark 2.1.x versions
Spark 2.2.0 to 2.2.2
Spark 2.3.0 to 2.3.1

Description:
When using PySpark, it's possible for a different local user to connect to the 
Spark application and impersonate the user running the Spark application.  This 
affects versions 1.x, 2.0.x, 2.1.x, 2.2.0 to 2.2.2, and 2.3.0 to 2.3.1.

Mitigation:
1.x, 2.0.x, 2.1.x, and 2.2.x users should upgrade to 2.2.3 or newer
2.3.x users should upgrade to 2.3.2 or newer
Otherwise, affected users should avoid using PySpark in multi-user environments.

Credit:
This issue was reported by Luca Canali and Jose Carlos Luna Duran from CERN.

References:
https://spark.apache.org/security.html

This was fixed by
https://github.com/apache/spark/commit/15fc2372269159ea2556b028d4eb8860c4108650
https://github.com/apache/spark/commit/8080c937d3752aee2fd36f0045a057f7130f6fe4
https://github.com/apache/spark/commit/a5624c7ae29d6d49117dd78642879bf978212d30



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26802) CVE-2018-11760: Apache Spark local privilege escalation vulnerability

2019-01-31 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-26802:
-
Description: 
Severity: Important

Vendor: The Apache Software Foundation

Versions affected:
All Spark 1.x, Spark 2.0.x, and Spark 2.1.x versions
Spark 2.2.0 to 2.2.2
Spark 2.3.0 to 2.3.1

Description:
When using PySpark, it's possible for a different local user to connect to the 
Spark application and impersonate the user running the Spark application.  This 
affects versions 1.x, 2.0.x, 2.1.x, 2.2.0 to 2.2.2, and 2.3.0 to 2.3.1.

Mitigation:
1.x, 2.0.x, 2.1.x, and 2.2.x users should upgrade to 2.2.3 or newer
2.3.x users should upgrade to 2.3.2 or newer
Otherwise, affected users should avoid using PySpark in multi-user environments.

Credit:
This issue was reported by Luca Canali and Jose Carlos Luna Duran from CERN.

References:
https://spark.apache.org/security.html

This was fixed by
master / 2.4:
https://github.com/apache/spark/commit/15fc2372269159ea2556b028d4eb8860c4108650
https://github.com/apache/spark/commit/0df6bf882907d7d76572f513168a144067d0e0ec

branch-2.3
https://github.com/apache/spark/commit/8080c937d3752aee2fd36f0045a057f7130f6fe4

branch-2.2
https://github.com/apache/spark/commit/a5624c7ae29d6d49117dd78642879bf978212d30 

branch-2.1 (note that this does not exist in any release)
https://github.com/apache/spark/commit/b2e0f68f615cbe2cf74f9813ece76c311fe8e911

  was:
Severity: Important

Vendor: The Apache Software Foundation

Versions affected:
All Spark 1.x, Spark 2.0.x, and Spark 2.1.x versions
Spark 2.2.0 to 2.2.2
Spark 2.3.0 to 2.3.1

Description:
When using PySpark, it's possible for a different local user to connect to the 
Spark application and impersonate the user running the Spark application.  This 
affects versions 1.x, 2.0.x, 2.1.x, 2.2.0 to 2.2.2, and 2.3.0 to 2.3.1.

Mitigation:
1.x, 2.0.x, 2.1.x, and 2.2.x users should upgrade to 2.2.3 or newer
2.3.x users should upgrade to 2.3.2 or newer
Otherwise, affected users should avoid using PySpark in multi-user environments.

Credit:
This issue was reported by Luca Canali and Jose Carlos Luna Duran from CERN.

References:
https://spark.apache.org/security.html

This was fixed by
https://github.com/apache/spark/commit/15fc2372269159ea2556b028d4eb8860c4108650 
(master / 2.4)
https://github.com/apache/spark/commit/8080c937d3752aee2fd36f0045a057f7130f6fe4 
(branch-2.3)
https://github.com/apache/spark/commit/a5624c7ae29d6d49117dd78642879bf978212d30 
(branch-2.2)


> CVE-2018-11760: Apache Spark local privilege escalation vulnerability
> -
>
> Key: SPARK-26802
> URL: https://issues.apache.org/jira/browse/SPARK-26802
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Security
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.2
>Reporter: Imran Rashid
>Assignee: Luca Canali
>Priority: Blocker
> Fix For: 2.2.3, 2.3.2, 2.4.0
>
>
> Severity: Important
> Vendor: The Apache Software Foundation
> Versions affected:
> All Spark 1.x, Spark 2.0.x, and Spark 2.1.x versions
> Spark 2.2.0 to 2.2.2
> Spark 2.3.0 to 2.3.1
> Description:
> When using PySpark, it's possible for a different local user to connect to 
> the Spark application and impersonate the user running the Spark application. 
>  This affects versions 1.x, 2.0.x, 2.1.x, 2.2.0 to 2.2.2, and 2.3.0 to 2.3.1.
> Mitigation:
> 1.x, 2.0.x, 2.1.x, and 2.2.x users should upgrade to 2.2.3 or newer
> 2.3.x users should upgrade to 2.3.2 or newer
> Otherwise, affected users should avoid using PySpark in multi-user 
> environments.
> Credit:
> This issue was reported by Luca Canali and Jose Carlos Luna Duran from CERN.
> References:
> https://spark.apache.org/security.html
> This was fixed by
> master / 2.4:
> https://github.com/apache/spark/commit/15fc2372269159ea2556b028d4eb8860c4108650
> https://github.com/apache/spark/commit/0df6bf882907d7d76572f513168a144067d0e0ec
> branch-2.3
> https://github.com/apache/spark/commit/8080c937d3752aee2fd36f0045a057f7130f6fe4
> branch-2.2
> https://github.com/apache/spark/commit/a5624c7ae29d6d49117dd78642879bf978212d30
>  
> branch-2.1 (note that this does not exist in any release)
> https://github.com/apache/spark/commit/b2e0f68f615cbe2cf74f9813ece76c311fe8e911



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26802) CVE-2018-11760: Apache Spark local privilege escalation vulnerability

2019-01-31 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-26802.
--
Resolution: Fixed

> CVE-2018-11760: Apache Spark local privilege escalation vulnerability
> -
>
> Key: SPARK-26802
> URL: https://issues.apache.org/jira/browse/SPARK-26802
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Security
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.2
>Reporter: Imran Rashid
>Assignee: Luca Canali
>Priority: Blocker
> Fix For: 2.4.0, 2.3.2, 2.2.3
>
>
> Severity: Important
> Vendor: The Apache Software Foundation
> Versions affected:
> All Spark 1.x, Spark 2.0.x, and Spark 2.1.x versions
> Spark 2.2.0 to 2.2.2
> Spark 2.3.0 to 2.3.1
> Description:
> When using PySpark, it's possible for a different local user to connect to 
> the Spark application and impersonate the user running the Spark application. 
>  This affects versions 1.x, 2.0.x, 2.1.x, 2.2.0 to 2.2.2, and 2.3.0 to 2.3.1.
> Mitigation:
> 1.x, 2.0.x, 2.1.x, and 2.2.x users should upgrade to 2.2.3 or newer
> 2.3.x users should upgrade to 2.3.2 or newer
> Otherwise, affected users should avoid using PySpark in multi-user 
> environments.
> Credit:
> This issue was reported by Luca Canali and Jose Carlos Luna Duran from CERN.
> References:
> https://spark.apache.org/security.html
> This was fixed by
> https://github.com/apache/spark/commit/15fc2372269159ea2556b028d4eb8860c4108650
> https://github.com/apache/spark/commit/8080c937d3752aee2fd36f0045a057f7130f6fe4
> https://github.com/apache/spark/commit/a5624c7ae29d6d49117dd78642879bf978212d30



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26726) Synchronize the amount of memory used by the broadcast variable to the UI display

2019-01-31 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-26726:
--

Assignee: hantiantian

>   Synchronize the amount of memory used by the broadcast variable to the UI 
> display
> ---
>
> Key: SPARK-26726
> URL: https://issues.apache.org/jira/browse/SPARK-26726
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: hantiantian
>Assignee: hantiantian
>Priority: Major
>
> The amount of memory used by the broadcast variable is not synchronized to 
> the UI display:
> spark-sql>  select /*+ broadcast(a)*/ a.id,b.id from a join b on a.id = b.id;
> View the app's driver log:
> 2019-01-25 16:45:23,726 INFO org.apache.spark.storage.BlockManagerInfo: Added 
> broadcast_4_piece0 in memory on 10.43.xx.xx:33907 (size: 6.6 KB, free: 2.5 GB)
>  2019-01-25 16:45:23,727 INFO org.apache.spark.storage.BlockManagerInfo: 
> Added broadcast_4_piece0 in memory on 10.43.xx.xx:38399 (size: 6.6 KB, free: 
> 2.5 GB)
>  2019-01-25 16:45:23,745 INFO org.apache.spark.storage.BlockManagerInfo: 
> Added broadcast_3_piece0 in memory on 10.43.xx.xx:33907 (size: 32.1 KB, free: 
> 2.5 GB)
>  2019-01-25 16:45:23,749 INFO org.apache.spark.storage.BlockManagerInfo: 
> Added broadcast_3_piece0 in memory on 10.43.xx.xx:38399 (size: 32.1 KB, free: 
> 2.5 GB)
>  2019-01-25 16:45:23,838 INFO org.apache.spark.storage.BlockManagerInfo: 
> Added broadcast_2_piece0 in memory on 10.43.xx.xx:38399 (size: 147.0 B, free: 
> 2.5 GB)
>  2019-01-25 16:45:23,840 INFO org.apache.spark.storage.BlockManagerInfo: 
> Added broadcast_2_piece0 in memory on 10.43.xx.xx:33907 (size: 147.0 B, free: 
> 2.5 GB)
>  
> The Web UI, however, does not show this memory usage:
> ||Executor ID||Address||Status||RDD Blocks||Storage Memory||Disk 
> Used||Cores||Active Tasks||Failed Tasks||Complete Tasks||Total Tasks||Task 
> Time (GC Time)||Input||Shuffle Read||Shuffle Write||Logs||Thread Dump||
> |0|xxx:38399|Active|0|0.0 B / 2.7 GB|0.0 B|1|0|0|2|2|4 s (0.4 s)|8 B|0.0 
> B|0.0 
> B|[stdout|http://10.43.183.120:18085/logPage/?appId=app-20190125164426-0003=0=stdout]
>  
> [stderr|http://10.43.183.120:18085/logPage/?appId=app-20190125164426-0003=0=stderr]|[Thread
>  Dump|http://10.43.183.121:4040/executors/threadDump/?executorId=0]|
> |driver|xxx:47936|Active|0|0.0 B / 384.1 MB|0.0 B|0|0|0|0|0|0.0 ms (0.0 
> ms)|0.0 B|0.0 B|0.0 B| |[Thread 
> Dump|http://10.43.183.121:4040/executors/threadDump/?executorId=driver]|
> |1|xxx:47414|Active|0|0.0 B / 2.7 GB|0.0 B|1|0|0|0|0|0.0 ms (0.0 ms)|0.0 
> B|0.0 B|0.0 
> B|[stdout|http://10.43.183.121:18085/logPage/?appId=app-20190125164426-0003=1=stdout]
>  
> [stderr|http://10.43.183.121:18085/logPage/?appId=app-20190125164426-0003=1=stderr]|[Thread
>  Dump|http://10.43.183.121:4040/executors/threadDump/?executorId=1]|
> |2|xxx:33907|Active|0|0.0 B / 2.7 GB|0.0 B|1|0|0|2|2|4 s (0.2 s)|4 B|0.0 
> B|0.0 
> B|[stdout|http://10.43.183.122:18085/logPage/?appId=app-20190125164426-0003=2=stdout]
>  
> [stderr|http://10.43.183.122:18085/logPage/?appId=app-20190125164426-0003=2=stderr]|[Thread
>  Dump|http://10.43.183.121:4040/executors/threadDump/?executorId=2]|
>  
>  
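As a rough cross-check while the UI does not show this, the block managers' storage-memory view can be queried from the driver. This is only a sketch, assuming a running SparkSession named spark; it does not change what the Executors page displays.

{code:java}
// Sketch: print the block managers' storage-memory accounting and compare it
// with the Executors page. Keys are "host:port"; values are (max, remaining) bytes.
spark.sparkContext.getExecutorMemoryStatus.foreach { case (bmId, (maxMem, remaining)) =>
  val usedKb = (maxMem - remaining) / 1024.0
  println(f"$bmId%-22s used=$usedKb%10.1f KB  max=${maxMem / 1048576.0}%8.1f MB")
}
{code}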



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26801) Spark unable to read valid avro types

2019-01-31 Thread Dhruve Ashar (JIRA)
Dhruve Ashar created SPARK-26801:


 Summary: Spark unable to read valid avro types
 Key: SPARK-26801
 URL: https://issues.apache.org/jira/browse/SPARK-26801
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Dhruve Ashar


Currently the external Avro package reads Avro schemas of type record only. 
This is probably because of the representation of InternalRow in Spark SQL. As a 
result, if the Avro file contains anything other than a sequence of records, it 
fails to read it.

We faced this issue earlier while trying to read primitive types, and encountered 
it again while trying to read an array of records. Below are code examples that 
try to read valid Avro data, along with the resulting stack traces.
{code:java}
spark.read.format("avro").load("avroTypes/randomInt.avro").show
java.lang.RuntimeException: Avro schema cannot be converted to a Spark SQL 
StructType:

"int"

at org.apache.spark.sql.avro.AvroFileFormat.inferSchema(AvroFileFormat.scala:95)
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$6.apply(DataSource.scala:180)
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$6.apply(DataSource.scala:180)
at scala.Option.orElse(Option.scala:289)
at 
org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:179)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
... 49 elided

==

scala> spark.read.format("avro").load("avroTypes/randomEnum.avro").show
java.lang.RuntimeException: Avro schema cannot be converted to a Spark SQL 
StructType:

{
"type" : "enum",
"name" : "Suit",
"symbols" : [ "SPADES", "HEARTS", "DIAMONDS", "CLUBS" ]
}

at org.apache.spark.sql.avro.AvroFileFormat.inferSchema(AvroFileFormat.scala:95)
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$6.apply(DataSource.scala:180)
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$6.apply(DataSource.scala:180)
at scala.Option.orElse(Option.scala:289)
at 
org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:179)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
... 49 elided
{code}
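One possible interim workaround, sketched below under the assumption that the plain Avro Java API is on the classpath, is to rewrite a primitive-typed file into a single-field record so that the Spark Avro reader can map it to a StructType. The Wrapper/value names and the output path are made up for illustration.

{code:java}
import java.io.File
import org.apache.avro.{Schema, SchemaBuilder}
import org.apache.avro.file.{DataFileReader, DataFileWriter}
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}

// Wrap each top-level int value in a one-field record so the file carries a
// record schema, which the Avro file format can convert to a StructType.
val wrapped: Schema = SchemaBuilder.record("Wrapper").fields().requiredInt("value").endRecord()

val reader = new DataFileReader[Integer](
  new File("avroTypes/randomInt.avro"),
  new GenericDatumReader[Integer](Schema.create(Schema.Type.INT)))
val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](wrapped))
writer.create(wrapped, new File("avroTypes/randomIntWrapped.avro"))
while (reader.hasNext) {
  val rec = new GenericData.Record(wrapped)
  rec.put("value", reader.next())
  writer.append(rec)
}
writer.close(); reader.close()

// spark.read.format("avro").load("avroTypes/randomIntWrapped.avro").show()
{code}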



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26726) Synchronize the amount of memory used by the broadcast variable to the UI display

2019-01-31 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26726.

   Resolution: Fixed
Fix Version/s: 2.4.1
   3.0.0

Issue resolved by pull request 23649
[https://github.com/apache/spark/pull/23649]

>   Synchronize the amount of memory used by the broadcast variable to the UI 
> display
> ---
>
> Key: SPARK-26726
> URL: https://issues.apache.org/jira/browse/SPARK-26726
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: hantiantian
>Assignee: hantiantian
>Priority: Major
> Fix For: 3.0.0, 2.4.1
>
>
> The amount of memory used by the broadcast variable is not synchronized to 
> the UI display:
> spark-sql>  select /*+ broadcast(a)*/ a.id,b.id from a join b on a.id = b.id;
> View the app's driver log:
> 2019-01-25 16:45:23,726 INFO org.apache.spark.storage.BlockManagerInfo: Added 
> broadcast_4_piece0 in memory on 10.43.xx.xx:33907 (size: 6.6 KB, free: 2.5 GB)
>  2019-01-25 16:45:23,727 INFO org.apache.spark.storage.BlockManagerInfo: 
> Added broadcast_4_piece0 in memory on 10.43.xx.xx:38399 (size: 6.6 KB, free: 
> 2.5 GB)
>  2019-01-25 16:45:23,745 INFO org.apache.spark.storage.BlockManagerInfo: 
> Added broadcast_3_piece0 in memory on 10.43.xx.xx:33907 (size: 32.1 KB, free: 
> 2.5 GB)
>  2019-01-25 16:45:23,749 INFO org.apache.spark.storage.BlockManagerInfo: 
> Added broadcast_3_piece0 in memory on 10.43.xx.xx:38399 (size: 32.1 KB, free: 
> 2.5 GB)
>  2019-01-25 16:45:23,838 INFO org.apache.spark.storage.BlockManagerInfo: 
> Added broadcast_2_piece0 in memory on 10.43.xx.xx:38399 (size: 147.0 B, free: 
> 2.5 GB)
>  2019-01-25 16:45:23,840 INFO org.apache.spark.storage.BlockManagerInfo: 
> Added broadcast_2_piece0 in memory on 10.43.xx.xx:33907 (size: 147.0 B, free: 
> 2.5 GB)
>  
> The Web UI, however, does not show this memory usage:
> ||Executor ID||Address||Status||RDD Blocks||Storage Memory||Disk 
> Used||Cores||Active Tasks||Failed Tasks||Complete Tasks||Total Tasks||Task 
> Time (GC Time)||Input||Shuffle Read||Shuffle Write||Logs||Thread Dump||
> |0|xxx:38399|Active|0|0.0 B / 2.7 GB|0.0 B|1|0|0|2|2|4 s (0.4 s)|8 B|0.0 
> B|0.0 
> B|[stdout|http://10.43.183.120:18085/logPage/?appId=app-20190125164426-0003=0=stdout]
>  
> [stderr|http://10.43.183.120:18085/logPage/?appId=app-20190125164426-0003=0=stderr]|[Thread
>  Dump|http://10.43.183.121:4040/executors/threadDump/?executorId=0]|
> |driver|xxx:47936|Active|0|0.0 B / 384.1 MB|0.0 B|0|0|0|0|0|0.0 ms (0.0 
> ms)|0.0 B|0.0 B|0.0 B| |[Thread 
> Dump|http://10.43.183.121:4040/executors/threadDump/?executorId=driver]|
> |1|xxx:47414|Active|0|0.0 B / 2.7 GB|0.0 B|1|0|0|0|0|0.0 ms (0.0 ms)|0.0 
> B|0.0 B|0.0 
> B|[stdout|http://10.43.183.121:18085/logPage/?appId=app-20190125164426-0003=1=stdout]
>  
> [stderr|http://10.43.183.121:18085/logPage/?appId=app-20190125164426-0003=1=stderr]|[Thread
>  Dump|http://10.43.183.121:4040/executors/threadDump/?executorId=1]|
> |2|xxx:33907|Active|0|0.0 B / 2.7 GB|0.0 B|1|0|0|2|2|4 s (0.2 s)|4 B|0.0 
> B|0.0 
> B|[stdout|http://10.43.183.122:18085/logPage/?appId=app-20190125164426-0003=2=stdout]
>  
> [stderr|http://10.43.183.122:18085/logPage/?appId=app-20190125164426-0003=2=stderr]|[Thread
>  Dump|http://10.43.183.121:4040/executors/threadDump/?executorId=2]|
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26744) Support schema validation in File Source V2

2019-01-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26744:


Assignee: (was: Apache Spark)

> Support schema validation in File Source V2
> ---
>
> Key: SPARK-26744
> URL: https://issues.apache.org/jira/browse/SPARK-26744
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26744) Support schema validation in File Source V2

2019-01-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26744:


Assignee: Apache Spark

> Support schema validation in File Source V2
> ---
>
> Key: SPARK-26744
> URL: https://issues.apache.org/jira/browse/SPARK-26744
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26744) Support schema validation in File Source V2

2019-01-31 Thread Gengliang Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-26744:
---
Summary: Support schema validation in File Source V2  (was: Create API 
supportDataType in File Source V2 framework)

> Support schema validation in File Source V2
> ---
>
> Key: SPARK-26744
> URL: https://issues.apache.org/jira/browse/SPARK-26744
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26799) Make ANTLR v4 version consistent between Maven and SBT

2019-01-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26799:


Assignee: (was: Apache Spark)

> Make ANTLR v4 version consistent between Maven and SBT
> --
>
> Key: SPARK-26799
> URL: https://issues.apache.org/jira/browse/SPARK-26799
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Priority: Minor
>
> Currently ANTLR v4 versions used by Maven and SBT are slightly different. 
> Maven uses 4.7.1 while SBT uses 4.7.
>  * Maven(pom.xml): 4.7.1
>  * SBT(project/SparkBuild): antlr4Version in Antlr4 := "4.7"
> We should make Maven and SBT use a single version.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26800) JDBC - MySQL nullable option is ignored

2019-01-31 Thread Francisco Miguel Biete Banon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francisco Miguel Biete Banon updated SPARK-26800:
-
Description: 
Spark 2.4.0
MySQL 5.7.21 (docker official MySQL image running with default config)

Writing a dataframe with optionally null fields results in a table with NOT NULL 
attributes in MySQL.

{code:java}
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SaveMode}
import java.sql.Timestamp

val data = Seq[Row](Row(1, null, "Boston"), Row(2, null, "New York"))
val schema = StructType(
StructField("id", IntegerType, true) ::
StructField("when", TimestampType, true) ::
StructField("city", StringType, true) :: Nil)
println(schema.toDDL)

val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)

df.write.mode(SaveMode.Overwrite).jdbc(jdbcUrl, "temp_bug", 
jdbcProperties){code}

Produces

{code}
CREATE TABLE `temp_bug` (
  `id` int(11) DEFAULT NULL,
  `when` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE 
CURRENT_TIMESTAMP,
  `city` text
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
{code}

I would expect "when" column to be defined as nullable.

  was:
Spark 2.4.0
MySQL 5.7.21 (docker official MySQL image running with default config)

Writing a dataframe with optionally null fields results in a table with NOT NULL 
attributes in MySQL.

{code:java}
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SaveMode}
import java.sql.Timestamp

val data = Seq[Row](Row(1, null, "Boston"), Row(2, null, "New York"))
val schema = StructType(
StructField("id", IntegerType) ::
StructField("when", TimestampType, true) ::
StructField("city", StringType) :: Nil)
println(schema.toDDL)

val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)

df.write.mode(SaveMode.Overwrite).jdbc(jdbcUrl, "temp_bug", 
jdbcProperties){code}

Produces

{code}
CREATE TABLE `temp_bug` (
  `id` int(11) DEFAULT NULL,
  `when` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE 
CURRENT_TIMESTAMP,
  `city` text
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
{code}

I would expect "when" column to be defined as nullable.


> JDBC - MySQL nullable option is ignored
> ---
>
> Key: SPARK-26800
> URL: https://issues.apache.org/jira/browse/SPARK-26800
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Francisco Miguel Biete Banon
>Priority: Minor
>
> Spark 2.4.0
> MySQL 5.7.21 (docker official MySQL image running with default config)
> Writing a dataframe with optionally null fields results in a table with NOT 
> NULL attributes in MySQL.
> {code:java}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.{Row, SaveMode}
> import java.sql.Timestamp
> val data = Seq[Row](Row(1, null, "Boston"), Row(2, null, "New York"))
> val schema = StructType(
> StructField("id", IntegerType, true) ::
> StructField("when", TimestampType, true) ::
> StructField("city", StringType, true) :: Nil)
> println(schema.toDDL)
> val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> df.write.mode(SaveMode.Overwrite).jdbc(jdbcUrl, "temp_bug", 
> jdbcProperties){code}
> Produces
> {code}
> CREATE TABLE `temp_bug` (
>   `id` int(11) DEFAULT NULL,
>   `when` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE 
> CURRENT_TIMESTAMP,
>   `city` text
> ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
> {code}
> I would expect "when" column to be defined as nullable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26799) Make ANTLR v4 version consistent between Maven and SBT

2019-01-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26799:


Assignee: Apache Spark

> Make ANTLR v4 version consistent between Maven and SBT
> --
>
> Key: SPARK-26799
> URL: https://issues.apache.org/jira/browse/SPARK-26799
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Assignee: Apache Spark
>Priority: Minor
>
> Currently ANTLR v4 versions used by Maven and SBT are slightly different. 
> Maven uses 4.7.1 while SBT uses 4.7.
>  * Maven(pom.xml): 4.7.1
>  * SBT(project/SparkBuild): antlr4Version in Antlr4 := "4.7"
> We should make Maven and SBT use a single version.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26800) JDBC - MySQL nullable option is ignored

2019-01-31 Thread Francisco Miguel Biete Banon (JIRA)
Francisco Miguel Biete Banon created SPARK-26800:


 Summary: JDBC - MySQL nullable option is ignored
 Key: SPARK-26800
 URL: https://issues.apache.org/jira/browse/SPARK-26800
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Francisco Miguel Biete Banon


Spark 2.4.0
MySQL 5.7.21 (docker official MySQL image running with default config)

Writing a dataframe with optionally null fields results in a table with NOT NULL 
attributes in MySQL.

{code:java}
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SaveMode}
import java.sql.Timestamp

val data = Seq[Row](Row(1, null, "Boston"), Row(2, null, "New York"))
val schema = StructType(
StructField("id", IntegerType) ::
StructField("when", TimestampType, true) ::
StructField("city", StringType) :: Nil)
println(schema.toDDL)

val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)

df.write.mode(SaveMode.Overwrite).jdbc(jdbcUrl, "temp_bug", 
jdbcProperties){code}

Produces

{code}
CREATE TABLE `temp_bug` (
  `id` int(11) DEFAULT NULL,
  `when` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE 
CURRENT_TIMESTAMP,
  `city` text
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
{code}

I would expect "when" column to be defined as nullable.
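A quick sanity check, sketched under the assumption that the df above is in scope, shows the nullability flag is present on the Spark side before the JDBC write; one possibility worth verifying is that the NOT NULL comes from MySQL's implicit TIMESTAMP defaults (explicit_defaults_for_timestamp=OFF) rather than from the DDL Spark emits, since Spark only appends NOT NULL for non-nullable fields.

{code:java}
// Sketch: confirm the nullability Spark holds for each field before writing.
df.schema.fields.foreach { f =>
  println(s"${f.name}: ${f.dataType.simpleString}, nullable=${f.nullable}")
}
// id: int, nullable=true
// when: timestamp, nullable=true
// city: string, nullable=true
{code}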



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26799) Make ANTLR v4 version consistent between Maven and SBT

2019-01-31 Thread Chenxiao Mao (JIRA)
Chenxiao Mao created SPARK-26799:


 Summary: Make ANTLR v4 version consistent between Maven and SBT
 Key: SPARK-26799
 URL: https://issues.apache.org/jira/browse/SPARK-26799
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.4.0
Reporter: Chenxiao Mao


Currently ANTLR v4 versions used by Maven and SBT are slightly different. Maven 
uses 4.7.1 while SBT uses 4.7.
 * Maven(pom.xml): 4.7.1
 * SBT(project/SparkBuild): antlr4Version in Antlr4 := "4.7"

We should make Maven and SBT use a single version.
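A minimal sketch of the SBT side of the change, assuming we align on the 4.7.1 already used by Maven:

{code:java}
// project/SparkBuild.scala (sketch): pin the sbt-antlr4 plugin to the same
// ANTLR version that pom.xml uses.
antlr4Version in Antlr4 := "4.7.1"
{code}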



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26798) HandleNullInputsForUDF should trust nullability

2019-01-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26798:


Assignee: Wenchen Fan  (was: Apache Spark)

> HandleNullInputsForUDF should trust nullability
> ---
>
> Key: SPARK-26798
> URL: https://issues.apache.org/jira/browse/SPARK-26798
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26798) HandleNullInputsForUDF should trust nullability

2019-01-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26798:


Assignee: Apache Spark  (was: Wenchen Fan)

> HandleNullInputsForUDF should trust nullability
> ---
>
> Key: SPARK-26798
> URL: https://issues.apache.org/jira/browse/SPARK-26798
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26798) HandleNullInputsForUDF should trust nullability

2019-01-31 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-26798:
---

 Summary: HandleNullInputsForUDF should trust nullability
 Key: SPARK-26798
 URL: https://issues.apache.org/jira/browse/SPARK-26798
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17998) Reading Parquet files coalesces parts into too few in-memory partitions

2019-01-31 Thread Nicholas Resnick (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757326#comment-16757326
 ] 

Nicholas Resnick commented on SPARK-17998:
--

Going to answer my question: it is in fact a function of the number of cores 
available. The key definition is here: 
[https://github.com/apache/spark/blob/c0632cec04e5b0f3fb3c3f27c21a2d3f3fbb4f7e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L86]

```

Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))

```

The first two values are configurable, and the third is roughly the size of the 
file divided by the number of cores (according to sc.defaultParallelism.) If 
defaultMaxSplitBytes > bytesPerCore > openCostInBytes, then maxSplitBytes will 
equal bytesPerCore, which means we should expect around one partition for each 
core.
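A small model of that calculation with the default values (spark.sql.files.maxPartitionBytes = 128 MB, spark.sql.files.openCostInBytes = 4 MB); the helper name and example numbers below are made up, and the per-file open cost that is also added to totalBytes is ignored for brevity.

{code:java}
// Rough model of the split-size formula quoted above.
val defaultMaxSplitBytes = 128L * 1024 * 1024   // spark.sql.files.maxPartitionBytes
val openCostInBytes      = 4L * 1024 * 1024     // spark.sql.files.openCostInBytes

def maxSplitBytes(totalBytes: Long, defaultParallelism: Int): Long = {
  val bytesPerCore = totalBytes / defaultParallelism
  math.min(defaultMaxSplitBytes, math.max(openCostInBytes, bytesPerCore))
}

// e.g. a 60 MB Parquet folder with defaultParallelism = 6 gives 10 MB splits,
// i.e. roughly one partition per core rather than one per part file.
maxSplitBytes(60L * 1024 * 1024, 6)  // 10485760
{code}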

> Reading Parquet files coalesces parts into too few in-memory partitions
> ---
>
> Key: SPARK-17998
> URL: https://issues.apache.org/jira/browse/SPARK-17998
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: Spark Standalone Cluster (not "local mode")
> Windows 10 and Windows 7
> Python 3.x
>Reporter: Shea Parkes
>Priority: Major
>
> Reading a parquet ~file into a DataFrame is resulting in far too few 
> in-memory partitions.  In prior versions of Spark, the resulting DataFrame 
> would have a number of partitions often equal to the number of parts in the 
> parquet folder.
> Here's a minimal reproducible sample:
> {quote}
> df_first = session.range(start=1, end=1, numPartitions=13)
> assert df_first.rdd.getNumPartitions() == 13
> assert session._sc.defaultParallelism == 6
> path_scrap = r"c:\scratch\scrap.parquet"
> df_first.write.parquet(path_scrap)
> df_second = session.read.parquet(path_scrap)
> print(df_second.rdd.getNumPartitions())
> {quote}
> The above shows only 7 partitions in the DataFrame that was created by 
> reading the Parquet back into memory for me.  Why is it no longer just the 
> number of part files in the Parquet folder?  (Which is 13 in the example 
> above.)
> I'm filing this as a bug because it has gotten so bad that we can't work with 
> the underlying RDD without first repartitioning the DataFrame, which is 
> costly and wasteful.  I really doubt this was the intended effect of moving 
> to Spark 2.0.
> I've tried to research where the number of in-memory partitions is 
> determined, but my Scala skills have proven inadequate.  I'd be happy to dig 
> further if someone could point me in the right direction...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17998) Reading Parquet files coalesces parts into too few in-memory partitions

2019-01-31 Thread Nicholas Resnick (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757326#comment-16757326
 ] 

Nicholas Resnick edited comment on SPARK-17998 at 1/31/19 3:03 PM:
---

Going to answer my question: maxSplitBytes is in fact a function of the number 
of cores available. The key definition is here: 
[https://github.com/apache/spark/blob/c0632cec04e5b0f3fb3c3f27c21a2d3f3fbb4f7e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L86]
 
{code:java}
Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
{code}
The first two values are configurable, and the third is roughly the size of the 
file divided by the number of cores (according to sc.defaultParallelism.) If 
defaultMaxSplitBytes > bytesPerCore > openCostInBytes, then maxSplitBytes will 
equal bytesPerCore, which means we should expect around one partition for each 
core.


was (Author: nresnick):
Going to answer my question: it is in fact a function of the number of cores 
available. The key definition is here: 
[https://github.com/apache/spark/blob/c0632cec04e5b0f3fb3c3f27c21a2d3f3fbb4f7e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L86]

 
{code:java}
Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
{code}
The first two values are configurable, and the third is roughly the size of the 
file divided by the number of cores (according to sc.defaultParallelism.) If 
defaultMaxSplitBytes > bytesPerCore > openCostInBytes, then maxSplitBytes will 
equal bytesPerCore, which means we should expect around one partition for each 
core.

> Reading Parquet files coalesces parts into too few in-memory partitions
> ---
>
> Key: SPARK-17998
> URL: https://issues.apache.org/jira/browse/SPARK-17998
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: Spark Standalone Cluster (not "local mode")
> Windows 10 and Windows 7
> Python 3.x
>Reporter: Shea Parkes
>Priority: Major
>
> Reading a parquet ~file into a DataFrame is resulting in far too few 
> in-memory partitions.  In prior versions of Spark, the resulting DataFrame 
> would have a number of partitions often equal to the number of parts in the 
> parquet folder.
> Here's a minimal reproducible sample:
> {quote}
> df_first = session.range(start=1, end=1, numPartitions=13)
> assert df_first.rdd.getNumPartitions() == 13
> assert session._sc.defaultParallelism == 6
> path_scrap = r"c:\scratch\scrap.parquet"
> df_first.write.parquet(path_scrap)
> df_second = session.read.parquet(path_scrap)
> print(df_second.rdd.getNumPartitions())
> {quote}
> The above shows only 7 partitions in the DataFrame that was created by 
> reading the Parquet back into memory for me.  Why is it no longer just the 
> number of part files in the Parquet folder?  (Which is 13 in the example 
> above.)
> I'm filing this as a bug because it has gotten so bad that we can't work with 
> the underlying RDD without first repartitioning the DataFrame, which is 
> costly and wasteful.  I really doubt this was the intended effect of moving 
> to Spark 2.0.
> I've tried to research where the number of in-memory partitions is 
> determined, but my Scala skills have proven inadequate.  I'd be happy to dig 
> further if someone could point me in the right direction...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26786) Handle to treat escaped newline characters('\r','\n') in spark csv

2019-01-31 Thread vishnuram selvaraj (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757327#comment-16757327
 ] 

vishnuram selvaraj commented on SPARK-26786:


We should be able to treat the escaped newlines very similarly to how we already 
treat the escaped quotes in the data when reading the file into a dataframe.

Just like how Python's CSV reader removes the escape characters in front of both 
quote and newline characters.

> Handle to treat escaped newline characters('\r','\n') in spark csv
> --
>
> Key: SPARK-26786
> URL: https://issues.apache.org/jira/browse/SPARK-26786
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: vishnuram selvaraj
>Priority: Major
>
> There are some systems, like AWS Redshift, which write CSV files by escaping 
> newline characters ('\r', '\n') in addition to escaping the quote characters, 
> if they come as part of the data.
> Redshift documentation: https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html
> Below is their description of the escaping requirements from that link:
> ESCAPE
> For CHAR and VARCHAR columns in delimited unload files, an escape character 
> (\) is placed before every occurrence of the following characters:
>  * Linefeed: {{\n}}
>  * Carriage return: {{\r}}
>  * The delimiter character specified for the unloaded data.
>  * The escape character: \
>  * A quote character: {{"}} or {{'}} (if both ESCAPE and ADDQUOTES are 
> specified in the UNLOAD command).
>  
> *Problem statement:* 
> But the Spark CSV reader doesn't have an option to treat/remove the escape 
> characters in front of the newline characters in the data.
> It would really help if we could add a feature to handle the escaped newline 
> characters through another parameter like (escapeNewline = 'true/false').
> *Example:*
> Below are the details of my test data set up in a file.
>  * The first record in that file has an escaped Windows newline character (\r\n)
>  * The third record in that file has an escaped Unix newline character (\n)
>  * The fifth record in that file has an escaped quote character (")
> the file looks like below in vi editor:
>  
> {code:java}
> "1","this is \^M\
> line1"^M
> "2","this is line2"^M
> "3","this is \
> line3"^M
> "4","this is \" line4"^M
> "5","this is line5"^M{code}
>  
> When I read the file in python's csv module with escape, it is able to remove 
> the added escape characters as you can see below,
>  
> {code:java}
> >>> with open('/tmp/test3.csv','r') as readCsv:
> ... readFile = 
> csv.reader(readCsv,dialect='excel',escapechar='\\',quotechar='"',delimiter=',',doublequote=False)
> ... for row in readFile:
> ... print(row)
> ...
> ['1', 'this is \r\n line1']
> ['2', 'this is line2']
> ['3', 'this is \n line3']
> ['4', 'this is " line4']
> ['5', 'this is line5']
> {code}
> But if I read the same file with the Spark CSV reader, the escape characters 
> in front of the newline characters are not removed, although the escape before 
> the quote (") is removed.
> {code:java}
> >>> redDf=spark.read.csv(path='file:///tmp/test3.csv',header='false',sep=',',quote='"',escape='\\',multiLine='true',ignoreLeadingWhiteSpace='true',ignoreTrailingWhiteSpace='true',mode='FAILFAST',inferSchema='false')
> >>> redDf.show()
> +---+--+
> |_c0| _c1|
> +---+--+
> | 1|this is \
> line1|
> | 2| this is line2|
> | 3| this is \
> line3|
> | 4| this is " line4|
> | 5| this is line5|
> +---+--+
> {code}
>  *Expected result:*
> {code:java}
> +---+--+
> |_c0| _c1|
> +---+--+
> | 1|this is 
> line1|
> | 2| this is line2|
> | 3| this is 
> line3|
> | 4| this is " line4|
> | 5| this is line5|
> +---+--+
> {code}
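In the meantime, a rough post-processing sketch (assuming the redDf dataframe was read with multiLine and escape as shown above) is to strip a backslash that immediately precedes a carriage return or line feed in each column; this is only a workaround, not the proposed escapeNewline option.

{code:java}
import org.apache.spark.sql.functions._

// Drop a literal backslash that sits directly before \r or \n in every column.
val cleaned = redDf.columns.foldLeft(redDf) { (df, c) =>
  df.withColumn(c, regexp_replace(col(c), "\\\\(\r|\n)", "$1"))
}
cleaned.show()
{code}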



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17998) Reading Parquet files coalesces parts into too few in-memory partitions

2019-01-31 Thread Nicholas Resnick (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757326#comment-16757326
 ] 

Nicholas Resnick edited comment on SPARK-17998 at 1/31/19 3:02 PM:
---

Going to answer my question: it is in fact a function of the number of cores 
available. The key definition is here: 
[https://github.com/apache/spark/blob/c0632cec04e5b0f3fb3c3f27c21a2d3f3fbb4f7e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L86]

 
{code:java}
Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
{code}
The first two values are configurable, and the third is roughly the size of the 
file divided by the number of cores (according to sc.defaultParallelism.) If 
defaultMaxSplitBytes > bytesPerCore > openCostInBytes, then maxSplitBytes will 
equal bytesPerCore, which means we should expect around one partition for each 
core.


was (Author: nresnick):
Going to answer my question: it is in fact a function of the number of cores 
available. The key definition is here: 
[https://github.com/apache/spark/blob/c0632cec04e5b0f3fb3c3f27c21a2d3f3fbb4f7e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L86]

```

Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))

```

The first two values are configurable, and the third is roughly the size of the 
file divided by the number of cores (according to sc.defaultParallelism.) If 
defaultMaxSplitBytes > bytesPerCore > openCostInBytes, then maxSplitBytes will 
equal bytesPerCore, which means we should expect around one partition for each 
core.

> Reading Parquet files coalesces parts into too few in-memory partitions
> ---
>
> Key: SPARK-17998
> URL: https://issues.apache.org/jira/browse/SPARK-17998
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: Spark Standalone Cluster (not "local mode")
> Windows 10 and Windows 7
> Python 3.x
>Reporter: Shea Parkes
>Priority: Major
>
> Reading a parquet ~file into a DataFrame is resulting in far too few 
> in-memory partitions.  In prior versions of Spark, the resulting DataFrame 
> would have a number of partitions often equal to the number of parts in the 
> parquet folder.
> Here's a minimal reproducible sample:
> {quote}
> df_first = session.range(start=1, end=1, numPartitions=13)
> assert df_first.rdd.getNumPartitions() == 13
> assert session._sc.defaultParallelism == 6
> path_scrap = r"c:\scratch\scrap.parquet"
> df_first.write.parquet(path_scrap)
> df_second = session.read.parquet(path_scrap)
> print(df_second.rdd.getNumPartitions())
> {quote}
> The above shows only 7 partitions in the DataFrame that was created by 
> reading the Parquet back into memory for me.  Why is it no longer just the 
> number of part files in the Parquet folder?  (Which is 13 in the example 
> above.)
> I'm filing this as a bug because it has gotten so bad that we can't work with 
> the underlying RDD without first repartitioning the DataFrame, which is 
> costly and wasteful.  I really doubt this was the intended effect of moving 
> to Spark 2.0.
> I've tried to research where the number of in-memory partitions is 
> determined, but my Scala skills have proven inadequate.  I'd be happy to dig 
> further if someone could point me in the right direction...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25153) Improve error messages for columns with dots/periods

2019-01-31 Thread Mikhail (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757286#comment-16757286
 ] 

Mikhail commented on SPARK-25153:
-

Hello [~blavigne]
Are you still working on this?

Hello [~holdenk]
Could you please provide the example of what you mean (current and expected 
behavior)? Thanks in advance
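For reference, a minimal illustration of the situation described in the summary (column name and data are made up):

{code:java}
// In spark-shell; otherwise import spark.implicits._ first.
// A column whose name literally contains a dot.
val df = Seq((1, "a")).toDF("id", "my.col")

// df.select("my.col")          // fails: the dot is parsed as nested-field access
df.select("`my.col`").show()    // backticks make the whole string one column name
{code}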

> Improve error messages for columns with dots/periods
> 
>
> Key: SPARK-25153
> URL: https://issues.apache.org/jira/browse/SPARK-25153
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Trivial
>  Labels: starter
>
> When we fail to resolve a column name with a dot in it, and the column name 
> is present as a string literal, the error message could mention using 
> backticks to have the string treated as a literal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26673) File source V2 write: create framework and migrate ORC to it

2019-01-31 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-26673:
---

Assignee: Gengliang Wang

> File source V2 write: create framework and migrate ORC to it
> 
>
> Key: SPARK-26673
> URL: https://issues.apache.org/jira/browse/SPARK-26673
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26673) File source V2 write: create framework and migrate ORC to it

2019-01-31 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26673.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23601
[https://github.com/apache/spark/pull/23601]

> File source V2 write: create framework and migrate ORC to it
> 
>
> Key: SPARK-26673
> URL: https://issues.apache.org/jira/browse/SPARK-26673
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26797) Start using the new logical types API of Parquet 1.11.0 instead of the deprecated one

2019-01-31 Thread Zoltan Ivanfi (JIRA)
Zoltan Ivanfi created SPARK-26797:
-

 Summary: Start using the new logical types API of Parquet 1.11.0 
instead of the deprecated one
 Key: SPARK-26797
 URL: https://issues.apache.org/jira/browse/SPARK-26797
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Zoltan Ivanfi


The 1.11.0 release of parquet-mr will deprecate its logical type API in favour 
of a newly introduced one. The new API also introduces new subtypes for 
different timestamp semantics, support for which should be added to Spark in 
order to read those types correctly.

At this point only a release candidate of parquet-mr 1.11.0 is available, but 
that already allows implementing and reviewing this change.
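A tiny sketch of what the switch looks like on the parquet-mr side, assuming a 1.11.0 release candidate on the classpath; the field name is illustrative.

{code:java}
import org.apache.parquet.schema.{LogicalTypeAnnotation, Types}
import org.apache.parquet.schema.LogicalTypeAnnotation.TimeUnit
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName

// Deprecated style: .as(OriginalType.TIMESTAMP_MICROS), which cannot express the
// timestamp semantics. New style: a logical type annotation that also carries
// the isAdjustedToUTC flag.
val tsField = Types.required(PrimitiveTypeName.INT64)
  .as(LogicalTypeAnnotation.timestampType(true, TimeUnit.MICROS))
  .named("event_time")
{code}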



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26345) Parquet support Column indexes

2019-01-31 Thread Zoltan Ivanfi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757209#comment-16757209
 ] 

Zoltan Ivanfi edited comment on SPARK-26345 at 1/31/19 1:02 PM:


Please note that column indexes will automatically get utilized if 
[spark.sql.parquet.enableVectorizedReader|https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-vectorized-parquet-reader.html#spark.sql.parquet.enableVectorizedReader]
 = false. If spark.sql.parquet.enableVectorizedReader = true, on the other hand 
(which is the default), then column indexes could only be utilized by 
duplicating the internal logic of parquet-mr, which would be a disproportionate 
effort. We, the developers of the column index feature in parquet-mr, did not 
expect Spark to make this huge investment, and we would like to provide a 
vectorized API instead in a future release of parquet-mr.
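For anyone who wants the page/row-group skipping from column indexes today, a sketch of the fallback mentioned above (path and predicate are illustrative):

{code:java}
// Column indexes are applied inside parquet-mr's record reader, so disable the
// vectorized reader for the selective query and restore it afterwards.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.read.parquet("/data/events").where("id = 42").count()
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")
{code}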


was (Author: zi):
Please note that column indexes will automatically get utilized if 
[spark.sql.parquet.enableVectorizedReader|https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-vectorized-parquet-reader.html#spark.sql.parquet.enableVectorizedReader]
 = false. If spark.sql.parquet.enableVectorizedReader = true, on the other hand 
(which is the default), then column indexes could only be utilized by 
duplicating the internal logic of parquet-mr, which would be a disproportionate 
effort. We, the developers of the column index feature, did not expect Spark to 
make this huge investment, and we would like to provide a vectorized API 
instead in a future release.

> Parquet support Column indexes
> --
>
> Key: SPARK-26345
> URL: https://issues.apache.org/jira/browse/SPARK-26345
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Parquet 1.11.0 supports column indexing. Spark can support this feature for 
> good read performance.
> More details:
> https://issues.apache.org/jira/browse/PARQUET-1201



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2019-01-31 Thread Zoltan Ivanfi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757209#comment-16757209
 ] 

Zoltan Ivanfi commented on SPARK-26345:
---

Please note that column indexes will automatically get utilized if 
[spark.sql.parquet.enableVectorizedReader|https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-vectorized-parquet-reader.html#spark.sql.parquet.enableVectorizedReader]
 = false. If spark.sql.parquet.enableVectorizedReader = true, on the other hand 
(which is the default), then column indexes could only be utilized by 
duplicating the internal logic of parquet-mr, which would be a disproportionate 
effort. We, the developers of the column index feature, did not expect Spark to 
make this huge investment, and we would like to provide a vectorized API 
instead in a future release.

> Parquet support Column indexes
> --
>
> Key: SPARK-26345
> URL: https://issues.apache.org/jira/browse/SPARK-26345
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Parquet 1.11.0 supports column indexing. Spark can support this feature for 
> good read performance.
> More details:
> https://issues.apache.org/jira/browse/PARQUET-1201



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25136) unable to use HDFS checkpoint directories after driver restart

2019-01-31 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757199#comment-16757199
 ] 

Gabor Somogyi commented on SPARK-25136:
---

[~kerbylane] did you have time to check it?

> unable to use HDFS checkpoint directories after driver restart
> --
>
> Key: SPARK-25136
> URL: https://issues.apache.org/jira/browse/SPARK-25136
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Robert Reid
>Priority: Major
>
> I have been attempting to work around the issue first discussed at the end of 
> SPARK-20894. The problem that we are encountering is the inability of the 
> Spark Driver to process additional data after restart. Upon restart it 
> reprocesses batches from the initial run. But when processing the first batch 
> it crashes because of missing checkpoint files.
> We have a structured streaming job running on a Spark Standalone cluster. We 
> restart the Spark Driver to renew the Kerberos token in use to access HDFS. 
> Excerpts of log lines from a run are included below. In short, when the new 
> driver starts it reports that it is going to use a new checkpoint directory 
> in HDFS. Workers report that they have loaded a StateStoreProviders that 
> matches the directory. But then the worker reports that it cannot read the 
> delta files. This causes the driver to crash.
> The HDFS directories are present but empty. Further, the directory 
> permissions for the original and new checkpoint directories are the same. The 
> worker never crashes.
> As mentioned in SPARK-20894, deleting the _spark_metadata directory makes 
> subsequent restarts succeed.
> Here is a timeline of log records from a recent run.
> A new run began at 00:29:21. These entries from a worker log look good.
> {{18/08/16 00:30:21 INFO HDFSBackedStateStoreProvider: Retrieved version 0 of 
> HDFSStateStoreProvider[id = (op=0,part=0),dir = 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0]
>  for update}}
> {{18/08/16 00:30:23 INFO HDFSBackedStateStoreProvider: Committed version 1 
> for 
> HDFSStateStore[id=(op=0,part=0),dir=hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0]
>  to file 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0/1.delta}}
> As the shutdown is occurring the worker reports
> {{18/08/16 00:39:11 INFO HDFSBackedStateStoreProvider: Aborted version 29 for 
> HDFSStateStore[id=(op=0,part=0),dir=hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0]}}
> The restart began at 00:39:38.
> Driver log entries
> {{18/08/16 00:39:51 INFO MicroBatchExecution: Starting [id = 
> e188d15f-e26a-48fd-9ce6-8c57ce53c2c1, runId = 
> b7ee0163-47db-4392-ab66-94d36ce63074]. Use 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040
>  to store the query checkpoint.}}
> {{18/08/16 00:40:26 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 10, 
> 10.251.104.164, executor 3): java.lang.IllegalStateException: Error reading 
> delta file 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta
>  of HDFSStateStoreProvider[id = (op=0,part=0),dir = 
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0]:
>  
> hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta
>  does not exist}}
> {{Caused by: 
> org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File 
> does not exist: 
> /projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta}}
>  
> Worker log entries
> {{18/08/16 00:40:26 INFO StateStore: Reported that the loaded instance 
> 
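
For context, here is a minimal Scala sketch (not taken from the report above) of 
the mechanism involved: a stateful structured streaming query bound to a fixed 
checkpointLocation, under which the state store writes the delta files named in 
the logs (roughly <checkpointLocation>/state/<operator>/<partition>/<version>.delta). 
The source, sink and paths are hypothetical.

{code:scala}
import org.apache.spark.sql.SparkSession

object CheckpointedQuerySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("checkpointed-query").getOrCreate()

    // A stateful aggregation, so every micro-batch commits state store deltas.
    val counts = spark.readStream
      .format("rate")                  // stand-in source for the sketch
      .load()
      .groupBy("value")
      .count()

    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      // Reusing the same checkpoint directory across driver restarts is what
      // lets a restarted query find the earlier delta files; the report above
      // describes a restart that ended up pointing at a new, empty directory.
      .option("checkpointLocation", "hdfs:///checkpoints/my-query")  // hypothetical
      .start()

    query.awaitTermination()
  }
}
{code}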

[jira] [Updated] (SPARK-26795) Retry remote fileSegmentManagedBuffer when creating inputStream failed during shuffle read phase

2019-01-31 Thread feiwang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-26795:

Labels:   (was: shuffle)

> Retry remote fileSegmentManagedBuffer when creating inputStream failed during 
> shuffle read phase
> 
>
> Key: SPARK-26795
> URL: https://issues.apache.org/jira/browse/SPARK-26795
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: feiwang
>Priority: Major
> Fix For: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>
>
> There is a parameter spark.maxRemoteBlockSizeFetchToMem, which means a remote 
> block will be fetched to disk when its size in bytes is above this threshold.
> So during the shuffle read phase, a managedBuffer that throws an IOException 
> may be a remotely downloaded FileSegment and should be retried instead of 
> throwing FetchFailed directly.
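
As a point of reference, a minimal Scala sketch (not from the issue itself) of 
setting the threshold mentioned above; the value chosen is hypothetical. Remote 
shuffle blocks larger than the threshold are fetched to disk, which is the 
FileSegment code path the proposed retry targets.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object FetchToDiskConfigSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("fetch-to-disk-config-sketch")
      // Hypothetical setting: remote shuffle blocks above 200 MB go to disk.
      .config("spark.maxRemoteBlockSizeFetchToMem", "200m")
      .getOrCreate()

    // Any wide transformation forces a shuffle read that may hit this path.
    spark.range(0L, 1000000L)
      .groupBy((col("id") % 10).as("bucket"))
      .count()
      .show()

    spark.stop()
  }
}
{code}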



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26795) Retry remote fileSegmentManagedBuffer when creating inputStream failed during shuffle read phase

2019-01-31 Thread feiwang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-26795:

Target Version/s: 2.4.0, 2.3.2  (was: 2.3.2, 2.4.0)
  Labels: shuffle  (was: )

> Retry remote fileSegmentManagedBuffer when creating inputStream failed during 
> shuffle read phase
> 
>
> Key: SPARK-26795
> URL: https://issues.apache.org/jira/browse/SPARK-26795
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: feiwang
>Priority: Major
>  Labels: shuffle
> Fix For: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>
>
> There is a parameter spark.maxRemoteBlockSizeFetchToMem, which means a remote 
> block will be fetched to disk when its size in bytes is above this threshold.
> So during the shuffle read phase, a managedBuffer that throws an IOException 
> may be a remotely downloaded FileSegment and should be retried instead of 
> throwing FetchFailed directly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26795) Retry remote fileSegmentManagedBuffer when creating inputStream failed during shuffle read phase

2019-01-31 Thread feiwang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-26795:

Description: 
There is a parameter *spark.maxRemoteBlockSizeFetchToMem*, which means a remote 
block will be fetched to disk when its size in bytes is above this threshold.

So during the shuffle read phase, a managedBuffer that throws an IOException may 
be a remotely downloaded FileSegment and should be retried instead of throwing 
FetchFailed directly.

  was:
There is a parameter `spark.maxRemoteBlockSizeFetchToMem`, which means a remote 
block will be fetched to disk when its size in bytes is above this threshold.

So during the shuffle read phase, a managedBuffer that throws an IOException may 
be a remotely downloaded FileSegment and should be retried instead of throwing 
FetchFailed directly.


> Retry remote fileSegmentManagedBuffer when creating inputStream failed during 
> shuffle read phase
> 
>
> Key: SPARK-26795
> URL: https://issues.apache.org/jira/browse/SPARK-26795
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: feiwang
>Priority: Major
> Fix For: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>
>
> There is a parameter *spark.maxRemoteBlockSizeFetchToMem*, which means a 
> remote block will be fetched to disk when its size in bytes is above this 
> threshold.
> So during the shuffle read phase, a managedBuffer that throws an IOException 
> may be a remotely downloaded FileSegment and should be retried instead of 
> throwing FetchFailed directly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26795) Retry remote fileSegmentManagedBuffer when creating inputStream failed during shuffle read phase

2019-01-31 Thread feiwang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-26795:

Description: 
There is a parameter `spark.maxRemoteBlockSizeFetchToMem`, which means a remote 
block will be fetched to disk when its size in bytes is above this threshold.

So during the shuffle read phase, a managedBuffer that throws an IOException may 
be a remotely downloaded FileSegment and should be retried instead of throwing 
FetchFailed directly.

  was:
There is a parameter spark.maxRemoteBlockSizeFetchToMem, which means a remote 
block will be fetched to disk when its size in bytes is above this threshold.

So during the shuffle read phase, a managedBuffer that throws an IOException may 
be a remotely downloaded FileSegment and should be retried instead of throwing 
FetchFailed directly.


> Retry remote fileSegmentManagedBuffer when creating inputStream failed during 
> shuffle read phase
> 
>
> Key: SPARK-26795
> URL: https://issues.apache.org/jira/browse/SPARK-26795
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: feiwang
>Priority: Major
> Fix For: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>
>
> There is a parameter `spark.maxRemoteBlockSizeFetchToMem`, which means a 
> remote block will be fetched to disk when its size in bytes is above this 
> threshold.
> So during the shuffle read phase, a managedBuffer that throws an IOException 
> may be a remotely downloaded FileSegment and should be retried instead of 
> throwing FetchFailed directly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26795) Retry remote fileSegmentManagedBuffer when creating inputStream failed during shuffle read phase

2019-01-31 Thread feiwang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-26795:

Description: 
There is a parameter spark.maxRemoteBlockSizeFetchToMem, which means a remote 
block will be fetched to disk when its size in bytes is above this threshold.

So during the shuffle read phase, a managedBuffer that throws an IOException may 
be a remotely downloaded FileSegment and should be retried instead of throwing 
FetchFailed directly.

  was:
There is a parameter *spark.maxRemoteBlockSizeFetchToMem*, which means a remote 
block will be fetched to disk when its size in bytes is above this threshold.

So during the shuffle read phase, a managedBuffer that throws an IOException may 
be a remotely downloaded FileSegment and should be retried instead of throwing 
FetchFailed directly.


> Retry remote fileSegmentManagedBuffer when creating inputStream failed during 
> shuffle read phase
> 
>
> Key: SPARK-26795
> URL: https://issues.apache.org/jira/browse/SPARK-26795
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: feiwang
>Priority: Major
> Fix For: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>
>
> There is a parameter spark.maxRemoteBlockSizeFetchToMem, which means a remote 
> block will be fetched to disk when its size in bytes is above this threshold.
> So during the shuffle read phase, a managedBuffer that throws an IOException 
> may be a remotely downloaded FileSegment and should be retried instead of 
> throwing FetchFailed directly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26795) Retry remote fileSegmentManagedBuffer when creating inputStream failed during shuffle read phase

2019-01-31 Thread feiwang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-26795:

Description: 
There is a parameter spark.maxRemoteBlockSizeFetchToMem, which means a remote 
block will be fetched to disk when its size in bytes is above this threshold.

So during the shuffle read phase, a managedBuffer that throws an IOException may 
be a remotely downloaded FileSegment and should be retried instead of throwing 
FetchFailed directly.

  was:During the shuffle read phase, a managedBuffer that throws an IOException 
may be a remotely downloaded FileSegment and should be retried instead of 
throwing FetchFailed directly.


> Retry remote fileSegmentManagedBuffer when creating inputStream failed during 
> shuffle read phase
> 
>
> Key: SPARK-26795
> URL: https://issues.apache.org/jira/browse/SPARK-26795
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: feiwang
>Priority: Major
> Fix For: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>
>
> There is a parameter spark.maxRemoteBlockSizeFetchToMem, which means a remote 
> block will be fetched to disk when its size in bytes is above this threshold.
> So during the shuffle read phase, a managedBuffer that throws an IOException 
> may be a remotely downloaded FileSegment and should be retried instead of 
> throwing FetchFailed directly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24023) Built-in SQL Functions improvement in SparkR

2019-01-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24023:


Assignee: Huaxin Gao

> Built-in SQL Functions improvement in SparkR
> 
>
> Key: SPARK-24023
> URL: https://issues.apache.org/jira/browse/SPARK-24023
> Project: Spark
>  Issue Type: Umbrella
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Huaxin Gao
>Priority: Major
>
> This JIRA targets adding the R functions corresponding to SPARK-23899. We have 
> usually been adding a function with Scala alone or with both Scala and Python 
> APIs.
> It could get messy if there are duplicates for the R side in SPARK-23899. A 
> followup for each JIRA might be possible, but that again would be messy to 
> manage.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24023) Built-in SQL Functions improvement in SparkR

2019-01-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24023.
--
Resolution: Done

> Built-in SQL Functions improvement in SparkR
> 
>
> Key: SPARK-24023
> URL: https://issues.apache.org/jira/browse/SPARK-24023
> Project: Spark
>  Issue Type: Umbrella
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Huaxin Gao
>Priority: Major
>
> This JIRA targets adding the R functions corresponding to SPARK-23899. We have 
> usually been adding a function with Scala alone or with both Scala and Python 
> APIs.
> It could get messy if there are duplicates for the R side in SPARK-23899. A 
> followup for each JIRA might be possible, but that again would be messy to 
> manage.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24779) Add map_concat / map_from_entries / an option in months_between UDF to disable rounding-off

2019-01-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24779:


Assignee: Huaxin Gao

> Add map_concat  / map_from_entries / an option in months_between UDF to 
> disable rounding-off
> 
>
> Key: SPARK-24779
> URL: https://issues.apache.org/jira/browse/SPARK-24779
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
>
> Add R versions of 
>  * map_concat   -SPARK-23936-
>  * map_from_entries   SPARK-23934
>  * an option in months_between UDF to disable rounding-off  -SPARK-23902-
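
For reference, a minimal Scala sketch (not part of the issue) of the existing 
Scala counterparts that the R wrappers are meant to mirror; the column names and 
sample data below are made up.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, map_concat, map_from_entries, months_between, to_date}

object SparkRCounterpartsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sparkr-counterparts-sketch").getOrCreate()
    import spark.implicits._

    // map_concat: merge two map columns into one.
    Seq((Map("a" -> 1), Map("b" -> 2))).toDF("m1", "m2")
      .select(map_concat(col("m1"), col("m2")).as("merged"))
      .show(truncate = false)

    // map_from_entries: build a map from an array of (key, value) structs.
    Seq(Seq(("a", 1), ("b", 2))).toDF("kv")
      .select(map_from_entries(col("kv")).as("m"))
      .show(truncate = false)

    // months_between with roundOff = false keeps the unrounded result.
    Seq(("2019-01-31", "2018-11-30")).toDF("d_end", "d_start")
      .select(months_between(to_date(col("d_end")), to_date(col("d_start")), roundOff = false).as("months"))
      .show(truncate = false)

    spark.stop()
  }
}
{code}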



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24779) Add map_concat / map_from_entries / an option in months_between UDF to disable rounding-off

2019-01-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24779:
-
Summary: Add map_concat  / map_from_entries / an option in months_between 
UDF to disable rounding-off  (was: Add sequence / map_concat  / 
map_from_entries / an option in months_between UDF to disable rounding-off)

> Add map_concat  / map_from_entries / an option in months_between UDF to 
> disable rounding-off
> 
>
> Key: SPARK-24779
> URL: https://issues.apache.org/jira/browse/SPARK-24779
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add R versions of 
>  * map_concat   -SPARK-23936-
>  * map_from_entries   SPARK-23934
>  * an option in months_between UDF to disable rounding-off  -SPARK-23902-



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24779) Add map_concat / map_from_entries / an option in months_between UDF to disable rounding-off

2019-01-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24779.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 21835
[https://github.com/apache/spark/pull/21835]

> Add map_concat  / map_from_entries / an option in months_between UDF to 
> disable rounding-off
> 
>
> Key: SPARK-24779
> URL: https://issues.apache.org/jira/browse/SPARK-24779
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>
> Add R versions of 
>  * map_concat   -SPARK-23936-
>  * map_from_entries   SPARK-23934
>  * an option in months_between UDF to disable rounding-off  -SPARK-23902-



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24779) Add sequence / map_concat / map_from_entries / an option in months_between UDF to disable rounding-off

2019-01-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24779:
-
Description: 
Add R versions of 
 * map_concat   -SPARK-23936-
 * map_from_entries   SPARK-23934
 * an option in months_between UDF to disable rounding-off  -SPARK-23902-

  was:
Add R versions of 
 * sequence -SPARK-23927-
 * map_concat   -SPARK-23936-
 * map_from_entries   SPARK-23934
 * an option in months_between UDF to disable rounding-off  -SPARK-23902-


> Add sequence / map_concat  / map_from_entries / an option in months_between 
> UDF to disable rounding-off
> ---
>
> Key: SPARK-24779
> URL: https://issues.apache.org/jira/browse/SPARK-24779
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add R versions of 
>  * map_concat   -SPARK-23936-
>  * map_from_entries   SPARK-23934
>  * an option in months_between UDF to disable rounding-off  -SPARK-23902-



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26796) Testcases failing with "org.apache.hadoop.fs.ChecksumException" error

2019-01-31 Thread Anuja Jakhade (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anuja Jakhade updated SPARK-26796:
--
Environment: 
Ubuntu 16.04 

Java Version

openjdk version "1.8.0_192"
OpenJDK Runtime Environment (build 1.8.0_192-b12_openj9)
Eclipse OpenJ9 VM (build openj9-0.11.0, JRE 1.8.0 Linux s390x-64-Bit Compressed 
References 20181107_80 (JIT enabled, AOT enabled)
OpenJ9 - 090ff9dcd
OMR - ea548a66
JCL - b5a3affe73 based on jdk8u192-b12)

 

Hadoop  Version

Hadoop 2.7.1
 Subversion Unknown -r Unknown
 Compiled by test on 2019-01-29T09:09Z
 Compiled with protoc 2.5.0
 From source with checksum 5e94a235f9a71834e2eb73fb36ee873f
 This command was run using 
/home/test/hadoop-release-2.7.1/hadoop-dist/target/hadoop-2.7.1/share/hadoop/common/hadoop-common-2.7.1.jar

 

 

 

  was:
Ubuntu 16.04 

Java Version

openjdk version "1.8.0_191"
 OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-0ubuntu0.16.04.1-b12)
 OpenJDK 64-Bit Zero VM (build 25.191-b12, interpreted mode)

 

Hadoop  Version

Hadoop 2.7.1
 Subversion Unknown -r Unknown
 Compiled by test on 2019-01-29T09:09Z
 Compiled with protoc 2.5.0
 From source with checksum 5e94a235f9a71834e2eb73fb36ee873f
 This command was run using 
/home/test/hadoop-release-2.7.1/hadoop-dist/target/hadoop-2.7.1/share/hadoop/common/hadoop-common-2.7.1.jar

 

 

 


> Testcases failing with "org.apache.hadoop.fs.ChecksumException" error
> -
>
> Key: SPARK-26796
> URL: https://issues.apache.org/jira/browse/SPARK-26796
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.2, 2.4.0
> Environment: Ubuntu 16.04 
> Java Version
> openjdk version "1.8.0_192"
> OpenJDK Runtime Environment (build 1.8.0_192-b12_openj9)
> Eclipse OpenJ9 VM (build openj9-0.11.0, JRE 1.8.0 Linux s390x-64-Bit 
> Compressed References 20181107_80 (JIT enabled, AOT enabled)
> OpenJ9 - 090ff9dcd
> OMR - ea548a66
> JCL - b5a3affe73 based on jdk8u192-b12)
>  
> Hadoop  Version
> Hadoop 2.7.1
>  Subversion Unknown -r Unknown
>  Compiled by test on 2019-01-29T09:09Z
>  Compiled with protoc 2.5.0
>  From source with checksum 5e94a235f9a71834e2eb73fb36ee873f
>  This command was run using 
> /home/test/hadoop-release-2.7.1/hadoop-dist/target/hadoop-2.7.1/share/hadoop/common/hadoop-common-2.7.1.jar
>  
>  
>  
>Reporter: Anuja Jakhade
>Priority: Major
>
> Observing test case failures due to a checksum error.
> Below is the error log:
> [ERROR] checkpointAndComputation(test.org.apache.spark.JavaAPISuite) Time 
> elapsed: 1.232 s <<< ERROR!
> org.apache.spark.SparkException: 
> Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost, executor 
> driver): org.apache.hadoop.fs.ChecksumException: Checksum error: 
> file:/home/test/spark/core/target/tmp/1548319689411-0/fd0ba388-539c-49aa-bf76-e7d50aa2d1fc/rdd-0/part-0
>  at 0 exp: 222499834 got: 1400184476
>  at org.apache.hadoop.fs.FSInputChecker.verifySums(FSInputChecker.java:323)
>  at 
> org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:279)
>  at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:214)
>  at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:232)
>  at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:196)
>  at java.io.DataInputStream.read(DataInputStream.java:149)
>  at 
> java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2769)
>  at 
> java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2785)
>  at 
> java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3262)
>  at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:968)
>  at java.io.ObjectInputStream.<init>(ObjectInputStream.java:390)
>  at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.<init>(JavaSerializer.scala:63)
>  at 
> org.apache.spark.serializer.JavaDeserializationStream.<init>(JavaSerializer.scala:63)
>  at 
> org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:122)
>  at 
> org.apache.spark.rdd.ReliableCheckpointRDD$.readCheckpointFile(ReliableCheckpointRDD.scala:300)
>  at 
> org.apache.spark.rdd.ReliableCheckpointRDD.compute(ReliableCheckpointRDD.scala:100)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:322)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:109)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>  at 
> 
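
For reference, a minimal Scala sketch (not from the original test suite) of the 
same checkpoint round-trip that the failing checkpointAndComputation test 
exercises; reading the checkpointed partitions back goes through 
ReliableCheckpointRDD, which is where the ChecksumException above surfaces. The 
checkpoint directory is hypothetical.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointRoundTripSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("checkpoint-round-trip").setMaster("local[2]")
    val sc = new SparkContext(conf)
    sc.setCheckpointDir("/tmp/checkpoint-round-trip")   // hypothetical directory

    val rdd = sc.parallelize(1 to 1000, numSlices = 4)
    rdd.checkpoint()   // materialized by the next action
    rdd.count()        // writes part files under the checkpoint directory

    // Later actions read the checkpointed partitions back from disk; in the
    // failing environment this read path raised the checksum mismatch.
    println(rdd.collect().length)

    sc.stop()
  }
}
{code}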
