[jira] [Commented] (SPARK-26792) Apply custom log URL to Spark UI
[ https://issues.apache.org/jira/browse/SPARK-26792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758037#comment-16758037 ] Jungtaek Lim commented on SPARK-26792: -- Hi [~Thatboix45], it looks like you voted for this issue: do you have an actual use case for it? If your case is different from pointing the YARN log URL to the log directory, we may want to apply this instead of changing the default value of the YARN executor log URL.

> Apply custom log URL to Spark UI
> Key: SPARK-26792
> URL: https://issues.apache.org/jira/browse/SPARK-26792
> Project: Spark
> Issue Type: Improvement
> Components: Web UI
> Affects Versions: 3.0.0
> Reporter: Jungtaek Lim
> Priority: Major
>
> SPARK-23155 enables SHS to set up custom log URLs for incomplete / completed apps.
> While getting reviews for SPARK-23155, I got two comments suggesting that applying custom log URLs to the UI would help. Quoting these comments here:
> https://github.com/apache/spark/pull/23260#issuecomment-456827963
> {quote}
> Sorry I haven't had time to look through all the code so this might be a separate jira, but one thing I thought of here is it would be really nice not to have specifically stderr/stdout. Users can specify any log4j.properties, and some tools like Oozie by default end up using the Hadoop log4j rather than the Spark log4j, so the files aren't necessarily the same. Also users can put other log files in, so it would be nice to have links to those from the UI. It seems simpler if we just had a link to the directory and it read the files within there. Other things in Hadoop do it this way, but I'm not sure if that works well for other resource managers, any thoughts on that? As long as this doesn't prevent the above I can file a separate jira for it.
> {quote}
> https://github.com/apache/spark/pull/23260#issuecomment-456904716
> {quote}
> Hi Tom, +1: singling out stdout and stderr is definitely an annoyance. We typically configure Spark jobs to write the GC log and dump heap on OOM using , and/or we use the rolling file appender to deal with large logs during debugging. So linking the YARN container log overview page would make much more sense for us. We work around this with a custom submit process that logs all important URLs on the submit-side log.
> {quote}

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
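For readers following the thread: the SPARK-23155 work referenced above landed as a Spark History Server configuration. As a hedged illustration only (the option and pattern names below are taken from my reading of the Spark 3.0 monitoring docs and should be verified there), a custom executor log URL pointing at a YARN NodeManager's container log page might look like:

```properties
# spark-defaults.conf (illustrative sketch; verify the option name and the
# available {{...}} pattern variables against the Spark 3.0 monitoring docs)
spark.history.custom.executor.log.url=http://{{NM_HOST}}:{{NM_HTTP_PORT}}/node/containerlogs/{{CONTAINER_ID}}/{{USER}}/{{FILE_NAME}}
```

Dropping the trailing `{{FILE_NAME}}` segment is what the commenters ask for here: a link to the container log directory rather than to specific stdout/stderr files.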
[jira] [Closed] (SPARK-24404) Increase currentEpoch when meet a EpochMarker in ContinuousQueuedDataReader.next() in CP mode based on PR #21353 #21332 #21293 and the latest master
[ https://issues.apache.org/jira/browse/SPARK-24404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liangchang Zhu closed SPARK-24404.

> Increase currentEpoch when meet a EpochMarker in ContinuousQueuedDataReader.next() in CP mode based on PR #21353 #21332 #21293 and the latest master
> Key: SPARK-24404
> URL: https://issues.apache.org/jira/browse/SPARK-24404
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.3.0
> Reporter: Liangchang Zhu
> Priority: Major
>
> In CP mode, based on PRs #21353, #21332, #21293 and the latest master, ContinuousQueuedDataReader.next() is invoked by ContinuousDataSourceRDD.compute to return an UnsafeRow. When the currentEntry polled from the ArrayBlockingQueue is an EpochMarker, ContinuousQueuedDataReader sends a `ReportPartitionOffset` message to the epochCoordinator with the currentEpoch of EpochTracker. currentEpoch is a ThreadLocal variable, but currently nothing invokes `incrementCurrentEpoch` to increase it in its thread, so `getCurrentEpoch` returns `None` all the time (because currentEpoch is -1). This causes an exception when `None.get` is invoked. In addition, for `ReportPartitionOffset` to have correct semantics, we need to increase currentEpoch before sending this message.
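The failure mode described above (a thread-local counter that is never advanced in the reading thread, so the getter always yields the "unset" value) can be sketched in plain Python. This is a minimal analogue, not Spark's actual EpochTracker:

```python
import threading

class EpochTracker:
    """Minimal analogue (not Spark's EpochTracker) of a thread-local epoch
    counter: the getter returns None until the epoch has been advanced at
    least once in the *current* thread."""
    _local = threading.local()

    @classmethod
    def _epoch(cls):
        return getattr(cls._local, "epoch", -1)  # -1 is the "unset" sentinel

    @classmethod
    def get_current_epoch(cls):
        e = cls._epoch()
        return None if e < 0 else e  # mirrors Option/None in the report

    @classmethod
    def increment_current_epoch(cls):
        cls._local.epoch = cls._epoch() + 1

# Until increment_current_epoch() runs in this thread, the getter yields
# None; code that unconditionally unwraps it (Scala's None.get) would fail.
assert EpochTracker.get_current_epoch() is None
EpochTracker.increment_current_epoch()
assert EpochTracker.get_current_epoch() == 0
```

This is why the issue proposes incrementing currentEpoch before sending `ReportPartitionOffset`: the increment is what moves the thread-local value off its sentinel.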
[jira] [Resolved] (SPARK-24404) Increase currentEpoch when meet a EpochMarker in ContinuousQueuedDataReader.next() in CP mode based on PR #21353 #21332 #21293 and the latest master
[ https://issues.apache.org/jira/browse/SPARK-24404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liangchang Zhu resolved SPARK-24404. Resolution: Won't Fix

> Increase currentEpoch when meet a EpochMarker in ContinuousQueuedDataReader.next() in CP mode based on PR #21353 #21332 #21293 and the latest master
> Key: SPARK-24404
> URL: https://issues.apache.org/jira/browse/SPARK-24404
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.3.0
> Reporter: Liangchang Zhu
> Priority: Major
>
> In CP mode, based on PRs #21353, #21332, #21293 and the latest master, ContinuousQueuedDataReader.next() is invoked by ContinuousDataSourceRDD.compute to return an UnsafeRow. When the currentEntry polled from the ArrayBlockingQueue is an EpochMarker, ContinuousQueuedDataReader sends a `ReportPartitionOffset` message to the epochCoordinator with the currentEpoch of EpochTracker. currentEpoch is a ThreadLocal variable, but currently nothing invokes `incrementCurrentEpoch` to increase it in its thread, so `getCurrentEpoch` returns `None` all the time (because currentEpoch is -1). This causes an exception when `None.get` is invoked. In addition, for `ReportPartitionOffset` to have correct semantics, we need to increase currentEpoch before sending this message.
[jira] [Assigned] (SPARK-26525) Fast release memory of ShuffleBlockFetcherIterator
[ https://issues.apache.org/jira/browse/SPARK-26525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-26525: --- Assignee: liupengcheng

> Fast release memory of ShuffleBlockFetcherIterator
> Key: SPARK-26525
> URL: https://issues.apache.org/jira/browse/SPARK-26525
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle
> Affects Versions: 2.3.2
> Reporter: liupengcheng
> Assignee: liupengcheng
> Priority: Major
>
> Currently, Spark does not release a ShuffleBlockFetcherIterator until the whole task finishes.
> Under some conditions, this causes a memory leak.
> An example is Shuffle -> map -> Coalesce(shuffle = false). Each ShuffleBlockFetcherIterator contains some metadata about MapStatus (blocksByAddress), and each ShuffleMapTask keeps up to n (the number of shuffle partitions) ShuffleBlockFetcherIterators alive because they are referenced by the onCompleteCallbacks of the TaskContext. In some cases this can take a huge amount of memory, and the memory is not released until the task finishes.
> Actually, we can release a ShuffleBlockFetcherIterator as soon as it is consumed.
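The proposed fix pattern (free the fetched buffers as soon as the iterator is exhausted, instead of waiting for a task-completion callback) can be sketched in plain Python. This is an illustrative analogue, not Spark's ShuffleBlockFetcherIterator:

```python
class EagerReleaseIterator:
    """Sketch of eager release: drop references to fetched blocks the moment
    iteration completes, rather than when a completion callback fires."""

    def __init__(self, blocks):
        self._blocks = list(blocks)  # stands in for fetched shuffle blocks
        self._pos = 0
        self.released = False

    def __iter__(self):
        return self

    def __next__(self):
        if self._pos >= len(self._blocks):
            self._release()
            raise StopIteration
        item = self._blocks[self._pos]
        self._pos += 1
        return item

    def _release(self):
        if not self.released:
            self._blocks = None  # drop references so memory can be reclaimed
            self.released = True

it = EagerReleaseIterator([b"block0", b"block1"])
assert list(it) == [b"block0", b"block1"]
assert it.released  # buffers dropped immediately at exhaustion
```

Held-until-task-end iterators only differ in where `_release` is called; moving it to the point of exhaustion is the entire optimization the issue asks for.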
[jira] [Commented] (SPARK-24541) TCP based shuffle
[ https://issues.apache.org/jira/browse/SPARK-24541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757985#comment-16757985 ] Jungtaek Lim commented on SPARK-24541: -- Continuous processing requires a "single stage" so that all tasks stay launched all the time and processing never stops. It works with queries which only have map-like transformations, but stateful operations normally require hash-partitioning, so Spark needs some mechanism to route rows directly (the same as other streaming frameworks do). Some of this work was done more than half a year ago so I may not remember correctly, but the POC-like work was done with RPC, and this issue tracks the effort of changing that to TCP.

> TCP based shuffle
> Key: SPARK-24541
> URL: https://issues.apache.org/jira/browse/SPARK-24541
> Project: Spark
> Issue Type: Sub-task
> Components: Structured Streaming
> Affects Versions: 2.4.0
> Reporter: Jose Torres
> Priority: Major
[jira] [Updated] (SPARK-26744) Support schema validation in File Source V2
[ https://issues.apache.org/jira/browse/SPARK-26744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-26744: --- Description: The internal API supportDataType in FileFormat validates the output/input schema before task execution starts, so that we can avoid launching read/write tasks that would fail, and users see clean error messages. This PR implements the same internal API in the FileDataSourceV2 framework. Compared to FileFormat, FileDataSourceV2 has multiple layers. The API is added in two places: 1. Read path: the table schema is determined in TableProvider.getTable. The actual read schema can be a subset of the table schema. This PR proposes to validate the actual read schema in FileScan. 2. Write path: validate the actual output schema in FileWriteBuilder. was: The method supportDataType in FileFormat helps to validate the output/input schema before execution starts, so that we can avoid some invalid data source IO and users can see clean error messages. This PR implements the same method in the FileDataSourceV2 framework. Compared to FileFormat, FileDataSourceV2 has multiple layers. The API is added in two places: 1. FileWriteBuilder: this is where we can get the actual write schema. 2. FileScan: this is where we can get the actual read schema.

> Support schema validation in File Source V2
> Key: SPARK-26744
> URL: https://issues.apache.org/jira/browse/SPARK-26744
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Gengliang Wang
> Priority: Major
>
> The internal API supportDataType in FileFormat validates the output/input schema before task execution starts, so that we can avoid launching read/write tasks that would fail, and users see clean error messages.
> This PR implements the same internal API in the FileDataSourceV2 framework. Compared to FileFormat, FileDataSourceV2 has multiple layers. The API is added in two places:
> 1. Read path: the table schema is determined in TableProvider.getTable. The actual read schema can be a subset of the table schema. This PR proposes to validate the actual read schema in FileScan.
> 2. Write path: validate the actual output schema in FileWriteBuilder.
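The idea of failing before any task launches can be sketched with a schema check in plain Python. This is an illustrative stand-in, not Spark's supportDataType API; the dict-based schema representation is an assumption for the example:

```python
def validate_read_schema(table_schema, read_schema):
    """Hedged sketch of up-front schema validation: raise a clear error
    before any read task would start. Schemas are dicts of
    column name -> type name (an assumption for this example)."""
    for col, dtype in read_schema.items():
        if col not in table_schema:
            raise ValueError(f"Column '{col}' not found in table schema")
        if table_schema[col] != dtype:
            raise ValueError(
                f"Column '{col}' has type {table_schema[col]}, "
                f"but {dtype} was requested")
    return True

table = {"id": "bigint", "name": "string", "ts": "timestamp"}

# A valid subset of the table schema passes (the "actual read schema"
# may legitimately be narrower than the table schema):
assert validate_read_schema(table, {"id": "bigint", "name": "string"})

# An invalid request fails fast with a readable message instead of
# surfacing later as a task failure:
try:
    validate_read_schema(table, {"missing": "string"})
except ValueError as e:
    assert "not found" in str(e)
```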
[jira] [Commented] (SPARK-24541) TCP based shuffle
[ https://issues.apache.org/jira/browse/SPARK-24541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757936#comment-16757936 ] Imran Rashid commented on SPARK-24541: -- Can you explain what this means at all? Regular Spark shuffles are already done over TCP, but I'm sure you are referring to something entirely different here.

> TCP based shuffle
> Key: SPARK-24541
> URL: https://issues.apache.org/jira/browse/SPARK-24541
> Project: Spark
> Issue Type: Sub-task
> Components: Structured Streaming
> Affects Versions: 2.4.0
> Reporter: Jose Torres
> Priority: Major
[jira] [Updated] (SPARK-26744) Support schema validation in File Source V2
[ https://issues.apache.org/jira/browse/SPARK-26744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-26744: --- Description: The method supportDataType in FileFormat helps to validate the output/input schema before execution starts, so that we can avoid some invalid data source IO and users can see clean error messages. This PR implements the same method in the FileDataSourceV2 framework. Compared to FileFormat, FileDataSourceV2 has multiple layers. The API is added in two places: 1. FileWriteBuilder: this is where we can get the actual write schema. 2. FileScan: this is where we can get the actual read schema.

> Support schema validation in File Source V2
> Key: SPARK-26744
> URL: https://issues.apache.org/jira/browse/SPARK-26744
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Gengliang Wang
> Priority: Major
>
> The method supportDataType in FileFormat helps to validate the output/input schema before execution starts, so that we can avoid some invalid data source IO and users can see clean error messages.
> This PR implements the same method in the FileDataSourceV2 framework. Compared to FileFormat, FileDataSourceV2 has multiple layers. The API is added in two places:
> 1. FileWriteBuilder: this is where we can get the actual write schema.
> 2. FileScan: this is where we can get the actual read schema.
[jira] [Resolved] (SPARK-26730) Strip redundant AssertNotNull expression for ExpressionEncoder's serializer
[ https://issues.apache.org/jira/browse/SPARK-26730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-26730. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23651 [https://github.com/apache/spark/pull/23651]

> Strip redundant AssertNotNull expression for ExpressionEncoder's serializer
> Key: SPARK-26730
> URL: https://issues.apache.org/jira/browse/SPARK-26730
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: wuyi
> Priority: Major
> Fix For: 3.0.0
>
> For types like Product, we already add AssertNotNull when we construct the serializer (please see the code below), so we can strip the redundant AssertNotNull for those types. Please see the code with the related PR for details.
>
> {code:java}
> val fieldValue = Invoke(
>   AssertNotNull(inputObject, walkedTypePath), fieldName, dataTypeFor(fieldType),
>   returnNullable = !fieldType.typeSymbol.asClass.isPrimitive)
> {code}
[jira] [Assigned] (SPARK-26730) Strip redundant AssertNotNull expression for ExpressionEncoder's serializer
[ https://issues.apache.org/jira/browse/SPARK-26730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-26730: --- Assignee: wuyi

> Strip redundant AssertNotNull expression for ExpressionEncoder's serializer
> Key: SPARK-26730
> URL: https://issues.apache.org/jira/browse/SPARK-26730
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: wuyi
> Assignee: wuyi
> Priority: Major
> Fix For: 3.0.0
>
> For types like Product, we already add AssertNotNull when we construct the serializer (please see the code below), so we can strip the redundant AssertNotNull for those types. Please see the code with the related PR for details.
>
> {code:java}
> val fieldValue = Invoke(
>   AssertNotNull(inputObject, walkedTypePath), fieldName, dataTypeFor(fieldType),
>   returnNullable = !fieldType.typeSymbol.asClass.isPrimitive)
> {code}
[jira] [Commented] (SPARK-26786) Handle to treat escaped newline characters('\r','\n') in spark csv
[ https://issues.apache.org/jira/browse/SPARK-26786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757898#comment-16757898 ] Hyukjin Kwon commented on SPARK-26786: -- This behaviour is inherited from the Univocity parser, if I am not mistaken. Can you raise this issue there?

> Handle to treat escaped newline characters('\r','\n') in spark csv
> Key: SPARK-26786
> URL: https://issues.apache.org/jira/browse/SPARK-26786
> Project: Spark
> Issue Type: Bug
> Components: Input/Output, PySpark, SQL
> Affects Versions: 2.3.0
> Reporter: vishnuram selvaraj
> Priority: Major
>
> Some systems, such as AWS Redshift, write CSV files by escaping newline characters ('\r', '\n') in addition to escaping quote characters when they occur in the data.
> Redshift documentation link (https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html); below is its description of the escaping requirements:
> ESCAPE
> For CHAR and VARCHAR columns in delimited unload files, an escape character (\) is placed before every occurrence of the following characters:
> * Linefeed: \n
> * Carriage return: \r
> * The delimiter character specified for the unloaded data.
> * The escape character: \
> * A quote character: " or ' (if both ESCAPE and ADDQUOTES are specified in the UNLOAD command).
>
> *Problem statement:*
> The Spark CSV reader has no option to treat/remove the escape characters in front of newline characters in the data.
> It would really help if we could add a feature to handle escaped newline characters through another parameter, e.g. (escapeNewline = 'true/false').
> *Example:*
> Below are the details of my test data set up in a file.
> * The first record in that file has an escaped Windows newline character (\r\n)
> * The third record in that file has an escaped Unix newline character (\n)
> * The fifth record in that file has an escaped quote character (\")
> The file looks like below in a vi editor:
>
> {code:java}
> "1","this is \^M\
> line1"^M
> "2","this is line2"^M
> "3","this is \
> line3"^M
> "4","this is \" line4"^M
> "5","this is line5"^M
> {code}
>
> When I read the file with Python's csv module with the escape configured, it removes the added escape characters, as you can see below:
>
> {code:java}
> >>> with open('/tmp/test3.csv','r') as readCsv:
> ...     readFile = csv.reader(readCsv, dialect='excel', escapechar='\\', quotechar='"', delimiter=',', doublequote=False)
> ...     for row in readFile:
> ...         print(row)
> ...
> ['1', 'this is \r\n line1']
> ['2', 'this is line2']
> ['3', 'this is \n line3']
> ['4', 'this is " line4']
> ['5', 'this is line5']
> {code}
>
> But if I read the same file with the Spark CSV reader, the escape characters in front of the newline characters are not removed, while the escape before the (") is removed:
>
> {code:java}
> >>> redDf = spark.read.csv(path='file:///tmp/test3.csv', header='false', sep=',', quote='"', escape='\\', multiLine='true', ignoreLeadingWhiteSpace='true', ignoreTrailingWhiteSpace='true', mode='FAILFAST', inferSchema='false')
> >>> redDf.show()
> +---+-----------------+
> |_c0|              _c1|
> +---+-----------------+
> |  1|this is \
> line1|
> |  2|    this is line2|
> |  3|this is \
> line3|
> |  4|  this is " line4|
> |  5|    this is line5|
> +---+-----------------+
> {code}
> *Expected result:*
> {code:java}
> +---+-----------------+
> |_c0|              _c1|
> +---+-----------------+
> |  1|this is
> line1|
> |  2|    this is line2|
> |  3|this is
> line3|
> |  4|  this is " line4|
> |  5|    this is line5|
> +---+-----------------+
> {code}
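Until the reader grows such an option, the unescaping the reporter asks for can be done as a post-processing step. A hedged sketch (plain Python, not a Spark API): strip the single backslash that Redshift UNLOAD places before embedded newline characters, which the CSV parser leaves in place for \r / \n (unlike for quotes):

```python
import re

def unescape_newlines(value: str) -> str:
    """Remove a single escaping backslash immediately preceding a carriage
    return or line feed, leaving the newline character itself intact."""
    return re.sub(r'\\(\r|\n)', r'\1', value)

# Windows-style escaped newline: backslash before \r is removed, \n kept.
assert unescape_newlines("this is \\\r\n line1") == "this is \r\n line1"
# Unix-style escaped newline:
assert unescape_newlines("this is \\\n line3") == "this is \n line3"
# Values without escapes pass through unchanged:
assert unescape_newlines("this is line2") == "this is line2"
```

In a DataFrame pipeline the same regex could be applied per string column (e.g. via a UDF or `regexp_replace`), though doing it in the parser itself, as proposed, would be cleaner.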
[jira] [Resolved] (SPARK-26787) Fix standardization error message in WeightedLeastSquares
[ https://issues.apache.org/jira/browse/SPARK-26787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26787. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23705 [https://github.com/apache/spark/pull/23705]

> Fix standardization error message in WeightedLeastSquares
> Key: SPARK-26787
> URL: https://issues.apache.org/jira/browse/SPARK-26787
> Project: Spark
> Issue Type: Documentation
> Components: MLlib
> Affects Versions: 2.3.0, 2.3.1, 2.4.0
> Environment: Tested in Spark 2.4.0 on Databricks running in 5.1 ML Beta.
> Reporter: Brian Scannell
> Assignee: Brian Scannell
> Priority: Trivial
> Fix For: 3.0.0
>
> There is an error message in WeightedLeastSquares.scala that is incorrect and thus not very helpful for diagnosing an issue. The problem arises when doing regularized LinearRegression on a constant label. Even when the parameter standardization=False, the error will falsely state that standardization was set to true:
> {{The standard deviation of the label is zero. Model cannot be regularized with standardization=true}}
> This is because, under the hood, LinearRegression automatically sets a parameter standardizeLabel=True. This was chosen for consistency with GLMNet, although WeightedLeastSquares is written to allow standardizeLabel to be set either way and work (the public LinearRegression API does not allow it, though).
>
> I will submit a pull request with my suggested wording.
>
> Relevant:
> https://github.com/apache/spark/pull/10702
> https://github.com/apache/spark/pull/10274/commits/d591989f7383b713110750f80b2720bcf24814b5
>
> The following Python code will replicate the error:
> {code:java}
> import pandas as pd
> from pyspark.ml.feature import VectorAssembler
> from pyspark.ml.regression import LinearRegression
>
> df = pd.DataFrame({'foo': [1, 2, 3], 'bar': [4, 5, 6], 'label': [1, 1, 1]})
> spark_df = spark.createDataFrame(df)
> vectorAssembler = VectorAssembler(inputCols=['foo', 'bar'], outputCol='features')
> train_sdf = vectorAssembler.transform(spark_df).select(['features', 'label'])
> lr = LinearRegression(featuresCol='features', labelCol='label', fitIntercept=False, standardization=False, regParam=1e-4)
> lr_model = lr.fit(train_sdf)
> {code}
>
> For context, the reason someone might want to do this is if they are trying to fit a model to estimate components of a fixed total. The label indicates the total is always 100%, but the components vary. For example, trying to estimate the unknown weights of different quantities of substances in a series of full bins.
[jira] [Updated] (SPARK-24959) Do not invoke the CSV/JSON parser for empty schema
[ https://issues.apache.org/jira/browse/SPARK-24959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24959: --- Fix Version/s: (was: 2.4.0)

> Do not invoke the CSV/JSON parser for empty schema
> Key: SPARK-24959
> URL: https://issues.apache.org/jira/browse/SPARK-24959
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: Maxim Gekk
> Priority: Major
>
> Currently the JSON and CSV parsers are called even if the required schema is empty. Invoking the parser for each line has some non-zero overhead. The call can be skipped. This optimization should speed up count(), for example.
[jira] [Updated] (SPARK-26745) Non-parsing Dataset.count() optimization causes inconsistent results for JSON inputs with empty lines
[ https://issues.apache.org/jira/browse/SPARK-26745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26745: --- Fix Version/s: 2.4.1

> Non-parsing Dataset.count() optimization causes inconsistent results for JSON inputs with empty lines
> Key: SPARK-26745
> URL: https://issues.apache.org/jira/browse/SPARK-26745
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0, 3.0.0
> Reporter: Branden Smith
> Assignee: Hyukjin Kwon
> Priority: Blocker
> Labels: correctness
> Fix For: 2.4.1, 3.0.0
>
> The optimization introduced by SPARK-24959 (improving performance of count() for DataFrames read from non-multiline JSON in PERMISSIVE mode) appears to cause count() to erroneously include empty lines in its result total if run prior to JSON parsing taking place.
> For the following input:
> {code:json}
> { "a" : 1 , "b" : 2 , "c" : 3 }
> { "a" : 4 , "b" : 5 , "c" : 6 }
>
> { "a" : 7 , "b" : 8 , "c" : 9 }
> {code}
> *+Spark 2.3:+*
> {code:scala}
> scala> val df = spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json")
> df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field]
> scala> df.count
> res0: Long = 3
> scala> df.cache.count
> res3: Long = 3
> {code}
> *+Spark 2.4:+*
> {code:scala}
> scala> val df = spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json")
> df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field]
> scala> df.count
> res0: Long = 7
> scala> df.cache.count
> res1: Long = 3
> {code}
> Since the count is apparently updated and cached when the Jackson parser runs, the optimization also causes the count to appear to be unstable across cache/persist operations, as shown above.
> CSV inputs, also optimized via SPARK-24959, do not appear to be affected.
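The inconsistency above boils down to counting raw input lines versus counting parsed records. A hedged, Spark-free illustration in plain Python (the optimization path avoids parsing, so blank lines slip into the total; actually parsing each line skips them):

```python
import json

# Three JSON records with one blank line in between, like the input above.
data = '{ "a" : 1 }\n{ "a" : 4 }\n\n{ "a" : 7 }'
lines = data.split("\n")

count_without_parsing = len(lines)                    # "optimized" path
parsed = [json.loads(l) for l in lines if l.strip()]  # real parse path
count_with_parsing = len(parsed)

assert count_without_parsing == 4  # blank line is counted
assert count_with_parsing == 3     # blank line is skipped by the parser
```

This is also why `df.cache.count` flips back to the correct value: caching forces the real parse, after which the cached count reflects parsed records only.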
[jira] [Assigned] (SPARK-7721) Generate test coverage report from Python
[ https://issues.apache.org/jira/browse/SPARK-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-7721: --- Assignee: Hyukjin Kwon

> Generate test coverage report from Python
> Key: SPARK-7721
> URL: https://issues.apache.org/jira/browse/SPARK-7721
> Project: Spark
> Issue Type: Test
> Components: PySpark, Tests
> Reporter: Reynold Xin
> Assignee: Hyukjin Kwon
> Priority: Major
>
> It would be great to have a test coverage report for Python. Compared with Scala, it is trickier to understand the coverage without coverage reports in Python because we employ both docstring tests and unit tests in test files.
[jira] [Resolved] (SPARK-7721) Generate test coverage report from Python
[ https://issues.apache.org/jira/browse/SPARK-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-7721. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23117 [https://github.com/apache/spark/pull/23117]

> Generate test coverage report from Python
> Key: SPARK-7721
> URL: https://issues.apache.org/jira/browse/SPARK-7721
> Project: Spark
> Issue Type: Test
> Components: PySpark, Tests
> Reporter: Reynold Xin
> Assignee: Hyukjin Kwon
> Priority: Major
> Fix For: 3.0.0
>
> It would be great to have a test coverage report for Python. Compared with Scala, it is trickier to understand the coverage without coverage reports in Python because we employ both docstring tests and unit tests in test files.
[jira] [Created] (SPARK-26807) Confusing documentation regarding installation from PyPi
Emmanuel Arias created SPARK-26807: -- Summary: Confusing documentation regarding installation from PyPi Key: SPARK-26807 URL: https://issues.apache.org/jira/browse/SPARK-26807 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 2.4.0 Reporter: Emmanuel Arias

Hello! I am new to using Spark. Reading the documentation, I think the Downloading section is a little confusing. https://spark.apache.org/docs/latest/#downloading says: "Scala and Java users can include Spark in their projects using its Maven coordinates and in the future Python users can also install Spark from PyPI.", which I read as saying that Spark is not yet on PyPI. But https://spark.apache.org/downloads.html says: "PySpark (https://pypi.python.org/pypi/pyspark) is now available in pypi. To install just run pip install pyspark."
[jira] [Assigned] (SPARK-26808) Pruned schema should not change nullability
[ https://issues.apache.org/jira/browse/SPARK-26808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26808: Assignee: (was: Apache Spark)

> Pruned schema should not change nullability
> Key: SPARK-26808
> URL: https://issues.apache.org/jira/browse/SPARK-26808
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Liang-Chi Hsieh
> Priority: Minor
>
> We prune unnecessary nested fields from the requested schema when reading Parquet. Currently, it seems we do not keep the original nullability in the pruned schema. We should keep the original nullability.
[jira] [Commented] (SPARK-26808) Pruned schema should not change nullability
[ https://issues.apache.org/jira/browse/SPARK-26808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757880#comment-16757880 ] Apache Spark commented on SPARK-26808: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/23719

> Pruned schema should not change nullability
> Key: SPARK-26808
> URL: https://issues.apache.org/jira/browse/SPARK-26808
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Liang-Chi Hsieh
> Priority: Minor
>
> We prune unnecessary nested fields from the requested schema when reading Parquet. Currently, it seems we do not keep the original nullability in the pruned schema. We should keep the original nullability.
[jira] [Assigned] (SPARK-26808) Pruned schema should not change nullability
[ https://issues.apache.org/jira/browse/SPARK-26808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26808: Assignee: Apache Spark

> Pruned schema should not change nullability
> Key: SPARK-26808
> URL: https://issues.apache.org/jira/browse/SPARK-26808
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Liang-Chi Hsieh
> Assignee: Apache Spark
> Priority: Minor
>
> We prune unnecessary nested fields from the requested schema when reading Parquet. Currently, it seems we do not keep the original nullability in the pruned schema. We should keep the original nullability.
[jira] [Created] (SPARK-26808) Pruned schema should not change nullability
Liang-Chi Hsieh created SPARK-26808: --- Summary: Pruned schema should not change nullability Key: SPARK-26808 URL: https://issues.apache.org/jira/browse/SPARK-26808 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Liang-Chi Hsieh We prune unnecessary nested fields from requested schema when reading Parquet. Now seems we don't keep original nullability in pruned schema. We should keep original nullability.
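The invariant this issue asks for can be sketched in a few lines of plain Python. This is illustrative only: Spark's actual schema pruning lives in the Scala Parquet reader, and the `Field`/`prune_schema` names below are hypothetical stand-ins for `StructField` and the pruning logic, not Spark's API.

```python
from dataclasses import dataclass
from typing import List, Set

@dataclass(frozen=True)
class Field:
    # Hypothetical stand-in for Spark's StructField; not the real API.
    name: str
    dtype: str
    nullable: bool

def prune_schema(fields: List[Field], requested: Set[str]) -> List[Field]:
    # Keep only the requested fields, passing each surviving field through
    # unchanged -- in particular its original `nullable` flag, which is
    # exactly what the reported bug lost.
    return [f for f in fields if f.name in requested]

schema = [Field("id", "long", False), Field("name", "string", True)]
pruned = prune_schema(schema, {"id"})
assert pruned[0].nullable == schema[0].nullable  # nullability preserved
```

The point is that pruning should be a pure projection of the requested schema; any rewrite of the surviving fields' metadata (here, `nullable`) is a bug.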
[jira] [Updated] (SPARK-26795) Retry remote fileSegmentManagedBuffer when creating inputStream failed during shuffle read phase
[ https://issues.apache.org/jira/browse/SPARK-26795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-26795: -- Flags: (was: Patch) Target Version/s: (was: 2.3.2, 2.4.0) Fix Version/s: (was: 2.3.2) (was: 2.3.1) (was: 2.4.0) (was: 2.3.0) (Don't set Fix, Target versions) > Retry remote fileSegmentManagedBuffer when creating inputStream failed during > shuffle read phase > > > Key: SPARK-26795 > URL: https://issues.apache.org/jira/browse/SPARK-26795 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: feiwang >Priority: Major > > There is a parameter spark.maxRemoteBlockSizeFetchToMem, which means the > remote block will be fetched to disk when size of the block is above this > threshold in bytes. > So during shuffle read phase, the managedBuffer which throw IOException may > be a remote downloaded FileSegment and should be retried instead of > throwFetchFailed directly.
[jira] [Assigned] (SPARK-26787) Fix standardization error message in WeightedLeastSquares
[ https://issues.apache.org/jira/browse/SPARK-26787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-26787: - Assignee: Brian Scannell > Fix standardization error message in WeightedLeastSquares > - > > Key: SPARK-26787 > URL: https://issues.apache.org/jira/browse/SPARK-26787 > Project: Spark > Issue Type: Documentation > Components: MLlib >Affects Versions: 2.3.0, 2.3.1, 2.4.0 > Environment: Tested in Spark 2.4.0 on DataBricks running in 5.1 ML > Beta. > >Reporter: Brian Scannell >Assignee: Brian Scannell >Priority: Trivial > > There is an error message in WeightedLeastSquares.scala that is incorrect and > thus not very helpful for diagnosing an issue. The problem arises when doing > regularized LinearRegression on a constant label. Even when the parameter > standardization=False, the error will falsely state that standardization was > set to True: > {{The standard deviation of the label is zero. Model cannot be regularized > with standardization=true}} > This is because under the hood, LinearRegression automatically sets a > parameter standardizeLabel=True. This was chosen for consistency with GLMNet, > although WeightedLeastSquares is written to allow standardizeLabel to be set > either way and work (although the public LinearRegression API does not allow > it). > > I will submit a pull request with my suggested wording. > > Relevant: > [https://github.com/apache/spark/pull/10702] > [https://github.com/apache/spark/pull/10274/commits/d591989f7383b713110750f80b2720bcf24814b5] > > > The following Python code will replicate the error. 
> {code:python} > import pandas as pd > from pyspark.ml.feature import VectorAssembler > from pyspark.ml.regression import LinearRegression > df = pd.DataFrame({'foo': [1,2,3], 'bar':[4,5,6],'label':[1,1,1]}) > spark_df = spark.createDataFrame(df) > vectorAssembler = VectorAssembler(inputCols = ['foo', 'bar'], outputCol = > 'features') > train_sdf = vectorAssembler.transform(spark_df).select(['features', 'label']) > lr = LinearRegression(featuresCol='features', labelCol='label', > fitIntercept=False, standardization=False, regParam=1e-4) > lr_model = lr.fit(train_sdf) > {code} > > For context, the reason someone might want to do this is if they are trying > to fit a model to estimate components of a fixed total. The label indicates > the total is always 100%, but the components vary. For example, trying to > estimate the unknown weights of different quantities of substances in a > series of full bins.
[jira] [Resolved] (SPARK-25997) Python example code for Power Iteration Clustering in spark.ml
[ https://issues.apache.org/jira/browse/SPARK-25997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25997. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22996 [https://github.com/apache/spark/pull/22996] > Python example code for Power Iteration Clustering in spark.ml > -- > > Key: SPARK-25997 > URL: https://issues.apache.org/jira/browse/SPARK-25997 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > > Add a python example code for Power iteration clustering in spark.ml examples
[jira] [Assigned] (SPARK-25997) Python example code for Power Iteration Clustering in spark.ml
[ https://issues.apache.org/jira/browse/SPARK-25997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-25997: - Assignee: Huaxin Gao > Python example code for Power Iteration Clustering in spark.ml > -- > > Key: SPARK-25997 > URL: https://issues.apache.org/jira/browse/SPARK-25997 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > > Add a python example code for Power iteration clustering in spark.ml examples
[jira] [Updated] (SPARK-26726) Synchronize the amount of memory used by the broadcast variable to the UI display
[ https://issues.apache.org/jira/browse/SPARK-26726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hantiantian updated SPARK-26726: Description: The amount of memory used by the broadcast variable is not synchronized to the UI display, spark-sql> select /*+ broadcast(a)*/ a.id,b.id from a join b on a.id = b.id; View the app's driver log: 2019-01-25 16:45:23,726 INFO org.apache.spark.storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on 10.43.xx.xx:33907 (size: 6.6 KB, free: 2.5 GB) 2019-01-25 16:45:23,727 INFO org.apache.spark.storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on 10.43.xx.xx:38399 (size: 6.6 KB, free: 2.5 GB) 2019-01-25 16:45:23,745 INFO org.apache.spark.storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on 10.43.xx.xx:33907 (size: 32.1 KB, free: 2.5 GB) 2019-01-25 16:45:23,749 INFO org.apache.spark.storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on 10.43.xx.xx:38399 (size: 32.1 KB, free: 2.5 GB) 2019-01-25 16:45:23,838 INFO org.apache.spark.storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 10.43.xx.xx:38399 (size: 147.0 B, free: 2.5 GB) 2019-01-25 16:45:23,840 INFO org.apache.spark.storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 10.43.xx.xx:33907 (size: 147.0 B, free: 2.5 GB) Web UI does not have the use of memory, ||Executor ID||Address||Status||RDD Blocks||Storage Memory||Disk Used||Cores||Active Tasks||Failed Tasks||Complete Tasks||Total Tasks||Task Time (GC Time)||Input||Shuffle Read||Shuffle Write||Logs||Thread Dump|| |0|xxx:38399|Active|0|0.0 B / 2.7 GB|0.0 B|1|0|0|2|2|4 s (0.4 s)|8 B|0.0 B|0.0 B| | | |driver|xxx:47936|Active|0|0.0 B / 384.1 MB|0.0 B|0|0|0|0|0|0.0 ms (0.0 ms)|0.0 B|0.0 B|0.0 B| | | |1|xxx:47414|Active|0|0.0 B / 2.7 GB|0.0 B|1|0|0|0|0|0.0 ms (0.0 ms)|0.0 B|0.0 B|0.0 B| | | |2|xxx:33907|Active|0|0.0 B / 2.7 GB|0.0 B|1|0|0|2|2|4 s (0.2 s)|4 B|0.0 B|0.0 B| | | was: The amount of memory used by the broadcast variable is not synchronized 
to the UI display, spark-sql> select /*+ broadcast(a)*/ a.id,b.id from a join b on a.id = b.id; View the app's driver log: 2019-01-25 16:45:23,726 INFO org.apache.spark.storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on 10.43.xx.xx:33907 (size: 6.6 KB, free: 2.5 GB) 2019-01-25 16:45:23,727 INFO org.apache.spark.storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on 10.43.xx.xx:38399 (size: 6.6 KB, free: 2.5 GB) 2019-01-25 16:45:23,745 INFO org.apache.spark.storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on 10.43.xx.xx:33907 (size: 32.1 KB, free: 2.5 GB) 2019-01-25 16:45:23,749 INFO org.apache.spark.storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on 10.43.xx.xx:38399 (size: 32.1 KB, free: 2.5 GB) 2019-01-25 16:45:23,838 INFO org.apache.spark.storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 10.43.xx.xx:38399 (size: 147.0 B, free: 2.5 GB) 2019-01-25 16:45:23,840 INFO org.apache.spark.storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 10.43.xx.xx:33907 (size: 147.0 B, free: 2.5 GB) Web UI does not have the use of memory, ||Executor ID||Address||Status||RDD Blocks||Storage Memory||Disk Used||Cores||Active Tasks||Failed Tasks||Complete Tasks||Total Tasks||Task Time (GC Time)||Input||Shuffle Read||Shuffle Write||Logs||Thread Dump|| |0|xxx:38399|Active|0|0.0 B / 2.7 GB|0.0 B|1|0|0|2|2|4 s (0.4 s)|8 B|0.0 B|0.0 B|[stdout|http://10.43.183.120:18085/logPage/?appId=app-20190125164426-0003=0=stdout] [stderr|http://10.43.183.120:18085/logPage/?appId=app-20190125164426-0003=0=stderr]|[Thread Dump|http://10.43.183.121:4040/executors/threadDump/?executorId=0]| |driver|xxx:47936|Active|0|0.0 B / 384.1 MB|0.0 B|0|0|0|0|0|0.0 ms (0.0 ms)|0.0 B|0.0 B|0.0 B| |[Thread Dump|http://10.43.183.121:4040/executors/threadDump/?executorId=driver]| |1|xxx:47414|Active|0|0.0 B / 2.7 GB|0.0 B|1|0|0|0|0|0.0 ms (0.0 ms)|0.0 B|0.0 B|0.0 B|[stdout|http://10.43.183.121:18085/logPage/?appId=app-20190125164426-0003=1=stdout] 
[stderr|http://10.43.183.121:18085/logPage/?appId=app-20190125164426-0003=1=stderr]|[Thread Dump|http://10.43.183.121:4040/executors/threadDump/?executorId=1]| |2|xxx:33907|Active|0|0.0 B / 2.7 GB|0.0 B|1|0|0|2|2|4 s (0.2 s)|4 B|0.0 B|0.0 B|[stdout|http://10.43.183.122:18085/logPage/?appId=app-20190125164426-0003=2=stdout] [stderr|http://10.43.183.122:18085/logPage/?appId=app-20190125164426-0003=2=stderr]|[Thread Dump|http://10.43.183.121:4040/executors/threadDump/?executorId=2]| > Synchronize the amount of memory used by the broadcast variable to the UI > display > --- > > Key: SPARK-26726 > URL: https://issues.apache.org/jira/browse/SPARK-26726 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects
[jira] [Comment Edited] (SPARK-26783) Kafka parameter documentation doesn't match with the reality (upper/lowercase)
[ https://issues.apache.org/jira/browse/SPARK-26783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757842#comment-16757842 ] Jungtaek Lim edited comment on SPARK-26783 at 2/1/19 1:01 AM: -- I'm not sure about what you're referring to here from SPARK-23685: {quote} Originally this pr was created as "failOnDataLoss" doesn't have any impact when set in structured streaming. But found out that ,the variable that needs to be used is "failondataloss" (all in lower case). {quote} I just played with test "failOnDataLoss=false should not return duplicated records: v1" in KafkaDontFailOnDataLossSuite, and looks like it works as intended (as case-insensitive manner). "failOnDataLoss" -> "false" // passed "failondataloss" -> "false" // passed "FAILONDATALOSS" -> "false" // passed // failed "failOnDataLoss" -> "true" // failed was (Author: kabhwan): I'm not sure about what [~sindiri] left a comment on the PR: {quote} Originally this pr was created as "failOnDataLoss" doesn't have any impact when set in structured streaming. But found out that ,the variable that needs to be used is "failondataloss" (all in lower case). {quote} I just played with test "failOnDataLoss=false should not return duplicated records: v1" in KafkaDontFailOnDataLossSuite, and looks like it works as intended (as case-insensitive manner). "failOnDataLoss" -> "false" // passed "failondataloss" -> "false" // passed "FAILONDATALOSS" -> "false" // passed // failed "failOnDataLoss" -> "true" // failed > Kafka parameter documentation doesn't match with the reality (upper/lowercase) > -- > > Key: SPARK-26783 > URL: https://issues.apache.org/jira/browse/SPARK-26783 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Minor > > A good example for this is "failOnDataLoss" which is reported in SPARK-23685. > I've just checked and there are several other parameters which suffer from > the same issue. 
[jira] [Commented] (SPARK-26783) Kafka parameter documentation doesn't match with the reality (upper/lowercase)
[ https://issues.apache.org/jira/browse/SPARK-26783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757842#comment-16757842 ] Jungtaek Lim commented on SPARK-26783: -- I'm not sure about what [~sindiri] left a comment on the PR: {quote} Originally this pr was created as "failOnDataLoss" doesn't have any impact when set in structured streaming. But found out that ,the variable that needs to be used is "failondataloss" (all in lower case). {quote} I just played with test "failOnDataLoss=false should not return duplicated records: v1" in KafkaDontFailOnDataLossSuite, and looks like it works as intended (as case-insensitive manner). "failOnDataLoss" -> "false" // passed "failondataloss" -> "false" // passed "FAILONDATALOSS" -> "false" // passed // failed "failOnDataLoss" -> "true" // failed > Kafka parameter documentation doesn't match with the reality (upper/lowercase) > -- > > Key: SPARK-26783 > URL: https://issues.apache.org/jira/browse/SPARK-26783 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Minor > > A good example for this is "failOnDataLoss" which is reported in SPARK-23685. > I've just checked and there are several other parameters which suffer from > the same issue.
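The case-insensitive behavior observed in the test above comes from Spark normalizing option keys (its Scala `CaseInsensitiveMap`). The Python class below is a minimal illustrative sketch of that idea, not Spark's actual implementation:

```python
class CaseInsensitiveDict(dict):
    """Illustrative sketch only: normalize keys on write and read so that
    "failOnDataLoss", "failondataloss" and "FAILONDATALOSS" all resolve to
    the same entry, as the Kafka source's options do."""
    def __setitem__(self, key, value):
        super().__setitem__(key.lower(), value)
    def __getitem__(self, key):
        return super().__getitem__(key.lower())
    def get(self, key, default=None):
        return super().get(key.lower(), default)

opts = CaseInsensitiveDict()
opts["failOnDataLoss"] = "false"
assert opts["failondataloss"] == "false"
assert opts["FAILONDATALOSS"] == "false"
```

This is why the issue is a documentation mismatch rather than a behavior bug: the documented camelCase spelling and the lowercase spelling both work.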
[jira] [Resolved] (SPARK-26793) Remove spark.shuffle.manager
[ https://issues.apache.org/jira/browse/SPARK-26793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian resolved SPARK-26793. - Resolution: Invalid > Remove spark.shuffle.manager > > > Key: SPARK-26793 > URL: https://issues.apache.org/jira/browse/SPARK-26793 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Minor > > Currently, `ShuffleManager` always uses `SortShuffleManager`, I think this > configuration can be removed.
[jira] [Commented] (SPARK-25136) unable to use HDFS checkpoint directories after driver restart
[ https://issues.apache.org/jira/browse/SPARK-25136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757828#comment-16757828 ] Robert Reid commented on SPARK-25136: - [~gsomogyi] I haven't had a chance to retry it. Our build system was changed since I last was working on this. So I'll need to rework some things to get that environment runnable. > unable to use HDFS checkpoint directories after driver restart > -- > > Key: SPARK-25136 > URL: https://issues.apache.org/jira/browse/SPARK-25136 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Robert Reid >Priority: Major > > I have been attempting to work around the issue first discussed at the end of > SPARK-20894. The problem that we are encountering is the inability of the > Spark Driver to process additional data after restart. Upon restart it > reprocesses batches from the initial run. But when processing the first batch > it crashes because of missing checkpoint files. > We have a structured streaming job running on a Spark Standalone cluster. We > restart the Spark Driver to renew the Kerberos token in use to access HDFS. > Excerpts of log lines from a run are included below. In short, when the new > driver starts it reports that it is going to use a new checkpoint directory > in HDFS. Workers report that they have loaded a StateStoreProviders that > matches the directory. But then the worker reports that it cannot read the > delta files. This causes the driver to crash. > The HDFS directories are present but empty. Further, the directory > permissions for the original and new checkpoint directories are the same. The > worker never crashes. > As mentioned in SPARK-20894, deleting the _spark_metadata directory makes > subsequent restarts succeed. > Here is a timeline of log records from a recent run. > A new run began at 00:29:21. These entries from a worker log look good. 
> {{18/08/16 00:30:21 INFO HDFSBackedStateStoreProvider: Retrieved version 0 of > HDFSStateStoreProvider[id = (op=0,part=0),dir = > hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0] > for update}} > {{18/08/16 00:30:23 INFO HDFSBackedStateStoreProvider: Committed version 1 > for > HDFSStateStore[id=(op=0,part=0),dir=hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0] > to file > hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0/1.delta}} > As the shutdown is occurring the worker reports > {{18/08/16 00:39:11 INFO HDFSBackedStateStoreProvider: Aborted version 29 for > HDFSStateStore[id=(op=0,part=0),dir=hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0]}} > The restart began at 00:39:38. > Driver log entries > {{18/08/16 00:39:51 INFO MicroBatchExecution: Starting [id = > e188d15f-e26a-48fd-9ce6-8c57ce53c2c1, runId = > b7ee0163-47db-4392-ab66-94d36ce63074]. 
Use > hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040 > to store the query checkpoint.}} > {{18/08/16 00:40:26 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 10, > 10.251.104.164, executor 3): java.lang.IllegalStateException: Error reading > delta file > hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta > of HDFSStateStoreProvider[id = (op=0,part=0),dir = > hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0]: > > hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta > does not exist}} > {{Caused by: > org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File > does not exist: > /projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta}} > > Worker log entries > {{18/08/16 00:40:26 INFO StateStore: Reported that the loaded instance >
[jira] [Commented] (SPARK-26783) Kafka parameter documentation doesn't match with the reality (upper/lowercase)
[ https://issues.apache.org/jira/browse/SPARK-26783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757824#comment-16757824 ] Shixiong Zhu commented on SPARK-26783: -- [~gsomogyi] This seems just an API document issue. Right? If the user passes "failOnDataLoss", the Kafka source will pick it up correctly. > Kafka parameter documentation doesn't match with the reality (upper/lowercase) > -- > > Key: SPARK-26783 > URL: https://issues.apache.org/jira/browse/SPARK-26783 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Minor > > A good example for this is "failOnDataLoss" which is reported in SPARK-23685. > I've just checked and there are several other parameters which suffer from > the same issue.
[jira] [Updated] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
[ https://issues.apache.org/jira/browse/SPARK-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-26806: - Description: Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will make "avg" become "NaN". And whatever gets merged with the result of "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong will return "0" and the user will see the following incorrect report: {code} "eventTime" : { "avg" : "1970-01-01T00:00:00.000Z", "max" : "2019-01-31T12:57:00.000Z", "min" : "2019-01-30T18:44:04.000Z", "watermark" : "1970-01-01T00:00:00.000Z" } {code} This issue was reported by [~liancheng] was: Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will make "avg" become "NaN". And whatever gets merged with the result of "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong will return "0" and the user will see the following incorrect report: {code} "eventTime" : { "avg" : "1970-01-01T00:00:00.000Z", "max" : "2019-01-31T12:57:00.000Z", "min" : "2019-01-30T18:44:04.000Z", "watermark" : "1970-01-01T00:00:00.000Z" } {code} > EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly > > > Key: SPARK-26806 > URL: https://issues.apache.org/jira/browse/SPARK-26806 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.1, 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > > Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will > make "avg" become "NaN". And whatever gets merged with the result of > "zero.merge(zero)", "avg" will still be "NaN". 
Then finally, "NaN".toLong > will return "0" and the user will see the following incorrect report: > {code} > "eventTime" : { > "avg" : "1970-01-01T00:00:00.000Z", > "max" : "2019-01-31T12:57:00.000Z", > "min" : "2019-01-30T18:44:04.000Z", > "watermark" : "1970-01-01T00:00:00.000Z" > } > {code} > This issue was reported by [~liancheng]
[jira] [Assigned] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
[ https://issues.apache.org/jira/browse/SPARK-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26806: Assignee: Apache Spark (was: Shixiong Zhu) > EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly > > > Key: SPARK-26806 > URL: https://issues.apache.org/jira/browse/SPARK-26806 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.1, 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark >Priority: Major > > Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will > make "avg" become "NaN". And whatever gets merged with the result of > "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong > will return "0" and the user will see the following incorrect report: > {code} > "eventTime" : { > "avg" : "1970-01-01T00:00:00.000Z", > "max" : "2019-01-31T12:57:00.000Z", > "min" : "2019-01-30T18:44:04.000Z", > "watermark" : "1970-01-01T00:00:00.000Z" > } > {code}
[jira] [Updated] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
[ https://issues.apache.org/jira/browse/SPARK-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-26806: - Reporter: liancheng (was: Shixiong Zhu) > EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly > > > Key: SPARK-26806 > URL: https://issues.apache.org/jira/browse/SPARK-26806 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.1, 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: liancheng >Assignee: Shixiong Zhu >Priority: Major > > Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will > make "avg" become "NaN". And whatever gets merged with the result of > "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong > will return "0" and the user will see the following incorrect report: > {code} > "eventTime" : { > "avg" : "1970-01-01T00:00:00.000Z", > "max" : "2019-01-31T12:57:00.000Z", > "min" : "2019-01-30T18:44:04.000Z", > "watermark" : "1970-01-01T00:00:00.000Z" > } > {code} > This issue was reported by [~liancheng]
[jira] [Assigned] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
[ https://issues.apache.org/jira/browse/SPARK-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26806: Assignee: Shixiong Zhu (was: Apache Spark) > EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly > > > Key: SPARK-26806 > URL: https://issues.apache.org/jira/browse/SPARK-26806 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.1, 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > > Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will > make "avg" become "NaN". And whatever gets merged with the result of > "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong > will return "0" and the user will see the following incorrect report: > {code} > "eventTime" : { > "avg" : "1970-01-01T00:00:00.000Z", > "max" : "2019-01-31T12:57:00.000Z", > "min" : "2019-01-30T18:44:04.000Z", > "watermark" : "1970-01-01T00:00:00.000Z" > } > {code}
[jira] [Updated] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
[ https://issues.apache.org/jira/browse/SPARK-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-26806: - Affects Version/s: 2.2.1 2.3.0 2.3.1 2.3.2 > EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly > > > Key: SPARK-26806 > URL: https://issues.apache.org/jira/browse/SPARK-26806 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.1, 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > > Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will > make "avg" become "NaN". And whatever gets merged with the result of > "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong > will return "0" and the user will see the following incorrect report: > {code} > "eventTime" : { > "avg" : "1970-01-01T00:00:00.000Z", > "max" : "2019-01-31T12:57:00.000Z", > "min" : "2019-01-30T18:44:04.000Z", > "watermark" : "1970-01-01T00:00:00.000Z" > } > {code}
[jira] [Created] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
Shixiong Zhu created SPARK-26806: Summary: EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly Key: SPARK-26806 URL: https://issues.apache.org/jira/browse/SPARK-26806 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.4.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will make "avg" become "NaN". And whatever gets merged with the result of "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong will return "0" and the user will see the following incorrect report: {code} "eventTime" : { "avg" : "1970-01-01T00:00:00.000Z", "max" : "2019-01-31T12:57:00.000Z", "min" : "2019-01-30T18:44:04.000Z", "watermark" : "1970-01-01T00:00:00.000Z" } {code}
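The NaN propagation described above can be modeled in a few lines. This is an illustrative Python sketch, not the actual Scala `EventTimeStats`; the field and method names loosely mirror the Spark class, and Scala's silent `0.0 / 0 = NaN` is emulated explicitly:

```python
import math

class EventTimeStats:
    # Illustrative model of the bug; the real class is Scala code in
    # Spark's Structured Streaming module.
    def __init__(self, avg=0.0, count=0):
        self.avg, self.count = avg, count

    def merge(self, that):
        # Weighted average of the two sides. When both sides are the "zero"
        # instance, total is 0 and Scala's Double division yields NaN.
        total = self.count + that.count
        weighted = self.avg * self.count + that.avg * that.count
        self.avg = weighted / total if total else math.nan
        self.count = total

zero = EventTimeStats()
zero.merge(EventTimeStats())          # zero.merge(zero): avg becomes NaN
zero.merge(EventTimeStats(100.0, 4))  # NaN * 0 is still NaN, so avg stays NaN
assert math.isnan(zero.avg)
```

Since Scala's `Double.NaN.toLong` is 0, the NaN then surfaces as the bogus `1970-01-01T00:00:00.000Z` epoch timestamp in the report; a fix guards the empty side before averaging (e.g. returning the non-empty stats unchanged).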
[jira] [Commented] (SPARK-26654) Use Timestamp/DateFormatter in CatalogColumnStat
[ https://issues.apache.org/jira/browse/SPARK-26654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757814#comment-16757814 ] Apache Spark commented on SPARK-26654: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/23662 > Use Timestamp/DateFormatter in CatalogColumnStat > > > Key: SPARK-26654 > URL: https://issues.apache.org/jira/browse/SPARK-26654 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Major > > Need to switch fromExternalString on Timestamp/DateFormatters, in particular: > https://github.com/apache/spark/blob/3b7395fe025a4c9a591835e53ac6ca05be6868f1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L481-L482
[jira] [Assigned] (SPARK-26654) Use Timestamp/DateFormatter in CatalogColumnStat
[ https://issues.apache.org/jira/browse/SPARK-26654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26654: Assignee: (was: Apache Spark) > Use Timestamp/DateFormatter in CatalogColumnStat > > > Key: SPARK-26654 > URL: https://issues.apache.org/jira/browse/SPARK-26654 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Major > > Need to switch fromExternalString on Timestamp/DateFormatters, in particular: > https://github.com/apache/spark/blob/3b7395fe025a4c9a591835e53ac6ca05be6868f1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L481-L482
[jira] [Assigned] (SPARK-26654) Use Timestamp/DateFormatter in CatalogColumnStat
[ https://issues.apache.org/jira/browse/SPARK-26654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26654: Assignee: Apache Spark > Use Timestamp/DateFormatter in CatalogColumnStat > > > Key: SPARK-26654 > URL: https://issues.apache.org/jira/browse/SPARK-26654 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > Need to switch fromExternalString on Timestamp/DateFormatters, in particular: > https://github.com/apache/spark/blob/3b7395fe025a4c9a591835e53ac6ca05be6868f1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L481-L482 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26805) Eliminate double checking of stringToDate and stringToTimestamp inputs
[ https://issues.apache.org/jira/browse/SPARK-26805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26805: Assignee: Apache Spark > Eliminate double checking of stringToDate and stringToTimestamp inputs > -- > > Key: SPARK-26805 > URL: https://issues.apache.org/jira/browse/SPARK-26805 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > The LocalDate and LocalTime classes, as well as the other java.time classes used > inside of stringToTimestamp and stringToDate, already check their input parameters. > There is no need to do that twice. The ticket aims to remove > isInvalidDate() from DateTimeUtils and eliminate such double checking. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26757) GraphX EdgeRDDImpl and VertexRDDImpl `count` method cannot handle empty RDDs
[ https://issues.apache.org/jira/browse/SPARK-26757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26757. --- Resolution: Fixed Fix Version/s: 2.3.4 2.4.1 3.0.0 Issue resolved by pull request 23681 [https://github.com/apache/spark/pull/23681] > GraphX EdgeRDDImpl and VertexRDDImpl `count` method cannot handle empty RDDs > > > Key: SPARK-26757 > URL: https://issues.apache.org/jira/browse/SPARK-26757 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 2.3.1, 2.3.2, 2.4.0 >Reporter: Huon Wilson >Assignee: Huon Wilson >Priority: Minor > Fix For: 3.0.0, 2.4.1, 2.3.4 > > > The {{EdgeRDDImpl}} and {{VertexRDDImpl}} types provided by {{GraphX}} throw > an {{java.lang.UnsupportedOperationException: empty collection}} exception if > {{count}} is called on an empty instance, when they should return 0. > {code:scala} > import org.apache.spark.graphx.{Graph, Edge} > val graph = Graph.fromEdges(sc.emptyRDD[Edge[Unit]], 0) > graph.vertices.count > graph.edges.count > {code} > Running that code in a spark-shell: > {code:none} > scala> import org.apache.spark.graphx.{Graph, Edge} > import org.apache.spark.graphx.{Graph, Edge} > scala> val graph = Graph.fromEdges(sc.emptyRDD[Edge[Unit]], 0) > graph: org.apache.spark.graphx.Graph[Int,Unit] = > org.apache.spark.graphx.impl.GraphImpl@6879e983 > scala> graph.vertices.count > java.lang.UnsupportedOperationException: empty collection > at > org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$36.apply(RDD.scala:1031) > at > org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$36.apply(RDD.scala:1031) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1031) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) > at 
org.apache.spark.rdd.RDD.reduce(RDD.scala:1011) > at org.apache.spark.graphx.impl.VertexRDDImpl.count(VertexRDDImpl.scala:90) > ... 49 elided > scala> graph.edges.count > java.lang.UnsupportedOperationException: empty collection > at > org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$36.apply(RDD.scala:1031) > at > org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$36.apply(RDD.scala:1031) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1031) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) > at org.apache.spark.rdd.RDD.reduce(RDD.scala:1011) > at org.apache.spark.graphx.impl.EdgeRDDImpl.count(EdgeRDDImpl.scala:90) > ... 49 elided > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26805) Eliminate double checking of stringToDate and stringToTimestamp inputs
[ https://issues.apache.org/jira/browse/SPARK-26805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26805: Assignee: (was: Apache Spark) > Eliminate double checking of stringToDate and stringToTimestamp inputs > -- > > Key: SPARK-26805 > URL: https://issues.apache.org/jira/browse/SPARK-26805 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > The LocalDate and LocalTime classes, as well as the other java.time classes used > inside of stringToTimestamp and stringToDate, already check their input parameters. > There is no need to do that twice. The ticket aims to remove > isInvalidDate() from DateTimeUtils and eliminate such double checking. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26757) GraphX EdgeRDDImpl and VertexRDDImpl `count` method cannot handle empty RDDs
[ https://issues.apache.org/jira/browse/SPARK-26757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-26757: - Assignee: Huon Wilson > GraphX EdgeRDDImpl and VertexRDDImpl `count` method cannot handle empty RDDs > > > Key: SPARK-26757 > URL: https://issues.apache.org/jira/browse/SPARK-26757 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 2.3.1, 2.3.2, 2.4.0 >Reporter: Huon Wilson >Assignee: Huon Wilson >Priority: Minor > > The {{EdgeRDDImpl}} and {{VertexRDDImpl}} types provided by {{GraphX}} throw > an {{java.lang.UnsupportedOperationException: empty collection}} exception if > {{count}} is called on an empty instance, when they should return 0. > {code:scala} > import org.apache.spark.graphx.{Graph, Edge} > val graph = Graph.fromEdges(sc.emptyRDD[Edge[Unit]], 0) > graph.vertices.count > graph.edges.count > {code} > Running that code in a spark-shell: > {code:none} > scala> import org.apache.spark.graphx.{Graph, Edge} > import org.apache.spark.graphx.{Graph, Edge} > scala> val graph = Graph.fromEdges(sc.emptyRDD[Edge[Unit]], 0) > graph: org.apache.spark.graphx.Graph[Int,Unit] = > org.apache.spark.graphx.impl.GraphImpl@6879e983 > scala> graph.vertices.count > java.lang.UnsupportedOperationException: empty collection > at > org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$36.apply(RDD.scala:1031) > at > org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$36.apply(RDD.scala:1031) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1031) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) > at org.apache.spark.rdd.RDD.reduce(RDD.scala:1011) > at org.apache.spark.graphx.impl.VertexRDDImpl.count(VertexRDDImpl.scala:90) > ... 
49 elided > scala> graph.edges.count > java.lang.UnsupportedOperationException: empty collection > at > org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$36.apply(RDD.scala:1031) > at > org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$36.apply(RDD.scala:1031) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1031) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) > at org.apache.spark.rdd.RDD.reduce(RDD.scala:1011) > at org.apache.spark.graphx.impl.EdgeRDDImpl.count(EdgeRDDImpl.scala:90) > ... 49 elided > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
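The failure mode described in SPARK-26757 — `RDD.reduce` refusing an empty collection — can be reproduced outside Spark. Below is a minimal Java sketch (the class and method names are illustrative, not Spark's actual code): summing per-partition counts with a reduce that requires at least one element throws for the empty case, whereas folding from the identity 0, which is essentially what the fix does, returns 0.

```java
import java.util.List;

public class EmptyCount {
    // Mimics RDD.reduce: with no identity element, an empty input has no result,
    // so we surface the same error Spark's reduce raises.
    static long reduceCount(List<Long> partitionCounts) {
        return partitionCounts.stream()
                .reduce(Long::sum)
                .orElseThrow(() -> new UnsupportedOperationException("empty collection"));
    }

    // Mimics the fix: folding from the identity 0 handles the empty case naturally.
    static long foldCount(List<Long> partitionCounts) {
        return partitionCounts.stream().reduce(0L, Long::sum);
    }
}
```

With a non-empty input both methods agree; only the identity-less reduce fails on the empty graph.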
[jira] [Created] (SPARK-26805) Eliminate double checking of stringToDate and stringToTimestamp inputs
Maxim Gekk created SPARK-26805: -- Summary: Eliminate double checking of stringToDate and stringToTimestamp inputs Key: SPARK-26805 URL: https://issues.apache.org/jira/browse/SPARK-26805 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk The LocalDate and LocalTime classes, as well as the other java.time classes used inside of stringToTimestamp and stringToDate, already check their input parameters. There is no need to do that twice. The ticket aims to remove isInvalidDate() from DateTimeUtils and eliminate such double checking. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
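The redundancy this ticket describes rests on java.time performing its own validation. A small Java sketch (the helper name is hypothetical, not Spark code): `LocalDate.parse` rejects both malformed strings and impossible calendar dates itself, so a separate `isInvalidDate`-style pre-check would validate the same fields twice.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;

public class DateValidation {
    // java.time parsers validate their input themselves; catching the parse
    // exception is sufficient, with no separate pre-check needed.
    static boolean isParsableDate(String s) {
        try {
            LocalDate.parse(s);  // expects ISO-8601, e.g. "2019-02-04"
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }
}
```

Note that an impossible-but-well-formed date such as "2019-02-30" is rejected by the parser itself, which is exactly the check a redundant pre-validation would duplicate.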
[jira] [Resolved] (SPARK-26799) Make ANTLR v4 version consistent between Maven and SBT
[ https://issues.apache.org/jira/browse/SPARK-26799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-26799. - Resolution: Fixed Assignee: Chenxiao Mao Fix Version/s: 3.0.0 > Make ANTLR v4 version consistent between Maven and SBT > -- > > Key: SPARK-26799 > URL: https://issues.apache.org/jira/browse/SPARK-26799 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Assignee: Chenxiao Mao >Priority: Minor > Fix For: 3.0.0 > > > Currently ANTLR v4 versions used by Maven and SBT are slightly different. > Maven uses 4.7.1 while SBT uses 4.7. > * Maven(pom.xml): 4.7.1 > * SBT(project/SparkBuild): antlr4Version in Antlr4 := "4.7" > We should make Maven and SBT use a single version. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26734) StackOverflowError on WAL serialization caused by large receivedBlockQueue
[ https://issues.apache.org/jira/browse/SPARK-26734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26734: Assignee: (was: Apache Spark) > StackOverflowError on WAL serialization caused by large receivedBlockQueue > -- > > Key: SPARK-26734 > URL: https://issues.apache.org/jira/browse/SPARK-26734 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.3.1, 2.3.2, 2.4.0 > Environment: spark 2.4.0 streaming job > java 1.8 > scala 2.11.12 >Reporter: Ross M. Lodge >Priority: Major > > We encountered an intermittent StackOverflowError with a stack trace similar > to: > > {noformat} > Exception in thread "JobGenerator" java.lang.StackOverflowError > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509){noformat} > The name of the thread has been seen to be either "JobGenerator" or > "streaming-start", depending on when in the lifecycle of the job the problem > occurs. It appears to only occur in streaming jobs with checkpointing and > WAL enabled; this has prevented us from upgrading to v2.4.0. > > Via debugging, we tracked this down to allocateBlocksToBatch in > ReceivedBlockTracker: > {code:java} > /** > * Allocate all unallocated blocks to the given batch. > * This event will get written to the write ahead log (if enabled). 
> */ > def allocateBlocksToBatch(batchTime: Time): Unit = synchronized { > if (lastAllocatedBatchTime == null || batchTime > lastAllocatedBatchTime) { > val streamIdToBlocks = streamIds.map { streamId => > (streamId, getReceivedBlockQueue(streamId).clone()) > }.toMap > val allocatedBlocks = AllocatedBlocks(streamIdToBlocks) > if (writeToLog(BatchAllocationEvent(batchTime, allocatedBlocks))) { > streamIds.foreach(getReceivedBlockQueue(_).clear()) > timeToAllocatedBlocks.put(batchTime, allocatedBlocks) > lastAllocatedBatchTime = batchTime > } else { > logInfo(s"Possibly processed batch $batchTime needs to be processed > again in WAL recovery") > } > } else { > // This situation occurs when: > // 1. WAL is ended with BatchAllocationEvent, but without > BatchCleanupEvent, > // possibly processed batch job or half-processed batch job need to be > processed again, > // so the batchTime will be equal to lastAllocatedBatchTime. > // 2. Slow checkpointing makes recovered batch time older than WAL > recovered > // lastAllocatedBatchTime. > // This situation will only occurs in recovery time. 
> logInfo(s"Possibly processed batch $batchTime needs to be processed again > in WAL recovery") > } > } > {code} > Prior to 2.3.1, this code did > {code:java} > getReceivedBlockQueue(streamId).dequeueAll(x => true){code} > but it was changed as part of SPARK-23991 to > {code:java} > getReceivedBlockQueue(streamId).clone(){code} > We've not been able to reproduce this in a test of the actual above method, > but we've been able to produce a test that reproduces it by putting a lot of > values into the queue: > > {code:java} > class SerializationFailureTest extends FunSpec { > private val logger = LoggerFactory.getLogger(getClass) > private type ReceivedBlockQueue = mutable.Queue[ReceivedBlockInfo] > describe("Queue") { > it("should be serializable") { > runTest(1062) > } > it("should not be serializable") { > runTest(1063) > } > it("should DEFINITELY not be serializable") { > runTest(199952) > } > } > private def runTest(mx: Int): Array[Byte] = { > try { > val random = new scala.util.Random() > val queue = new ReceivedBlockQueue() > for (_ <- 0 until mx) { > queue += ReceivedBlockInfo( > streamId = 0, > numRecords = Some(random.nextInt(5)), > metadataOption = None, > blockStoreResult = WriteAheadLogBasedStoreResult( > blockId = StreamBlockId(0, random.nextInt()), > numRecords = Some(random.nextInt(5)), > walRecordHandle = FileBasedWriteAheadLogSegment( > path = > s"""hdfs://foo.bar.com:8080/spark/streaming/BAZ/7/receivedData/0/log-${random.nextInt()}-${random.nextInt()}""", > offset = random.nextLong(), > length = random.nextInt() > ) > ) > ) > } > val record = BatchAllocationEvent( > Time(154832040L),
[jira] [Assigned] (SPARK-26734) StackOverflowError on WAL serialization caused by large receivedBlockQueue
[ https://issues.apache.org/jira/browse/SPARK-26734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26734: Assignee: Apache Spark > StackOverflowError on WAL serialization caused by large receivedBlockQueue > -- > > Key: SPARK-26734 > URL: https://issues.apache.org/jira/browse/SPARK-26734 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.3.1, 2.3.2, 2.4.0 > Environment: spark 2.4.0 streaming job > java 1.8 > scala 2.11.12 >Reporter: Ross M. Lodge >Assignee: Apache Spark >Priority: Major > > We encountered an intermittent StackOverflowError with a stack trace similar > to: > > {noformat} > Exception in thread "JobGenerator" java.lang.StackOverflowError > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509){noformat} > The name of the thread has been seen to be either "JobGenerator" or > "streaming-start", depending on when in the lifecycle of the job the problem > occurs. It appears to only occur in streaming jobs with checkpointing and > WAL enabled; this has prevented us from upgrading to v2.4.0. > > Via debugging, we tracked this down to allocateBlocksToBatch in > ReceivedBlockTracker: > {code:java} > /** > * Allocate all unallocated blocks to the given batch. > * This event will get written to the write ahead log (if enabled). 
> */ > def allocateBlocksToBatch(batchTime: Time): Unit = synchronized { > if (lastAllocatedBatchTime == null || batchTime > lastAllocatedBatchTime) { > val streamIdToBlocks = streamIds.map { streamId => > (streamId, getReceivedBlockQueue(streamId).clone()) > }.toMap > val allocatedBlocks = AllocatedBlocks(streamIdToBlocks) > if (writeToLog(BatchAllocationEvent(batchTime, allocatedBlocks))) { > streamIds.foreach(getReceivedBlockQueue(_).clear()) > timeToAllocatedBlocks.put(batchTime, allocatedBlocks) > lastAllocatedBatchTime = batchTime > } else { > logInfo(s"Possibly processed batch $batchTime needs to be processed > again in WAL recovery") > } > } else { > // This situation occurs when: > // 1. WAL is ended with BatchAllocationEvent, but without > BatchCleanupEvent, > // possibly processed batch job or half-processed batch job need to be > processed again, > // so the batchTime will be equal to lastAllocatedBatchTime. > // 2. Slow checkpointing makes recovered batch time older than WAL > recovered > // lastAllocatedBatchTime. > // This situation will only occurs in recovery time. 
> logInfo(s"Possibly processed batch $batchTime needs to be processed again > in WAL recovery") > } > } > {code} > Prior to 2.3.1, this code did > {code:java} > getReceivedBlockQueue(streamId).dequeueAll(x => true){code} > but it was changed as part of SPARK-23991 to > {code:java} > getReceivedBlockQueue(streamId).clone(){code} > We've not been able to reproduce this in a test of the actual above method, > but we've been able to produce a test that reproduces it by putting a lot of > values into the queue: > > {code:java} > class SerializationFailureTest extends FunSpec { > private val logger = LoggerFactory.getLogger(getClass) > private type ReceivedBlockQueue = mutable.Queue[ReceivedBlockInfo] > describe("Queue") { > it("should be serializable") { > runTest(1062) > } > it("should not be serializable") { > runTest(1063) > } > it("should DEFINITELY not be serializable") { > runTest(199952) > } > } > private def runTest(mx: Int): Array[Byte] = { > try { > val random = new scala.util.Random() > val queue = new ReceivedBlockQueue() > for (_ <- 0 until mx) { > queue += ReceivedBlockInfo( > streamId = 0, > numRecords = Some(random.nextInt(5)), > metadataOption = None, > blockStoreResult = WriteAheadLogBasedStoreResult( > blockId = StreamBlockId(0, random.nextInt()), > numRecords = Some(random.nextInt(5)), > walRecordHandle = FileBasedWriteAheadLogSegment( > path = > s"""hdfs://foo.bar.com:8080/spark/streaming/BAZ/7/receivedData/0/log-${random.nextInt()}-${random.nextInt()}""", > offset = random.nextLong(), > length = random.nextInt() > ) > ) > ) > } > val record = BatchAllocationEvent( >
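The StackOverflowError reported in SPARK-26734 comes from default Java serialization recursing one stack frame per linked node: `ObjectOutputStream.writeObject0` calls itself for each object reference it follows. A standalone Java sketch of the mechanism (the `Node` class is illustrative, standing in for the cons cells behind `scala.collection.mutable.Queue`; it is not Spark code):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class DeepSerialization {
    // A singly linked node: default serialization follows `next` recursively,
    // consuming one stack frame per element in the chain.
    static class Node implements Serializable {
        Node next;
    }

    static byte[] serialize(Node head) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(head);
        }
        return bytes.toByteArray();
    }

    // Builds the chain iteratively, so only serialization is deep.
    static Node chain(int length) {
        Node head = new Node();
        Node cur = head;
        for (int i = 1; i < length; i++) {
            cur.next = new Node();
            cur = cur.next;
        }
        return head;
    }
}
```

A short chain serializes fine; a chain long enough to exceed the default thread stack (the ticket saw this with large receivedBlockQueues) throws StackOverflowError, matching the reported trace of repeating `writeSerialData`/`writeOrdinaryObject` frames.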
[jira] [Issue Comment Deleted] (SPARK-24432) Add support for dynamic resource allocation
[ https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-24432: --- Comment: was deleted (was: vanzin closed pull request #22722: [SPARK-24432][k8s] Add support for dynamic resource allocation on Kubernetes URL: https://github.com/apache/spark/pull/22722 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/kubernetes/KubernetesExternalShuffleClient.java b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/kubernetes/KubernetesExternalShuffleClient.java new file mode 100644 index 0..7135d1af5facd --- /dev/null +++ b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/kubernetes/KubernetesExternalShuffleClient.java @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.network.shuffle.kubernetes; + +import com.google.common.util.concurrent.ThreadFactoryBuilder; +import org.apache.spark.network.client.RpcResponseCallback; +import org.apache.spark.network.client.TransportClient; +import org.apache.spark.network.sasl.SecretKeyHolder; +import org.apache.spark.network.shuffle.ExternalShuffleClient; +import org.apache.spark.network.shuffle.protocol.RegisterDriver; +import org.apache.spark.network.shuffle.protocol.ShuffleServiceHeartbeat; +import org.apache.spark.network.util.TransportConf; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.nio.ByteBuffer; +import java.util.concurrent.Executors; +import java.util.concurrent.ScheduledExecutorService; +import java.util.concurrent.TimeUnit; + +/** + * A client for talking to the external shuffle service in Kubernetes cluster mode. + * + * This is used by the each Spark executor to register with a corresponding external + * shuffle service on the cluster. The purpose is for cleaning up shuffle files + * reliably if the application exits unexpectedly. + */ +public class KubernetesExternalShuffleClient extends ExternalShuffleClient { + private static final Logger logger = LoggerFactory +.getLogger(KubernetesExternalShuffleClient.class); + + private final ScheduledExecutorService heartbeaterThread = +Executors.newSingleThreadScheduledExecutor( + new ThreadFactoryBuilder() +.setDaemon(true) +.setNameFormat("kubernetes-external-shuffle-client-heartbeater") +.build()); + + /** + * Creates a Kubernetes external shuffle client that wraps the {@link ExternalShuffleClient}. + * Please refer to docs on {@link ExternalShuffleClient} for more information. 
+ */ + public KubernetesExternalShuffleClient( + TransportConf conf, + SecretKeyHolder secretKeyHolder, + boolean saslEnabled, + long registrationTimeoutMs) { +super(conf, secretKeyHolder, saslEnabled, registrationTimeoutMs); + } + + public void registerDriverWithShuffleService( + String host, + int port, + long heartbeatTimeoutMs, + long heartbeatIntervalMs) throws IOException, InterruptedException { +checkInit(); +ByteBuffer registerDriver = new RegisterDriver(appId, heartbeatTimeoutMs).toByteBuffer(); +TransportClient client = clientFactory.createClient(host, port); +client.sendRpc(registerDriver, new RegisterDriverCallback(client, heartbeatIntervalMs)); + } + + @Override + public void close() { +heartbeaterThread.shutdownNow(); +super.close(); + } + + private class RegisterDriverCallback implements RpcResponseCallback { +private final TransportClient client; +private final long heartbeatIntervalMs; + +private RegisterDriverCallback(TransportClient client, long heartbeatIntervalMs) { + this.client = client; + this.heartbeatIntervalMs = heartbeatIntervalMs; +} + +@Override +public void onSuccess(ByteBuffer response) { +
[jira] [Assigned] (SPARK-24432) Add support for dynamic resource allocation
[ https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-24432: -- Assignee: Marcelo Vanzin > Add support for dynamic resource allocation > --- > > Key: SPARK-24432 > URL: https://issues.apache.org/jira/browse/SPARK-24432 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.4.0, 3.0.0 >Reporter: Yinan Li >Assignee: Marcelo Vanzin >Priority: Major > > This is an umbrella ticket for work on adding support for dynamic resource > allocation into the Kubernetes mode. This requires a Kubernetes-specific > external shuffle service. The feature is available in our fork at > github.com/apache-spark-on-k8s/spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24432) Add support for dynamic resource allocation
[ https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-24432: -- Assignee: (was: Marcelo Vanzin) > Add support for dynamic resource allocation > --- > > Key: SPARK-24432 > URL: https://issues.apache.org/jira/browse/SPARK-24432 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.4.0, 3.0.0 >Reporter: Yinan Li >Priority: Major > > This is an umbrella ticket for work on adding support for dynamic resource > allocation into the Kubernetes mode. This requires a Kubernetes-specific > external shuffle service. The feature is available in our fork at > github.com/apache-spark-on-k8s/spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26804) Spark sql carries newline char from last csv column when imported
Raj created SPARK-26804: --- Summary: Spark sql carries newline char from last csv column when imported Key: SPARK-26804 URL: https://issues.apache.org/jira/browse/SPARK-26804 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0 Reporter: Raj I am trying to generate external SQL tables in Databricks using a Spark SQL query. Below is my query. It reads a CSV file and creates the external table, but the last column carries a trailing newline character. Is there a way to resolve this issue? %sql create table if not exists <> using CSV options ("header"="true", "inferschema"="true","multiLine"="true", "escape"='"') location -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
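One plausible mechanism for the trailing character described above is a CRLF-terminated file being split on '\n' alone, which leaves '\r' attached to the last column of every row. A minimal Java illustration of that failure mode (this is not Spark's actual CSV parser, just a sketch of the symptom):

```java
public class CrlfCsv {
    // Naive parsing: if the record was produced by splitting a CRLF file on
    // '\n' alone, the trailing '\r' stays glued to the last column.
    static String[] naiveParse(String line) {
        return line.split(",");
    }

    // Stripping the carriage return before splitting yields clean columns.
    static String[] parseTrimmed(String line) {
        String cleaned = line.endsWith("\r")
                ? line.substring(0, line.length() - 1)
                : line;
        return cleaned.split(",");
    }
}
```

If this is the cause, normalizing line endings in the source file (or letting the reader treat "\r\n" as the record terminator) avoids the stray character in the last column.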
[jira] [Assigned] (SPARK-26803) include sbin subdirectory in pyspark
[ https://issues.apache.org/jira/browse/SPARK-26803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26803: Assignee: (was: Apache Spark) > include sbin subdirectory in pyspark > > > Key: SPARK-26803 > URL: https://issues.apache.org/jira/browse/SPARK-26803 > Project: Spark > Issue Type: Improvement > Components: Build, PySpark >Affects Versions: 2.4.0, 3.0.0 >Reporter: Oliver Urs Lenz >Priority: Minor > Labels: build, easyfix, newbie, ready-to-commit, usability > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > At present, the `sbin` subdirectory is not included in Pyspark. This seems to > be unintentional, and it makes e.g. the discussion at > [https://spark.apache.org/docs/latest/monitoring.html|https://spark.apache.org/docs/latest/monitoring.html] > confusing. In fact, the sbin directory can be downloaded manually into the > pyspark directory, but it would be nice if it were included automatically. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26803) include sbin subdirectory in pyspark
[ https://issues.apache.org/jira/browse/SPARK-26803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-26803: -- Shepherd: (was: Sean Owen) > include sbin subdirectory in pyspark > > > Key: SPARK-26803 > URL: https://issues.apache.org/jira/browse/SPARK-26803 > Project: Spark > Issue Type: Improvement > Components: Build, PySpark >Affects Versions: 2.4.0, 3.0.0 >Reporter: Oliver Urs Lenz >Priority: Minor > Labels: build, easyfix, newbie, ready-to-commit, usability > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > At present, the `sbin` subdirectory is not included in Pyspark. This seems to > be unintentional, and it makes e.g. the discussion at > [https://spark.apache.org/docs/latest/monitoring.html|https://spark.apache.org/docs/latest/monitoring.html] > confusing. In fact, the sbin directory can be downloaded manually into the > pyspark directory, but it would be nice if it were included automatically. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26803) include sbin subdirectory in pyspark
[ https://issues.apache.org/jira/browse/SPARK-26803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26803: Assignee: Apache Spark > include sbin subdirectory in pyspark > > > Key: SPARK-26803 > URL: https://issues.apache.org/jira/browse/SPARK-26803 > Project: Spark > Issue Type: Improvement > Components: Build, PySpark >Affects Versions: 2.4.0, 3.0.0 >Reporter: Oliver Urs Lenz >Assignee: Apache Spark >Priority: Minor > Labels: build, easyfix, newbie, ready-to-commit, usability > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > At present, the `sbin` subdirectory is not included in Pyspark. This seems to > be unintentional, and it makes e.g. the discussion at > [https://spark.apache.org/docs/latest/monitoring.html|https://spark.apache.org/docs/latest/monitoring.html] > confusing. In fact, the sbin directory can be downloaded manually into the > pyspark directory, but it would be nice if it were included automatically. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26803) include sbin subdirectory in pyspark
Oliver Urs Lenz created SPARK-26803: --- Summary: include sbin subdirectory in pyspark Key: SPARK-26803 URL: https://issues.apache.org/jira/browse/SPARK-26803 Project: Spark Issue Type: Improvement Components: Build, PySpark Affects Versions: 2.4.0, 3.0.0 Reporter: Oliver Urs Lenz At present, the `sbin` subdirectory is not included in Pyspark. This seems to be unintentional, and it makes e.g. the discussion at [https://spark.apache.org/docs/latest/monitoring.html|https://spark.apache.org/docs/latest/monitoring.html] confusing. In fact, the sbin directory can be downloaded manually into the pyspark directory, but it would be nice if it were included automatically. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24736) --py-files not functional for non local URLs. It appears to pass non-local URL's into PYTHONPATH directly.
[ https://issues.apache.org/jira/browse/SPARK-24736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757663#comment-16757663 ] Oleg Frenkel commented on SPARK-24736: -- Supporting local files with --py-files would be great. By "local files" I mean files that are coming from the original machine outside of driver pod and not files available in the driver docker container. > --py-files not functional for non local URLs. It appears to pass non-local > URL's into PYTHONPATH directly. > -- > > Key: SPARK-24736 > URL: https://issues.apache.org/jira/browse/SPARK-24736 > Project: Spark > Issue Type: Bug > Components: Kubernetes, PySpark >Affects Versions: 2.4.0 > Environment: Recent 2.4.0 from master branch, submitted on Linux to a > KOPS Kubernetes cluster created on AWS. > >Reporter: Jonathan A Weaver >Priority: Minor > > My spark-submit > bin/spark-submit \ > --master > k8s://[https://internal-api-test-k8s-local-7afed8-796273878.us-east-1.elb.amazonaws.com|https://internal-api-test-k8s-local-7afed8-796273878.us-east-1.elb.amazonaws.com/] > \ > --deploy-mode cluster \ > --name pytest \ > --conf > spark.kubernetes.container.image=[412834075398.dkr.ecr.us-east-1.amazonaws.com/fids/pyspark-k8s:latest|http://412834075398.dkr.ecr.us-east-1.amazonaws.com/fids/pyspark-k8s:latest] > \ > --conf > [spark.kubernetes.driver.pod.name|http://spark.kubernetes.driver.pod.name/]=spark-pi-driver > \ > --conf > spark.kubernetes.authenticate.submission.caCertFile=[cluster.ca|http://cluster.ca/] > \ > --conf spark.kubernetes.authenticate.submission.oauthToken=$TOK \ > --conf spark.kubernetes.authenticate.driver.oauthToken=$TOK \ > --py-files "[https://s3.amazonaws.com/maxar-ids-fids/screw.zip]; \ > [https://s3.amazonaws.com/maxar-ids-fids/it.py] > > *screw.zip is successfully downloaded and placed in SparkFIles.getRootPath()* > 2018-07-01 07:33:43 INFO SparkContext:54 - Added file > [https://s3.amazonaws.com/maxar-ids-fids/screw.zip] at > 
[https://s3.amazonaws.com/maxar-ids-fids/screw.zip] with timestamp > 1530430423297 > 2018-07-01 07:33:43 INFO Utils:54 - Fetching > [https://s3.amazonaws.com/maxar-ids-fids/screw.zip] to > /var/data/spark-7aba748d-2bba-4015-b388-c2ba9adba81e/spark-0ed5a100-6efa-45ca-ad4c-d1e57af76ffd/userFiles-a053206e-33d9-4245-b587-f8ac26d4c240/fetchFileTemp1549645948768432992.tmp > *I print out the PYTHONPATH and PYSPARK_FILES environment variables from the > driver script:* > PYTHONPATH > /opt/spark/python/lib/pyspark.zip:/opt/spark/python/lib/py4j-0.10.7-src.zip:/opt/spark/jars/spark-core_2.11-2.4.0-SNAPSHOT.jar:/opt/spark/python/lib/pyspark.zip:/opt/spark/python/lib/py4j-*.zip:*[https://s3.amazonaws.com/maxar-ids-fids/screw.zip]* > PYSPARK_FILES [https://s3.amazonaws.com/maxar-ids-fids/screw.zip] > > *I print out sys.path* > ['/tmp/spark-fec3684b-8b63-4f43-91a4-2f2fa41a1914', > u'/var/data/spark-7aba748d-2bba-4015-b388-c2ba9adba81e/spark-0ed5a100-6efa-45ca-ad4c-d1e57af76ffd/userFiles-a053206e-33d9-4245-b587-f8ac26d4c240', > '/opt/spark/python/lib/pyspark.zip', > '/opt/spark/python/lib/py4j-0.10.7-src.zip', > '/opt/spark/jars/spark-core_2.11-2.4.0-SNAPSHOT.jar', > '/opt/spark/python/lib/py4j-*.zip', *'/opt/spark/work-dir/https', > '//[s3.amazonaws.com/maxar-ids-fids/screw.zip|http://s3.amazonaws.com/maxar-ids-fids/screw.zip]',* > '/usr/lib/python27.zip', '/usr/lib/python2.7', > '/usr/lib/python2.7/plat-linux2', '/usr/lib/python2.7/lib-tk', > '/usr/lib/python2.7/lib-old', '/usr/lib/python2.7/lib-dynload', > '/usr/lib/python2.7/site-packages'] > > *URL from PYTHONFILES gets placed in sys.path verbatim with obvious results.* > > *Dump of spark config from container.* > Spark config dumped: > [(u'spark.master', > u'k8s://[https://internal-api-test-k8s-local-7afed8-796273878.us-east-1.elb.amazonaws.com|https://internal-api-test-k8s-local-7afed8-796273878.us-east-1.elb.amazonaws.com/]'), > (u'spark.kubernetes.authenticate.submission.oauthToken', > u''), > 
(u'spark.kubernetes.authenticate.driver.oauthToken', > u''), (u'spark.kubernetes.executor.podNamePrefix', > u'pytest-1530430411996'), (u'spark.kubernetes.memoryOverheadFactor', u'0.4'), > (u'spark.driver.blockManager.port', u'7079'), > (u'[spark.app.id|http://spark.app.id/]', u'spark-application-1530430424433'), > (u'[spark.app.name|http://spark.app.name/]', u'pytest'), > (u'[spark.executor.id|http://spark.executor.id/]', u'driver'), > (u'spark.driver.host', u'pytest-1530430411996-driver-svc.default.svc'), > (u'spark.kubernetes.container.image', >
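The sys.path dump above shows the core of the bug: the raw URL string was appended to PYTHONPATH, which then split it on ':' into '/opt/spark/work-dir/https' and '//s3.amazonaws.com/...'. A fix has to substitute the local path the file was fetched to under the Spark files root. The helper below is an illustrative sketch of that mapping, not Spark's actual code:

```python
import os
from urllib.parse import urlparse

def localize_py_files(py_files, files_root):
    """Map --py-files entries onto paths Python can import from.

    Remote entries are fetched by Spark into files_root, so the
    downloaded file's local path must go on PYTHONPATH -- never the
    raw URL, which is both unimportable and split on ':' by
    PYTHONPATH handling. Local entries pass through unchanged.
    """
    localized = []
    for entry in py_files:
        parsed = urlparse(entry)
        if parsed.scheme in ("http", "https", "s3", "s3a", "s3n", "hdfs", "ftp"):
            localized.append(
                os.path.join(files_root, os.path.basename(parsed.path)))
        else:
            localized.append(entry)
    return localized
```

Applied to the report above, the S3 URL for screw.zip would become `<userFiles dir>/screw.zip`, which is exactly where the log shows Spark fetched it.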
[jira] [Updated] (SPARK-26726) Synchronize the amount of memory used by the broadcast variable to the UI display
[ https://issues.apache.org/jira/browse/SPARK-26726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-26726: --- Fix Version/s: 2.3.3 > Synchronize the amount of memory used by the broadcast variable to the UI > display > --- > > Key: SPARK-26726 > URL: https://issues.apache.org/jira/browse/SPARK-26726 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: hantiantian >Assignee: hantiantian >Priority: Major > Fix For: 2.3.3, 2.4.1, 3.0.0 > > > The amount of memory used by the broadcast variable is not synchronized to > the UI display, > spark-sql> select /*+ broadcast(a)*/ a.id,b.id from a join b on a.id = b.id; > View the app's driver log: > 2019-01-25 16:45:23,726 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_4_piece0 in memory on 10.43.xx.xx:33907 (size: 6.6 KB, free: 2.5 GB) > 2019-01-25 16:45:23,727 INFO org.apache.spark.storage.BlockManagerInfo: > Added broadcast_4_piece0 in memory on 10.43.xx.xx:38399 (size: 6.6 KB, free: > 2.5 GB) > 2019-01-25 16:45:23,745 INFO org.apache.spark.storage.BlockManagerInfo: > Added broadcast_3_piece0 in memory on 10.43.xx.xx:33907 (size: 32.1 KB, free: > 2.5 GB) > 2019-01-25 16:45:23,749 INFO org.apache.spark.storage.BlockManagerInfo: > Added broadcast_3_piece0 in memory on 10.43.xx.xx:38399 (size: 32.1 KB, free: > 2.5 GB) > 2019-01-25 16:45:23,838 INFO org.apache.spark.storage.BlockManagerInfo: > Added broadcast_2_piece0 in memory on 10.43.xx.xx:38399 (size: 147.0 B, free: > 2.5 GB) > 2019-01-25 16:45:23,840 INFO org.apache.spark.storage.BlockManagerInfo: > Added broadcast_2_piece0 in memory on 10.43.xx.xx:33907 (size: 147.0 B, free: > 2.5 GB) > > Web UI does not have the use of memory, > ||Executor ID||Address||Status||RDD Blocks||Storage Memory||Disk > Used||Cores||Active Tasks||Failed Tasks||Complete Tasks||Total Tasks||Task > Time (GC Time)||Input||Shuffle Read||Shuffle Write||Logs||Thread Dump|| > 
|0|xxx:38399|Active|0|0.0 B / 2.7 GB|0.0 B|1|0|0|2|2|4 s (0.4 s)|8 B|0.0 > B|0.0 > B|[stdout|http://10.43.183.120:18085/logPage/?appId=app-20190125164426-0003=0=stdout] > > [stderr|http://10.43.183.120:18085/logPage/?appId=app-20190125164426-0003=0=stderr]|[Thread > Dump|http://10.43.183.121:4040/executors/threadDump/?executorId=0]| > |driver|xxx:47936|Active|0|0.0 B / 384.1 MB|0.0 B|0|0|0|0|0|0.0 ms (0.0 > ms)|0.0 B|0.0 B|0.0 B| |[Thread > Dump|http://10.43.183.121:4040/executors/threadDump/?executorId=driver]| > |1|xxx:47414|Active|0|0.0 B / 2.7 GB|0.0 B|1|0|0|0|0|0.0 ms (0.0 ms)|0.0 > B|0.0 B|0.0 > B|[stdout|http://10.43.183.121:18085/logPage/?appId=app-20190125164426-0003=1=stdout] > > [stderr|http://10.43.183.121:18085/logPage/?appId=app-20190125164426-0003=1=stderr]|[Thread > Dump|http://10.43.183.121:4040/executors/threadDump/?executorId=1]| > |2|xxx:33907|Active|0|0.0 B / 2.7 GB|0.0 B|1|0|0|2|2|4 s (0.2 s)|4 B|0.0 > B|0.0 > B|[stdout|http://10.43.183.122:18085/logPage/?appId=app-20190125164426-0003=2=stdout] > > [stderr|http://10.43.183.122:18085/logPage/?appId=app-20190125164426-0003=2=stderr]|[Thread > Dump|http://10.43.183.121:4040/executors/threadDump/?executorId=2]| > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
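One way to cross-check what the Executors page should display is the monitoring REST API, whose per-executor summaries carry a `memoryUsed` field (GET /api/v1/applications/&lt;app-id&gt;/executors). The parser below is a small sketch against an assumed sample payload:

```python
import json

def storage_memory_by_executor(executors_json):
    """Extract used storage memory (bytes) per executor from the JSON
    returned by the Spark REST API's /executors endpoint."""
    return {e["id"]: e["memoryUsed"] for e in json.loads(executors_json)}
```

If the broadcast blocks from the driver log were accounted for, the `memoryUsed` values here (and the Storage Memory column in the UI) would be non-zero after the broadcast join.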
[jira] [Updated] (SPARK-26802) CVE-2018-11760: Apache Spark local privilege escalation vulnerability
[ https://issues.apache.org/jira/browse/SPARK-26802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-26802: - Description: Severity: Important Vendor: The Apache Software Foundation Versions affected: All Spark 1.x, Spark 2.0.x, and Spark 2.1.x versions Spark 2.2.0 to 2.2.2 Spark 2.3.0 to 2.3.1 Description: When using PySpark , it's possible for a different local user to connect to the Spark application and impersonate the user running the Spark application. This affects versions 1.x, 2.0.x, 2.1.x, 2.2.0 to 2.2.2, and 2.3.0 to 2.3.1. Mitigation: 1.x, 2.0.x, 2.1.x, and 2.2.x users should upgrade to 2.2.3 or newer 2.3.x users should upgrade to 2.3.2 or newer Otherwise, affected users should avoid using PySpark in multi-user environments. Credit: This issue was reported by Luca Canali and Jose Carlos Luna Duran from CERN. References: https://spark.apache.org/security.html This was fixed by https://github.com/apache/spark/commit/15fc2372269159ea2556b028d4eb8860c4108650 (master / 2.4) https://github.com/apache/spark/commit/8080c937d3752aee2fd36f0045a057f7130f6fe4 (branch-2.3) https://github.com/apache/spark/commit/a5624c7ae29d6d49117dd78642879bf978212d30 (branch-2.2) was: Severity: Important Vendor: The Apache Software Foundation Versions affected: All Spark 1.x, Spark 2.0.x, and Spark 2.1.x versions Spark 2.2.0 to 2.2.2 Spark 2.3.0 to 2.3.1 Description: When using PySpark , it's possible for a different local user to connect to the Spark application and impersonate the user running the Spark application. This affects versions 1.x, 2.0.x, 2.1.x, 2.2.0 to 2.2.2, and 2.3.0 to 2.3.1. Mitigation: 1.x, 2.0.x, 2.1.x, and 2.2.x users should upgrade to 2.2.3 or newer 2.3.x users should upgrade to 2.3.2 or newer Otherwise, affected users should avoid using PySpark in multi-user environments. Credit: This issue was reported by Luca Canali and Jose Carlos Luna Duran from CERN. 
References: https://spark.apache.org/security.html This was fixed by https://github.com/apache/spark/commit/15fc2372269159ea2556b028d4eb8860c4108650 https://github.com/apache/spark/commit/8080c937d3752aee2fd36f0045a057f7130f6fe4 https://github.com/apache/spark/commit/a5624c7ae29d6d49117dd78642879bf978212d30 > CVE-2018-11760: Apache Spark local privilege escalation vulnerability > - > > Key: SPARK-26802 > URL: https://issues.apache.org/jira/browse/SPARK-26802 > Project: Spark > Issue Type: Bug > Components: PySpark, Security >Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.2 >Reporter: Imran Rashid >Assignee: Luca Canali >Priority: Blocker > Fix For: 2.2.3, 2.3.2, 2.4.0 > > > Severity: Important > Vendor: The Apache Software Foundation > Versions affected: > All Spark 1.x, Spark 2.0.x, and Spark 2.1.x versions > Spark 2.2.0 to 2.2.2 > Spark 2.3.0 to 2.3.1 > Description: > When using PySpark , it's possible for a different local user to connect to > the Spark application and impersonate the user running the Spark application. > This affects versions 1.x, 2.0.x, 2.1.x, 2.2.0 to 2.2.2, and 2.3.0 to 2.3.1. > Mitigation: > 1.x, 2.0.x, 2.1.x, and 2.2.x users should upgrade to 2.2.3 or newer > 2.3.x users should upgrade to 2.3.2 or newer > Otherwise, affected users should avoid using PySpark in multi-user > environments. > Credit: > This issue was reported by Luca Canali and Jose Carlos Luna Duran from CERN. > References: > https://spark.apache.org/security.html > This was fixed by > https://github.com/apache/spark/commit/15fc2372269159ea2556b028d4eb8860c4108650 > (master / 2.4) > https://github.com/apache/spark/commit/8080c937d3752aee2fd36f0045a057f7130f6fe4 > (branch-2.3) > https://github.com/apache/spark/commit/a5624c7ae29d6d49117dd78642879bf978212d30 > (branch-2.2) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26802) CVE-2018-11760: Apache Spark local privilege escalation vulnerability
Imran Rashid created SPARK-26802: Summary: CVE-2018-11760: Apache Spark local privilege escalation vulnerability Key: SPARK-26802 URL: https://issues.apache.org/jira/browse/SPARK-26802 Project: Spark Issue Type: Bug Components: Security, PySpark Affects Versions: 2.2.2, 2.1.3, 2.0.2, 1.6.3 Reporter: Imran Rashid Assignee: Luca Canali Fix For: 2.4.0, 2.3.2, 2.2.3 Severity: Important Vendor: The Apache Software Foundation Versions affected: All Spark 1.x, Spark 2.0.x, and Spark 2.1.x versions Spark 2.2.0 to 2.2.2 Spark 2.3.0 to 2.3.1 Description: When using PySpark , it's possible for a different local user to connect to the Spark application and impersonate the user running the Spark application. This affects versions 1.x, 2.0.x, 2.1.x, 2.2.0 to 2.2.2, and 2.3.0 to 2.3.1. Mitigation: 1.x, 2.0.x, 2.1.x, and 2.2.x users should upgrade to 2.2.3 or newer 2.3.x users should upgrade to 2.3.2 or newer Otherwise, affected users should avoid using PySpark in multi-user environments. Credit: This issue was reported by Luca Canali and Jose Carlos Luna Duran from CERN. References: https://spark.apache.org/security.html This was fixed by https://github.com/apache/spark/commit/15fc2372269159ea2556b028d4eb8860c4108650 https://github.com/apache/spark/commit/8080c937d3752aee2fd36f0045a057f7130f6fe4 https://github.com/apache/spark/commit/a5624c7ae29d6d49117dd78642879bf978212d30 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26802) CVE-2018-11760: Apache Spark local privilege escalation vulnerability
[ https://issues.apache.org/jira/browse/SPARK-26802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-26802: - Description: Severity: Important Vendor: The Apache Software Foundation Versions affected: All Spark 1.x, Spark 2.0.x, and Spark 2.1.x versions Spark 2.2.0 to 2.2.2 Spark 2.3.0 to 2.3.1 Description: When using PySpark , it's possible for a different local user to connect to the Spark application and impersonate the user running the Spark application. This affects versions 1.x, 2.0.x, 2.1.x, 2.2.0 to 2.2.2, and 2.3.0 to 2.3.1. Mitigation: 1.x, 2.0.x, 2.1.x, and 2.2.x users should upgrade to 2.2.3 or newer 2.3.x users should upgrade to 2.3.2 or newer Otherwise, affected users should avoid using PySpark in multi-user environments. Credit: This issue was reported by Luca Canali and Jose Carlos Luna Duran from CERN. References: https://spark.apache.org/security.html This was fixed by master / 2.4: https://github.com/apache/spark/commit/15fc2372269159ea2556b028d4eb8860c4108650 https://github.com/apache/spark/commit/0df6bf882907d7d76572f513168a144067d0e0ec branch-2.3 https://github.com/apache/spark/commit/8080c937d3752aee2fd36f0045a057f7130f6fe branch-2.2 https://github.com/apache/spark/commit/a5624c7ae29d6d49117dd78642879bf978212d30 branch-2.1 (note that this does not exist in any release) https://github.com/apache/spark/commit/b2e0f68f615cbe2cf74f9813ece76c311fe8e911 was: Severity: Important Vendor: The Apache Software Foundation Versions affected: All Spark 1.x, Spark 2.0.x, and Spark 2.1.x versions Spark 2.2.0 to 2.2.2 Spark 2.3.0 to 2.3.1 Description: When using PySpark , it's possible for a different local user to connect to the Spark application and impersonate the user running the Spark application. This affects versions 1.x, 2.0.x, 2.1.x, 2.2.0 to 2.2.2, and 2.3.0 to 2.3.1. 
Mitigation: 1.x, 2.0.x, 2.1.x, and 2.2.x users should upgrade to 2.2.3 or newer 2.3.x users should upgrade to 2.3.2 or newer Otherwise, affected users should avoid using PySpark in multi-user environments. Credit: This issue was reported by Luca Canali and Jose Carlos Luna Duran from CERN. References: https://spark.apache.org/security.html This was fixed by https://github.com/apache/spark/commit/15fc2372269159ea2556b028d4eb8860c4108650 (master / 2.4) https://github.com/apache/spark/commit/8080c937d3752aee2fd36f0045a057f7130f6fe4 (branch-2.3) https://github.com/apache/spark/commit/a5624c7ae29d6d49117dd78642879bf978212d30 (branch-2.2) > CVE-2018-11760: Apache Spark local privilege escalation vulnerability > - > > Key: SPARK-26802 > URL: https://issues.apache.org/jira/browse/SPARK-26802 > Project: Spark > Issue Type: Bug > Components: PySpark, Security >Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.2 >Reporter: Imran Rashid >Assignee: Luca Canali >Priority: Blocker > Fix For: 2.2.3, 2.3.2, 2.4.0 > > > Severity: Important > Vendor: The Apache Software Foundation > Versions affected: > All Spark 1.x, Spark 2.0.x, and Spark 2.1.x versions > Spark 2.2.0 to 2.2.2 > Spark 2.3.0 to 2.3.1 > Description: > When using PySpark , it's possible for a different local user to connect to > the Spark application and impersonate the user running the Spark application. > This affects versions 1.x, 2.0.x, 2.1.x, 2.2.0 to 2.2.2, and 2.3.0 to 2.3.1. > Mitigation: > 1.x, 2.0.x, 2.1.x, and 2.2.x users should upgrade to 2.2.3 or newer > 2.3.x users should upgrade to 2.3.2 or newer > Otherwise, affected users should avoid using PySpark in multi-user > environments. > Credit: > This issue was reported by Luca Canali and Jose Carlos Luna Duran from CERN. 
> References: > https://spark.apache.org/security.html > This was fixed by > master / 2.4: > https://github.com/apache/spark/commit/15fc2372269159ea2556b028d4eb8860c4108650 > https://github.com/apache/spark/commit/0df6bf882907d7d76572f513168a144067d0e0ec > branch-2.3 > https://github.com/apache/spark/commit/8080c937d3752aee2fd36f0045a057f7130f6fe4 > branch-2.2 > https://github.com/apache/spark/commit/a5624c7ae29d6d49117dd78642879bf978212d30 > > branch-2.1 (note that this does not exist in any release) > https://github.com/apache/spark/commit/b2e0f68f615cbe2cf74f9813ece76c311fe8e911 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26802) CVE-2018-11760: Apache Spark local privilege escalation vulnerability
[ https://issues.apache.org/jira/browse/SPARK-26802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-26802. -- Resolution: Fixed > CVE-2018-11760: Apache Spark local privilege escalation vulnerability > - > > Key: SPARK-26802 > URL: https://issues.apache.org/jira/browse/SPARK-26802 > Project: Spark > Issue Type: Bug > Components: PySpark, Security >Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.2 >Reporter: Imran Rashid >Assignee: Luca Canali >Priority: Blocker > Fix For: 2.4.0, 2.3.2, 2.2.3 > > > Severity: Important > Vendor: The Apache Software Foundation > Versions affected: > All Spark 1.x, Spark 2.0.x, and Spark 2.1.x versions > Spark 2.2.0 to 2.2.2 > Spark 2.3.0 to 2.3.1 > Description: > When using PySpark , it's possible for a different local user to connect to > the Spark application and impersonate the user running the Spark application. > This affects versions 1.x, 2.0.x, 2.1.x, 2.2.0 to 2.2.2, and 2.3.0 to 2.3.1. > Mitigation: > 1.x, 2.0.x, 2.1.x, and 2.2.x users should upgrade to 2.2.3 or newer > 2.3.x users should upgrade to 2.3.2 or newer > Otherwise, affected users should avoid using PySpark in multi-user > environments. > Credit: > This issue was reported by Luca Canali and Jose Carlos Luna Duran from CERN. > References: > https://spark.apache.org/security.html > This was fixed by > https://github.com/apache/spark/commit/15fc2372269159ea2556b028d4eb8860c4108650 > https://github.com/apache/spark/commit/8080c937d3752aee2fd36f0045a057f7130f6fe4 > https://github.com/apache/spark/commit/a5624c7ae29d6d49117dd78642879bf978212d30 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26726) Synchronize the amount of memory used by the broadcast variable to the UI display
[ https://issues.apache.org/jira/browse/SPARK-26726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-26726: -- Assignee: hantiantian > Synchronize the amount of memory used by the broadcast variable to the UI > display > --- > > Key: SPARK-26726 > URL: https://issues.apache.org/jira/browse/SPARK-26726 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: hantiantian >Assignee: hantiantian >Priority: Major > > The amount of memory used by the broadcast variable is not synchronized to > the UI display, > spark-sql> select /*+ broadcast(a)*/ a.id,b.id from a join b on a.id = b.id; > View the app's driver log: > 2019-01-25 16:45:23,726 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_4_piece0 in memory on 10.43.xx.xx:33907 (size: 6.6 KB, free: 2.5 GB) > 2019-01-25 16:45:23,727 INFO org.apache.spark.storage.BlockManagerInfo: > Added broadcast_4_piece0 in memory on 10.43.xx.xx:38399 (size: 6.6 KB, free: > 2.5 GB) > 2019-01-25 16:45:23,745 INFO org.apache.spark.storage.BlockManagerInfo: > Added broadcast_3_piece0 in memory on 10.43.xx.xx:33907 (size: 32.1 KB, free: > 2.5 GB) > 2019-01-25 16:45:23,749 INFO org.apache.spark.storage.BlockManagerInfo: > Added broadcast_3_piece0 in memory on 10.43.xx.xx:38399 (size: 32.1 KB, free: > 2.5 GB) > 2019-01-25 16:45:23,838 INFO org.apache.spark.storage.BlockManagerInfo: > Added broadcast_2_piece0 in memory on 10.43.xx.xx:38399 (size: 147.0 B, free: > 2.5 GB) > 2019-01-25 16:45:23,840 INFO org.apache.spark.storage.BlockManagerInfo: > Added broadcast_2_piece0 in memory on 10.43.xx.xx:33907 (size: 147.0 B, free: > 2.5 GB) > > Web UI does not have the use of memory, > ||Executor ID||Address||Status||RDD Blocks||Storage Memory||Disk > Used||Cores||Active Tasks||Failed Tasks||Complete Tasks||Total Tasks||Task > Time (GC Time)||Input||Shuffle Read||Shuffle Write||Logs||Thread Dump|| > |0|xxx:38399|Active|0|0.0 B / 2.7 GB|0.0 
B|1|0|0|2|2|4 s (0.4 s)|8 B|0.0 > B|0.0 > B|[stdout|http://10.43.183.120:18085/logPage/?appId=app-20190125164426-0003=0=stdout] > > [stderr|http://10.43.183.120:18085/logPage/?appId=app-20190125164426-0003=0=stderr]|[Thread > Dump|http://10.43.183.121:4040/executors/threadDump/?executorId=0]| > |driver|xxx:47936|Active|0|0.0 B / 384.1 MB|0.0 B|0|0|0|0|0|0.0 ms (0.0 > ms)|0.0 B|0.0 B|0.0 B| |[Thread > Dump|http://10.43.183.121:4040/executors/threadDump/?executorId=driver]| > |1|xxx:47414|Active|0|0.0 B / 2.7 GB|0.0 B|1|0|0|0|0|0.0 ms (0.0 ms)|0.0 > B|0.0 B|0.0 > B|[stdout|http://10.43.183.121:18085/logPage/?appId=app-20190125164426-0003=1=stdout] > > [stderr|http://10.43.183.121:18085/logPage/?appId=app-20190125164426-0003=1=stderr]|[Thread > Dump|http://10.43.183.121:4040/executors/threadDump/?executorId=1]| > |2|xxx:33907|Active|0|0.0 B / 2.7 GB|0.0 B|1|0|0|2|2|4 s (0.2 s)|4 B|0.0 > B|0.0 > B|[stdout|http://10.43.183.122:18085/logPage/?appId=app-20190125164426-0003=2=stdout] > > [stderr|http://10.43.183.122:18085/logPage/?appId=app-20190125164426-0003=2=stderr]|[Thread > Dump|http://10.43.183.121:4040/executors/threadDump/?executorId=2]| > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26801) Spark unable to read valid avro types
Dhruve Ashar created SPARK-26801: Summary: Spark unable to read valid avro types Key: SPARK-26801 URL: https://issues.apache.org/jira/browse/SPARK-26801 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Dhruve Ashar Currently the external avro package only reads Avro schemas whose top-level type is a record. This is probably because of the representation of InternalRow in Spark SQL. As a result, if an Avro file contains anything other than a sequence of records, Spark fails to read it. We faced this issue earlier while trying to read primitive types, and encountered it again while trying to read an array of records. Below are code examples that try to read valid Avro data, with the resulting stack traces. {code:java} spark.read.format("avro").load("avroTypes/randomInt.avro").show java.lang.RuntimeException: Avro schema cannot be converted to a Spark SQL StructType: "int" at org.apache.spark.sql.avro.AvroFileFormat.inferSchema(AvroFileFormat.scala:95) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$6.apply(DataSource.scala:180) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$6.apply(DataSource.scala:180) at scala.Option.orElse(Option.scala:289) at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:179) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373) at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) ... 
49 elided == scala> spark.read.format("avro").load("avroTypes/randomEnum.avro").show java.lang.RuntimeException: Avro schema cannot be converted to a Spark SQL StructType: { "type" : "enum", "name" : "Suit", "symbols" : [ "SPADES", "HEARTS", "DIAMONDS", "CLUBS" ] } at org.apache.spark.sql.avro.AvroFileFormat.inferSchema(AvroFileFormat.scala:95) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$6.apply(DataSource.scala:180) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$6.apply(DataSource.scala:180) at scala.Option.orElse(Option.scala:289) at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:179) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373) at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) ... 49 elided {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
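Both failures share one cause: only a top-level record schema can be mapped onto a Spark SQL StructType. A quick pre-flight check on a schema is easy to sketch; the helper below only inspects the schema JSON (it is illustrative and does not touch the Spark or Avro libraries):

```python
import json

def is_spark_readable_avro(schema_json):
    """Return True when the top-level Avro schema is a record, the only
    shape Spark's avro reader (as of 2.4) converts to a StructType.
    Primitives ("int"), enums, and arrays all fail as in the traces above.
    """
    schema = json.loads(schema_json)
    return isinstance(schema, dict) and schema.get("type") == "record"
```

Until non-record types are supported, data would have to be written with its value wrapped in a single-field record to be readable by Spark.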
[jira] [Resolved] (SPARK-26726) Synchronize the amount of memory used by the broadcast variable to the UI display
[ https://issues.apache.org/jira/browse/SPARK-26726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-26726. Resolution: Fixed Fix Version/s: 2.4.1 3.0.0 Issue resolved by pull request 23649 [https://github.com/apache/spark/pull/23649] > Synchronize the amount of memory used by the broadcast variable to the UI > display > --- > > Key: SPARK-26726 > URL: https://issues.apache.org/jira/browse/SPARK-26726 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: hantiantian >Assignee: hantiantian >Priority: Major > Fix For: 3.0.0, 2.4.1 > > > The amount of memory used by the broadcast variable is not synchronized to > the UI display, > spark-sql> select /*+ broadcast(a)*/ a.id,b.id from a join b on a.id = b.id; > View the app's driver log: > 2019-01-25 16:45:23,726 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_4_piece0 in memory on 10.43.xx.xx:33907 (size: 6.6 KB, free: 2.5 GB) > 2019-01-25 16:45:23,727 INFO org.apache.spark.storage.BlockManagerInfo: > Added broadcast_4_piece0 in memory on 10.43.xx.xx:38399 (size: 6.6 KB, free: > 2.5 GB) > 2019-01-25 16:45:23,745 INFO org.apache.spark.storage.BlockManagerInfo: > Added broadcast_3_piece0 in memory on 10.43.xx.xx:33907 (size: 32.1 KB, free: > 2.5 GB) > 2019-01-25 16:45:23,749 INFO org.apache.spark.storage.BlockManagerInfo: > Added broadcast_3_piece0 in memory on 10.43.xx.xx:38399 (size: 32.1 KB, free: > 2.5 GB) > 2019-01-25 16:45:23,838 INFO org.apache.spark.storage.BlockManagerInfo: > Added broadcast_2_piece0 in memory on 10.43.xx.xx:38399 (size: 147.0 B, free: > 2.5 GB) > 2019-01-25 16:45:23,840 INFO org.apache.spark.storage.BlockManagerInfo: > Added broadcast_2_piece0 in memory on 10.43.xx.xx:33907 (size: 147.0 B, free: > 2.5 GB) > > Web UI does not have the use of memory, > ||Executor ID||Address||Status||RDD Blocks||Storage Memory||Disk > Used||Cores||Active Tasks||Failed Tasks||Complete Tasks||Total 
Tasks||Task > Time (GC Time)||Input||Shuffle Read||Shuffle Write||Logs||Thread Dump|| > |0|xxx:38399|Active|0|0.0 B / 2.7 GB|0.0 B|1|0|0|2|2|4 s (0.4 s)|8 B|0.0 > B|0.0 > B|[stdout|http://10.43.183.120:18085/logPage/?appId=app-20190125164426-0003=0=stdout] > > [stderr|http://10.43.183.120:18085/logPage/?appId=app-20190125164426-0003=0=stderr]|[Thread > Dump|http://10.43.183.121:4040/executors/threadDump/?executorId=0]| > |driver|xxx:47936|Active|0|0.0 B / 384.1 MB|0.0 B|0|0|0|0|0|0.0 ms (0.0 > ms)|0.0 B|0.0 B|0.0 B| |[Thread > Dump|http://10.43.183.121:4040/executors/threadDump/?executorId=driver]| > |1|xxx:47414|Active|0|0.0 B / 2.7 GB|0.0 B|1|0|0|0|0|0.0 ms (0.0 ms)|0.0 > B|0.0 B|0.0 > B|[stdout|http://10.43.183.121:18085/logPage/?appId=app-20190125164426-0003=1=stdout] > > [stderr|http://10.43.183.121:18085/logPage/?appId=app-20190125164426-0003=1=stderr]|[Thread > Dump|http://10.43.183.121:4040/executors/threadDump/?executorId=1]| > |2|xxx:33907|Active|0|0.0 B / 2.7 GB|0.0 B|1|0|0|2|2|4 s (0.2 s)|4 B|0.0 > B|0.0 > B|[stdout|http://10.43.183.122:18085/logPage/?appId=app-20190125164426-0003=2=stdout] > > [stderr|http://10.43.183.122:18085/logPage/?appId=app-20190125164426-0003=2=stderr]|[Thread > Dump|http://10.43.183.121:4040/executors/threadDump/?executorId=2]| > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26744) Support schema validation in File Source V2
[ https://issues.apache.org/jira/browse/SPARK-26744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26744: Assignee: (was: Apache Spark) > Support schema validation in File Source V2 > --- > > Key: SPARK-26744 > URL: https://issues.apache.org/jira/browse/SPARK-26744 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26744) Support schema validation in File Source V2
[ https://issues.apache.org/jira/browse/SPARK-26744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26744: Assignee: Apache Spark > Support schema validation in File Source V2 > --- > > Key: SPARK-26744 > URL: https://issues.apache.org/jira/browse/SPARK-26744 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26744) Support schema validation in File Source V2
[ https://issues.apache.org/jira/browse/SPARK-26744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-26744: --- Summary: Support schema validation in File Source V2 (was: Create API supportDataType in File Source V2 framework) > Support schema validation in File Source V2 > --- > > Key: SPARK-26744 > URL: https://issues.apache.org/jira/browse/SPARK-26744 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26799) Make ANTLR v4 version consistent between Maven and SBT
[ https://issues.apache.org/jira/browse/SPARK-26799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26799: Assignee: (was: Apache Spark) > Make ANTLR v4 version consistent between Maven and SBT > -- > > Key: SPARK-26799 > URL: https://issues.apache.org/jira/browse/SPARK-26799 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Priority: Minor > > Currently ANTLR v4 versions used by Maven and SBT are slightly different. > Maven uses 4.7.1 while SBT uses 4.7. > * Maven(pom.xml): 4.7.1 > * SBT(project/SparkBuild): antlr4Version in Antlr4 := "4.7" > We should make Maven and SBT use a single version. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26800) JDBC - MySQL nullable option is ignored
[ https://issues.apache.org/jira/browse/SPARK-26800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francisco Miguel Biete Banon updated SPARK-26800: - Description: Spark 2.4.0 MySQL 5.7.21 (docker official MySQL image running with default config) Writing a dataframe with optionally null fields result in a table with NOT NULL attributes in MySQL. {code:java} import org.apache.spark.sql.types._ import org.apache.spark.sql.{Row, SaveMode} import java.sql.Timestamp val data = Seq[Row](Row(1, null, "Boston"), Row(2, null, "New York")) val schema = StructType( StructField("id", IntegerType, true) :: StructField("when", TimestampType, true) :: StructField("city", StringType, true) :: Nil) println(schema.toDDL) val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema) df.write.mode(SaveMode.Overwrite).jdbc(jdbcUrl, "temp_bug", jdbcProperties){code} Produces {code} CREATE TABLE `temp_bug` ( `id` int(11) DEFAULT NULL, `when` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, `city` text ) ENGINE=InnoDB DEFAULT CHARSET=latin1; {code} I would expect "when" column to be defined as nullable. was: Spark 2.4.0 MySQL 5.7.21 (docker official MySQL image running with default config) Writing a dataframe with optionally null fields result in a table with NOT NULL attributes in MySQL. 
{code:java} import org.apache.spark.sql.types._ import org.apache.spark.sql.{Row, SaveMode} import java.sql.Timestamp val data = Seq[Row](Row(1, null, "Boston"), Row(2, null, "New York")) val schema = StructType( StructField("id", IntegerType) :: StructField("when", TimestampType, true) :: StructField("city", StringType) :: Nil) println(schema.toDDL) val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema) df.write.mode(SaveMode.Overwrite).jdbc(jdbcUrl, "temp_bug", jdbcProperties){code} Produces {code} CREATE TABLE `temp_bug` ( `id` int(11) DEFAULT NULL, `when` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, `city` text ) ENGINE=InnoDB DEFAULT CHARSET=latin1; {code} I would expect "when" column to be defined as nullable. > JDBC - MySQL nullable option is ignored > --- > > Key: SPARK-26800 > URL: https://issues.apache.org/jira/browse/SPARK-26800 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Francisco Miguel Biete Banon >Priority: Minor > > Spark 2.4.0 > MySQL 5.7.21 (docker official MySQL image running with default config) > Writing a dataframe with optionally null fields result in a table with NOT > NULL attributes in MySQL. 
> {code:java} > import org.apache.spark.sql.types._ > import org.apache.spark.sql.{Row, SaveMode} > import java.sql.Timestamp > val data = Seq[Row](Row(1, null, "Boston"), Row(2, null, "New York")) > val schema = StructType( > StructField("id", IntegerType, true) :: > StructField("when", TimestampType, true) :: > StructField("city", StringType, true) :: Nil) > println(schema.toDDL) > val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema) > df.write.mode(SaveMode.Overwrite).jdbc(jdbcUrl, "temp_bug", > jdbcProperties){code} > Produces > {code} > CREATE TABLE `temp_bug` ( > `id` int(11) DEFAULT NULL, > `when` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE > CURRENT_TIMESTAMP, > `city` text > ) ENGINE=InnoDB DEFAULT CHARSET=latin1; > {code} > I would expect "when" column to be defined as nullable. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
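The nullability loss described above can be illustrated with a small sketch. This is a hypothetical helper, not Spark's actual JdbcUtils code; under MySQL's default settings a bare TIMESTAMP column becomes NOT NULL DEFAULT CURRENT_TIMESTAMP, so emitting an explicit NULL for nullable fields would avoid the reported behavior:

```python
# Hypothetical sketch (not Spark's JdbcUtils): building a MySQL column
# definition from a field's nullable flag. MySQL treats a bare TIMESTAMP
# column as NOT NULL DEFAULT CURRENT_TIMESTAMP under its default settings,
# so an explicit NULL constraint is needed to honor nullable = true.
def column_ddl(name, sql_type, nullable):
    constraint = "NULL" if nullable else "NOT NULL"
    return f"`{name}` {sql_type} {constraint}"

print(column_ddl("when", "TIMESTAMP", True))   # `when` TIMESTAMP NULL
print(column_ddl("id", "int(11)", False))      # `id` int(11) NOT NULL
```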
[jira] [Assigned] (SPARK-26799) Make ANTLR v4 version consistent between Maven and SBT
[ https://issues.apache.org/jira/browse/SPARK-26799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26799: Assignee: Apache Spark > Make ANTLR v4 version consistent between Maven and SBT > -- > > Key: SPARK-26799 > URL: https://issues.apache.org/jira/browse/SPARK-26799 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Assignee: Apache Spark >Priority: Minor > > Currently ANTLR v4 versions used by Maven and SBT are slightly different. > Maven uses 4.7.1 while SBT uses 4.7. > * Maven(pom.xml): 4.7.1 > * SBT(project/SparkBuild): antlr4Version in Antlr4 := "4.7" > We should make Maven and SBT use a single version. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26800) JDBC - MySQL nullable option is ignored
Francisco Miguel Biete Banon created SPARK-26800: Summary: JDBC - MySQL nullable option is ignored Key: SPARK-26800 URL: https://issues.apache.org/jira/browse/SPARK-26800 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0 Reporter: Francisco Miguel Biete Banon Spark 2.4.0 MySQL 5.7.21 (docker official MySQL image running with default config) Writing a dataframe with optionally null fields results in a table with NOT NULL attributes in MySQL. {code:java} import org.apache.spark.sql.types._ import org.apache.spark.sql.{Row, SaveMode} import java.sql.Timestamp val data = Seq[Row](Row(1, null, "Boston"), Row(2, null, "New York")) val schema = StructType( StructField("id", IntegerType) :: StructField("when", TimestampType, true) :: StructField("city", StringType) :: Nil) println(schema.toDDL) val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema) df.write.mode(SaveMode.Overwrite).jdbc(jdbcUrl, "temp_bug", jdbcProperties){code} Produces {code} CREATE TABLE `temp_bug` ( `id` int(11) DEFAULT NULL, `when` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, `city` text ) ENGINE=InnoDB DEFAULT CHARSET=latin1; {code} I would expect the "when" column to be defined as nullable. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26799) Make ANTLR v4 version consistent between Maven and SBT
Chenxiao Mao created SPARK-26799: Summary: Make ANTLR v4 version consistent between Maven and SBT Key: SPARK-26799 URL: https://issues.apache.org/jira/browse/SPARK-26799 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.4.0 Reporter: Chenxiao Mao Currently ANTLR v4 versions used by Maven and SBT are slightly different. Maven uses 4.7.1 while SBT uses 4.7. * Maven(pom.xml): 4.7.1 * SBT(project/SparkBuild): antlr4Version in Antlr4 := "4.7" We should make Maven and SBT use a single version. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26798) HandleNullInputsForUDF should trust nullability
[ https://issues.apache.org/jira/browse/SPARK-26798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26798: Assignee: Wenchen Fan (was: Apache Spark) > HandleNullInputsForUDF should trust nullability > --- > > Key: SPARK-26798 > URL: https://issues.apache.org/jira/browse/SPARK-26798 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26798) HandleNullInputsForUDF should trust nullability
[ https://issues.apache.org/jira/browse/SPARK-26798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26798: Assignee: Apache Spark (was: Wenchen Fan) > HandleNullInputsForUDF should trust nullability > --- > > Key: SPARK-26798 > URL: https://issues.apache.org/jira/browse/SPARK-26798 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26798) HandleNullInputsForUDF should trust nullability
Wenchen Fan created SPARK-26798: --- Summary: HandleNullInputsForUDF should trust nullability Key: SPARK-26798 URL: https://issues.apache.org/jira/browse/SPARK-26798 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17998) Reading Parquet files coalesces parts into too few in-memory partitions
[ https://issues.apache.org/jira/browse/SPARK-17998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757326#comment-16757326 ] Nicholas Resnick commented on SPARK-17998: -- Going to answer my question: it is in fact a function of the number of cores available. The key definition is here: [https://github.com/apache/spark/blob/c0632cec04e5b0f3fb3c3f27c21a2d3f3fbb4f7e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L86] ``` Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore)) ``` The first two values are configurable, and the third is roughly the size of the file divided by the number of cores (according to sc.defaultParallelism.) If defaultMaxSplitBytes > bytesPerCore > openCostInBytes, then maxSplitBytes will equal bytesPerCore, which means we should expect around one partition for each core. > Reading Parquet files coalesces parts into too few in-memory partitions > --- > > Key: SPARK-17998 > URL: https://issues.apache.org/jira/browse/SPARK-17998 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.0, 2.0.1 > Environment: Spark Standalone Cluster (not "local mode") > Windows 10 and Windows 7 > Python 3.x >Reporter: Shea Parkes >Priority: Major > > Reading a parquet ~file into a DataFrame is resulting in far too few > in-memory partitions. In prior versions of Spark, the resulting DataFrame > would have a number of partitions often equal to the number of parts in the > parquet folder. 
> Here's a minimal reproducible sample: > {quote} > df_first = session.range(start=1, end=1, numPartitions=13) > assert df_first.rdd.getNumPartitions() == 13 > assert session._sc.defaultParallelism == 6 > path_scrap = r"c:\scratch\scrap.parquet" > df_first.write.parquet(path_scrap) > df_second = session.read.parquet(path_scrap) > print(df_second.rdd.getNumPartitions()) > {quote} > The above shows only 7 partitions in the DataFrame that was created by > reading the Parquet back into memory for me. Why is it no longer just the > number of part files in the Parquet folder? (Which is 13 in the example > above.) > I'm filing this as a bug because it has gotten so bad that we can't work with > the underlying RDD without first repartitioning the DataFrame, which is > costly and wasteful. I really doubt this was the intended effect of moving > to Spark 2.0. > I've tried to research where the number of in-memory partitions is > determined, but my Scala skills have proven in-adequate. I'd be happy to dig > further if someone could point me in the right direction... -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
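The maxSplitBytes formula quoted in the comment above can be sketched in Python with illustrative numbers (the defaults shown are assumptions mirroring spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes, not values read from a real session):

```python
# Sketch of the maxSplitBytes formula from FilePartition.scala, quoted in the
# comment above. All numbers here are illustrative assumptions.
def max_split_bytes(default_max_split_bytes, open_cost_in_bytes, bytes_per_core):
    return min(default_max_split_bytes, max(open_cost_in_bytes, bytes_per_core))

default_max = 128 * 1024 * 1024   # assumed spark.sql.files.maxPartitionBytes
open_cost = 4 * 1024 * 1024       # assumed spark.sql.files.openCostInBytes
total_size = 60 * 1024 * 1024     # hypothetical total input size
cores = 6                         # e.g. sc.defaultParallelism
bytes_per_core = total_size // cores  # 10 MiB

split = max_split_bytes(default_max, open_cost, bytes_per_core)
# When defaultMaxSplitBytes > bytesPerCore > openCostInBytes, the split size
# equals bytesPerCore, i.e. roughly one partition per core.
print(split == bytes_per_core)  # True
```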
[jira] [Comment Edited] (SPARK-17998) Reading Parquet files coalesces parts into too few in-memory partitions
[ https://issues.apache.org/jira/browse/SPARK-17998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757326#comment-16757326 ] Nicholas Resnick edited comment on SPARK-17998 at 1/31/19 3:03 PM: --- Going to answer my question: maxSplitBytes is in fact a function of the number of cores available. The key definition is here: [https://github.com/apache/spark/blob/c0632cec04e5b0f3fb3c3f27c21a2d3f3fbb4f7e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L86] {code:java} Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore)) {code} The first two values are configurable, and the third is roughly the size of the file divided by the number of cores (according to sc.defaultParallelism.) If defaultMaxSplitBytes > bytesPerCore > openCostInBytes, then maxSplitBytes will equal bytesPerCore, which means we should expect around one partition for each core. was (Author: nresnick): Going to answer my question: it is in fact a function of the number of cores available. The key definition is here: [https://github.com/apache/spark/blob/c0632cec04e5b0f3fb3c3f27c21a2d3f3fbb4f7e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L86] {code:java} Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore)) {code} The first two values are configurable, and the third is roughly the size of the file divided by the number of cores (according to sc.defaultParallelism.) If defaultMaxSplitBytes > bytesPerCore > openCostInBytes, then maxSplitBytes will equal bytesPerCore, which means we should expect around one partition for each core. 
> Reading Parquet files coalesces parts into too few in-memory partitions > --- > > Key: SPARK-17998 > URL: https://issues.apache.org/jira/browse/SPARK-17998 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.0, 2.0.1 > Environment: Spark Standalone Cluster (not "local mode") > Windows 10 and Windows 7 > Python 3.x >Reporter: Shea Parkes >Priority: Major > > Reading a parquet ~file into a DataFrame is resulting in far too few > in-memory partitions. In prior versions of Spark, the resulting DataFrame > would have a number of partitions often equal to the number of parts in the > parquet folder. > Here's a minimal reproducible sample: > {quote} > df_first = session.range(start=1, end=1, numPartitions=13) > assert df_first.rdd.getNumPartitions() == 13 > assert session._sc.defaultParallelism == 6 > path_scrap = r"c:\scratch\scrap.parquet" > df_first.write.parquet(path_scrap) > df_second = session.read.parquet(path_scrap) > print(df_second.rdd.getNumPartitions()) > {quote} > The above shows only 7 partitions in the DataFrame that was created by > reading the Parquet back into memory for me. Why is it no longer just the > number of part files in the Parquet folder? (Which is 13 in the example > above.) > I'm filing this as a bug because it has gotten so bad that we can't work with > the underlying RDD without first repartitioning the DataFrame, which is > costly and wasteful. I really doubt this was the intended effect of moving > to Spark 2.0. > I've tried to research where the number of in-memory partitions is > determined, but my Scala skills have proven in-adequate. I'd be happy to dig > further if someone could point me in the right direction... -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26786) Handle to treat escaped newline characters('\r','\n') in spark csv
[ https://issues.apache.org/jira/browse/SPARK-26786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757327#comment-16757327 ] vishnuram selvaraj commented on SPARK-26786: We should be able to treat the escaped newlines very similarly to how we treat the escaped quotes in the data when reading the file into a dataframe, just as Python's CSV reader removes the escape characters in front of both quote and newline characters > Handle to treat escaped newline characters('\r','\n') in spark csv > -- > > Key: SPARK-26786 > URL: https://issues.apache.org/jira/browse/SPARK-26786 > Project: Spark > Issue Type: Bug > Components: Input/Output, PySpark, SQL >Affects Versions: 2.3.0 >Reporter: vishnuram selvaraj >Priority: Major > > There are some systems like AWS Redshift which write CSV files by escaping > newline characters('\r','\n') in addition to escaping the quote characters, > if they come as part of the data. > Redshift documentation > link([https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html)] and > below is their mention of escaping requirements in the mentioned link > ESCAPE > For CHAR and VARCHAR columns in delimited unload files, an escape character > (\{{}}) is placed before every occurrence of the following characters: > * Linefeed: {{\n}} > * Carriage return: {{\r}} > * The delimiter character specified for the unloaded data. > * The escape character: \{{}} > * A quote character: {{"}} or {{'}} (if both ESCAPE and ADDQUOTES are > specified in the UNLOAD command). > > *Problem statement:* > But the spark CSV reader doesn't have a handle to treat/remove the escape > characters in front of the newline characters in the data. > It would really help if we could add a feature to handle the escaped newline > characters through another parameter like (escapeNewline = 'true/false'). > *Example:* > Below are the details of my test data set up in a file.
> * The first record in that file has escaped windows newline character ( > r > n) > * The third record in that file has escaped unix newline character ( > n) > * The fifth record in that file has the escaped quote character (") > the file looks like below in vi editor: > > {code:java} > "1","this is \^M\ > line1"^M > "2","this is line2"^M > "3","this is \ > line3"^M > "4","this is \" line4"^M > "5","this is line5"^M{code} > > When I read the file in python's csv module with escape, it is able to remove > the added escape characters as you can see below, > > {code:java} > >>> with open('/tmp/test3.csv','r') as readCsv: > ... readFile = > csv.reader(readCsv,dialect='excel',escapechar='\\',quotechar='"',delimiter=',',doublequote=False) > ... for row in readFile: > ... print(row) > ... > ['1', 'this is \r\n line1'] > ['2', 'this is line2'] > ['3', 'this is \n line3'] > ['4', 'this is " line4'] > ['5', 'this is line5'] > {code} > But if I read the same file in spark-csv reader, the escape characters > infront of the newline characters are not removed.But the escape before the > (") is removed. > {code:java} > >>> redDf=spark.read.csv(path='file:///tmp/test3.csv',header='false',sep=',',quote='"',escape='\\',multiLine='true',ignoreLeadingWhiteSpace='true',ignoreTrailingWhiteSpace='true',mode='FAILFAST',inferSchema='false') > >>> redDf.show() > +---+--+ > |_c0| _c1| > +---+--+ > \ 1|this is \ > line1| > | 2| this is line2| > | 3| this is \ > line3| > | 4| this is " line4| > | 5| this is line5| > +---+--+ > {code} > *Expected result:* > {code:java} > +---+--+ > |_c0| _c1| > +---+--+ > | 1|this is > line1| > | 2| this is line2| > | 3| this is > line3| > | 4| this is " line4| > | 5| this is line5| > +---+--+ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
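Until such an option exists, one possible workaround sketch (an assumption, not a built-in Spark feature) is to strip the escape character in front of newline characters after parsing, for example applied per field through a UDF:

```python
import re

# Workaround sketch: remove the escape character that Redshift UNLOAD places
# before newline characters, after the CSV has already been parsed.
# This is an illustration, not a Spark option; in PySpark it could be wrapped
# in a UDF and applied to the affected columns.
def unescape_newlines(value):
    # "\<CR><LF>" -> "<CR><LF>", "\<LF>" -> "<LF>", "\<CR>" -> "<CR>"
    return re.sub(r'\\(\r\n|\r|\n)', r'\1', value)

cleaned = unescape_newlines('this is \\\r\n line1')
print(repr(cleaned))  # 'this is \r\n line1' (escape removed)
```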
[jira] [Comment Edited] (SPARK-17998) Reading Parquet files coalesces parts into too few in-memory partitions
[ https://issues.apache.org/jira/browse/SPARK-17998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757326#comment-16757326 ] Nicholas Resnick edited comment on SPARK-17998 at 1/31/19 3:02 PM: --- Going to answer my question: it is in fact a function of the number of cores available. The key definition is here: [https://github.com/apache/spark/blob/c0632cec04e5b0f3fb3c3f27c21a2d3f3fbb4f7e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L86] {code:java} Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore)) {code} The first two values are configurable, and the third is roughly the size of the file divided by the number of cores (according to sc.defaultParallelism.) If defaultMaxSplitBytes > bytesPerCore > openCostInBytes, then maxSplitBytes will equal bytesPerCore, which means we should expect around one partition for each core. was (Author: nresnick): Going to answer my question: it is in fact a function of the number of cores available. The key definition is here: [https://github.com/apache/spark/blob/c0632cec04e5b0f3fb3c3f27c21a2d3f3fbb4f7e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L86] ``` Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore)) ``` The first two values are configurable, and the third is roughly the size of the file divided by the number of cores (according to sc.defaultParallelism.) If defaultMaxSplitBytes > bytesPerCore > openCostInBytes, then maxSplitBytes will equal bytesPerCore, which means we should expect around one partition for each core. 
> Reading Parquet files coalesces parts into too few in-memory partitions > --- > > Key: SPARK-17998 > URL: https://issues.apache.org/jira/browse/SPARK-17998 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.0, 2.0.1 > Environment: Spark Standalone Cluster (not "local mode") > Windows 10 and Windows 7 > Python 3.x >Reporter: Shea Parkes >Priority: Major > > Reading a parquet ~file into a DataFrame is resulting in far too few > in-memory partitions. In prior versions of Spark, the resulting DataFrame > would have a number of partitions often equal to the number of parts in the > parquet folder. > Here's a minimal reproducible sample: > {quote} > df_first = session.range(start=1, end=1, numPartitions=13) > assert df_first.rdd.getNumPartitions() == 13 > assert session._sc.defaultParallelism == 6 > path_scrap = r"c:\scratch\scrap.parquet" > df_first.write.parquet(path_scrap) > df_second = session.read.parquet(path_scrap) > print(df_second.rdd.getNumPartitions()) > {quote} > The above shows only 7 partitions in the DataFrame that was created by > reading the Parquet back into memory for me. Why is it no longer just the > number of part files in the Parquet folder? (Which is 13 in the example > above.) > I'm filing this as a bug because it has gotten so bad that we can't work with > the underlying RDD without first repartitioning the DataFrame, which is > costly and wasteful. I really doubt this was the intended effect of moving > to Spark 2.0. > I've tried to research where the number of in-memory partitions is > determined, but my Scala skills have proven in-adequate. I'd be happy to dig > further if someone could point me in the right direction... -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25153) Improve error messages for columns with dots/periods
[ https://issues.apache.org/jira/browse/SPARK-25153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757286#comment-16757286 ] Mikhail commented on SPARK-25153: - Hello [~blavigne] Are you still working on this? Hello [~holdenk] Could you please provide the example of what you mean (current and expected behavior)? Thanks in advance > Improve error messages for columns with dots/periods > > > Key: SPARK-25153 > URL: https://issues.apache.org/jira/browse/SPARK-25153 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: holdenk >Priority: Trivial > Labels: starter > > When we fail to resolve a column name with a dot in it, and the column name > is present as a string literal the error message could mention using > backticks to have the string treated as a literal. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26673) File source V2 write: create framework and migrate ORC to it
[ https://issues.apache.org/jira/browse/SPARK-26673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-26673: --- Assignee: Gengliang Wang > File source V2 write: create framework and migrate ORC to it > > > Key: SPARK-26673 > URL: https://issues.apache.org/jira/browse/SPARK-26673 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26673) File source V2 write: create framework and migrate ORC to it
[ https://issues.apache.org/jira/browse/SPARK-26673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-26673. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23601 [https://github.com/apache/spark/pull/23601] > File source V2 write: create framework and migrate ORC to it > > > Key: SPARK-26673 > URL: https://issues.apache.org/jira/browse/SPARK-26673 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26797) Start using the new logical types API of Parquet 1.11.0 instead of the deprecated one
Zoltan Ivanfi created SPARK-26797: - Summary: Start using the new logical types API of Parquet 1.11.0 instead of the deprecated one Key: SPARK-26797 URL: https://issues.apache.org/jira/browse/SPARK-26797 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Zoltan Ivanfi The 1.11.0 release of parquet-mr will deprecate its logical type API in favour of a newly introduced one. The new API also introduces new subtypes for different timestamp semantics, support for which should be added to Spark in order to read those types correctly. At this point only a release candidate of parquet-mr 1.11.0 is available, but that already allows implementing and reviewing this change. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-26345) Parquet support Column indexes
[ https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757209#comment-16757209 ] Zoltan Ivanfi edited comment on SPARK-26345 at 1/31/19 1:02 PM: Please note that column indexes will automatically get utilized if [spark.sql.parquet.enableVectorizedReader|https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-vectorized-parquet-reader.html#spark.sql.parquet.enableVectorizedReader] = false. If spark.sql.parquet.enableVectorizedReader = true, on the other hand (which is the default), then column indexes could only be utilized by duplicating the internal logic of parquet-mr, which would be disproportionate effort. We, the developers of the column index feature in parquet-mr did not expect Spark to make this huge investment, and we would like to provide a vectorized API instead in a future release of parquet-mr. was (Author: zi): Please note that column indexes will automatically get utilized if [spark.sql.parquet.enableVectorizedReader|https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-vectorized-parquet-reader.html#spark.sql.parquet.enableVectorizedReader] = false. If spark.sql.parquet.enableVectorizedReader = true, on the other hand (which is the default), then column indexes could only be utilized by duplicating the internal logic of parquet-mr, which would be disproportionate effort. We, the developers of the column index feature did not expect Spark to make this huge investment, and we would like to provide a vectorized API instead in a future release. > Parquet support Column indexes > -- > > Key: SPARK-26345 > URL: https://issues.apache.org/jira/browse/SPARK-26345 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Parquet 1.11.0 supports column indexing. Spark can support this feature for > good read performance.
> More details: > https://issues.apache.org/jira/browse/PARQUET-1201 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26345) Parquet support Column indexes
[ https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757209#comment-16757209 ] Zoltan Ivanfi commented on SPARK-26345: --- Please note that column indexes will automatically get utilized if [spark.sql.parquet.enableVectorizedReader|https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-vectorized-parquet-reader.html#spark.sql.parquet.enableVectorizedReader] = false. If spark.sql.parquet.enableVectorizedReader = true, on the other hand (which is the default), then column indexes could only be utilized by duplicating the internal logic of parquet-mr, which would be disproportionate effort. We, the developers of the column index feature did not expect Spark to make this huge investment, and we would like to provide a vectorized API instead in a future release. > Parquet support Column indexes > -- > > Key: SPARK-26345 > URL: https://issues.apache.org/jira/browse/SPARK-26345 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Parquet 1.11.0 supports column indexing. Spark can support this feature for > good read performance. > More details: > https://issues.apache.org/jira/browse/PARQUET-1201 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
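A toy illustration of the page-skipping idea that column indexes (PARQUET-1201) enable; this models the concept only and is not parquet-mr's actual API:

```python
# Toy model (not parquet-mr code): each Parquet page carries min/max stats in
# the column index, so a reader can skip pages that cannot contain any row
# matching a range predicate, without decoding those pages at all.
def pages_to_read(page_stats, lo, hi):
    """page_stats: list of (min, max) per page; keep pages overlapping [lo, hi]."""
    return [i for i, (pmin, pmax) in enumerate(page_stats)
            if pmax >= lo and pmin <= hi]

stats = [(0, 9), (10, 19), (20, 29), (30, 39)]
print(pages_to_read(stats, 12, 27))  # [1, 2]: only two of four pages decoded
```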
[jira] [Commented] (SPARK-25136) unable to use HDFS checkpoint directories after driver restart
[ https://issues.apache.org/jira/browse/SPARK-25136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757199#comment-16757199 ] Gabor Somogyi commented on SPARK-25136: --- [~kerbylane] did you have time to check it? > unable to use HDFS checkpoint directories after driver restart > -- > > Key: SPARK-25136 > URL: https://issues.apache.org/jira/browse/SPARK-25136 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Robert Reid >Priority: Major > > I have been attempting to work around the issue first discussed at the end of > SPARK-20894. The problem that we are encountering is the inability of the > Spark Driver to process additional data after restart. Upon restart it > reprocesses batches from the initial run. But when processing the first batch > it crashes because of missing checkpoint files. > We have a structured streaming job running on a Spark Standalone cluster. We > restart the Spark Driver to renew the Kerberos token in use to access HDFS. > Excerpts of log lines from a run are included below. In short, when the new > driver starts it reports that it is going to use a new checkpoint directory > in HDFS. Workers report that they have loaded a StateStoreProviders that > matches the directory. But then the worker reports that it cannot read the > delta files. This causes the driver to crash. > The HDFS directories are present but empty. Further, the directory > permissions for the original and new checkpoint directories are the same. The > worker never crashes. > As mentioned in SPARK-20894, deleting the _spark_metadata directory makes > subsequent restarts succeed. > Here is a timeline of log records from a recent run. > A new run began at 00:29:21. These entries from a worker log look good. 
> {{18/08/16 00:30:21 INFO HDFSBackedStateStoreProvider: Retrieved version 0 of > HDFSStateStoreProvider[id = (op=0,part=0),dir = > hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0] > for update}} > {{18/08/16 00:30:23 INFO HDFSBackedStateStoreProvider: Committed version 1 > for > HDFSStateStore[id=(op=0,part=0),dir=hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0] > to file > hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0/1.delta}} > As the shutdown is occurring the worker reports > {{18/08/16 00:39:11 INFO HDFSBackedStateStoreProvider: Aborted version 29 for > HDFSStateStore[id=(op=0,part=0),dir=hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0]}} > The restart began at 00:39:38. > Driver log entries > {{18/08/16 00:39:51 INFO MicroBatchExecution: Starting [id = > e188d15f-e26a-48fd-9ce6-8c57ce53c2c1, runId = > b7ee0163-47db-4392-ab66-94d36ce63074]. 
Use > hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040 > to store the query checkpoint.}} > {{18/08/16 00:40:26 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 10, > 10.251.104.164, executor 3): java.lang.IllegalStateException: Error reading > delta file > hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta > of HDFSStateStoreProvider[id = (op=0,part=0),dir = > hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0]: > > hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta > does not exist}} > {{Caused by: > org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File > does not exist: > /projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta}} > > Worker log entries > {{18/08/16 00:40:26 INFO StateStore: Reported that the loaded instance >
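The log excerpts above show the restarted driver announcing a brand-new, UUID-named checkpoint directory and then failing to find `1.delta` under it. A hedged sketch, with hypothetical paths and sink, of pinning the query's checkpoint root explicitly so that a restarted driver resumes the same state store directory rather than generating a fresh one:

```scala
// Sketch only; paths, sink format, and DataFrame name are hypothetical.
// Setting checkpointLocation per query makes restarts reuse the same
// state directory instead of an auto-generated UUID-named one.
val query = eventsDF.writeStream
  .format("parquet")
  .option("path", "hdfs://securehdfs/projects/flowsnake/output") // hypothetical sink
  .option("checkpointLocation",
    "hdfs://securehdfs/projects/flowsnake/checkpoints/query-1")  // fixed across restarts
  .start()
```

Note this sketch addresses only the directory churn visible in the logs; if the underlying delta files are genuinely missing after shutdown (as the report suggests), a stable checkpoint path alone may not be sufficient.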
[jira] [Updated] (SPARK-26795) Retry remote fileSegmentManagedBuffer when creating inputStream failed during shuffle read phase
[ https://issues.apache.org/jira/browse/SPARK-26795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-26795: Labels: (was: shuffle) > Retry remote fileSegmentManagedBuffer when creating inputStream failed during > shuffle read phase > > > Key: SPARK-26795 > URL: https://issues.apache.org/jira/browse/SPARK-26795 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: feiwang >Priority: Major > Fix For: 2.3.0, 2.3.1, 2.3.2, 2.4.0 > > > There is a parameter spark.maxRemoteBlockSizeFetchToMem, which means that a > remote block will be fetched to disk when the size of the block is above this > threshold in bytes. > So during the shuffle read phase, a managedBuffer that throws an IOException may > be a remotely downloaded FileSegment and should be retried instead of > throwing FetchFailed directly.
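For reference, a configuration sketch (values are illustrative, not recommendations) showing the threshold in question alongside the existing shuffle retry knobs whose behavior the report wants extended to these disk-backed segments:

```
# Blocks larger than this are fetched to disk as FileSegments rather than
# held in memory (the default is effectively unlimited in these versions).
spark.maxRemoteBlockSizeFetchToMem  200m
# Existing retry knobs for fetch failures at the network layer; the issue
# argues IOExceptions on downloaded FileSegments deserve similar retries
# rather than an immediate FetchFailed.
spark.shuffle.io.maxRetries         3
spark.shuffle.io.retryWait          5s
```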
[jira] [Updated] (SPARK-26795) Retry remote fileSegmentManagedBuffer when creating inputStream failed during shuffle read phase
[ https://issues.apache.org/jira/browse/SPARK-26795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-26795: Target Version/s: 2.4.0, 2.3.2 (was: 2.3.2, 2.4.0) Labels: shuffle (was: ) > Retry remote fileSegmentManagedBuffer when creating inputStream failed during > shuffle read phase > > > Key: SPARK-26795 > URL: https://issues.apache.org/jira/browse/SPARK-26795 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: feiwang >Priority: Major > Labels: shuffle > Fix For: 2.3.0, 2.3.1, 2.3.2, 2.4.0 > > > There is a parameter spark.maxRemoteBlockSizeFetchToMem, which means that a > remote block will be fetched to disk when the size of the block is above this > threshold in bytes. > So during the shuffle read phase, a managedBuffer that throws an IOException may > be a remotely downloaded FileSegment and should be retried instead of > throwing FetchFailed directly.
[jira] [Assigned] (SPARK-24023) Built-in SQL Functions improvement in SparkR
[ https://issues.apache.org/jira/browse/SPARK-24023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-24023: Assignee: Huaxin Gao > Built-in SQL Functions improvement in SparkR > > > Key: SPARK-24023 > URL: https://issues.apache.org/jira/browse/SPARK-24023 > Project: Spark > Issue Type: Umbrella > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Huaxin Gao >Priority: Major > > This JIRA targets adding R functions corresponding to SPARK-23899. We have > usually been adding a function with Scala alone or with both Scala and Python > APIs. > It could get messy if there are duplicates on the R side of SPARK-23899. > A followup for each JIRA might be possible, but that again is messy to manage.
[jira] [Resolved] (SPARK-24023) Built-in SQL Functions improvement in SparkR
[ https://issues.apache.org/jira/browse/SPARK-24023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24023. -- Resolution: Done > Built-in SQL Functions improvement in SparkR > > > Key: SPARK-24023 > URL: https://issues.apache.org/jira/browse/SPARK-24023 > Project: Spark > Issue Type: Umbrella > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Huaxin Gao >Priority: Major > > This JIRA targets adding R functions corresponding to SPARK-23899. We have > usually been adding a function with Scala alone or with both Scala and Python > APIs. > It could get messy if there are duplicates on the R side of SPARK-23899. > A followup for each JIRA might be possible, but that again is messy to manage.
[jira] [Assigned] (SPARK-24779) Add map_concat / map_from_entries / an option in months_between UDF to disable rounding-off
[ https://issues.apache.org/jira/browse/SPARK-24779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-24779: Assignee: Huaxin Gao > Add map_concat / map_from_entries / an option in months_between UDF to > disable rounding-off > > > Key: SPARK-24779 > URL: https://issues.apache.org/jira/browse/SPARK-24779 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > > Add R versions of > * map_concat -SPARK-23936- > * map_from_entries SPARK-23934 > * an option in months_between UDF to disable rounding-off -SPARK-23902-
[jira] [Updated] (SPARK-24779) Add map_concat / map_from_entries / an option in months_between UDF to disable rounding-off
[ https://issues.apache.org/jira/browse/SPARK-24779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24779: - Summary: Add map_concat / map_from_entries / an option in months_between UDF to disable rounding-off (was: Add sequence / map_concat / map_from_entries / an option in months_between UDF to disable rounding-off) > Add map_concat / map_from_entries / an option in months_between UDF to > disable rounding-off > > > Key: SPARK-24779 > URL: https://issues.apache.org/jira/browse/SPARK-24779 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Huaxin Gao >Priority: Major > > Add R versions of > * map_concat -SPARK-23936- > * map_from_entries SPARK-23934 > * an option in months_between UDF to disable rounding-off -SPARK-23902-
[jira] [Resolved] (SPARK-24779) Add map_concat / map_from_entries / an option in months_between UDF to disable rounding-off
[ https://issues.apache.org/jira/browse/SPARK-24779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24779. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 21835 [https://github.com/apache/spark/pull/21835] > Add map_concat / map_from_entries / an option in months_between UDF to > disable rounding-off > > > Key: SPARK-24779 > URL: https://issues.apache.org/jira/browse/SPARK-24779 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.0.0 > > > Add R versions of > * map_concat -SPARK-23936- > * map_from_entries SPARK-23934 > * an option in months_between UDF to disable rounding-off -SPARK-23902-
[jira] [Updated] (SPARK-24779) Add sequence / map_concat / map_from_entries / an option in months_between UDF to disable rounding-off
[ https://issues.apache.org/jira/browse/SPARK-24779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24779: - Description: Add R versions of * map_concat -SPARK-23936- * map_from_entries SPARK-23934 * an option in months_between UDF to disable rounding-off -SPARK-23902- was: Add R versions of * sequence -SPARK-23927- * map_concat -SPARK-23936- * map_from_entries SPARK-23934 * an option in months_between UDF to disable rounding-off -SPARK-23902- > Add sequence / map_concat / map_from_entries / an option in months_between > UDF to disable rounding-off > --- > > Key: SPARK-24779 > URL: https://issues.apache.org/jira/browse/SPARK-24779 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Huaxin Gao >Priority: Major > > Add R versions of > * map_concat -SPARK-23936- > * map_from_entries SPARK-23934 > * an option in months_between UDF to disable rounding-off -SPARK-23902-
[jira] [Updated] (SPARK-26796) Testcases failing with "org.apache.hadoop.fs.ChecksumException" error
[ https://issues.apache.org/jira/browse/SPARK-26796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anuja Jakhade updated SPARK-26796: -- Environment: Ubuntu 16.04 Java Version openjdk version "1.8.0_192" OpenJDK Runtime Environment (build 1.8.0_192-b12_openj9) Eclipse OpenJ9 VM (build openj9-0.11.0, JRE 1.8.0 Linux s390x-64-Bit Compressed References 20181107_80 (JIT enabled, AOT enabled) OpenJ9 - 090ff9dcd OMR - ea548a66 JCL - b5a3affe73 based on jdk8u192-b12) Hadoop Version Hadoop 2.7.1 Subversion Unknown -r Unknown Compiled by test on 2019-01-29T09:09Z Compiled with protoc 2.5.0 From source with checksum 5e94a235f9a71834e2eb73fb36ee873f This command was run using /home/test/hadoop-release-2.7.1/hadoop-dist/target/hadoop-2.7.1/share/hadoop/common/hadoop-common-2.7.1.jar was: Ubuntu 16.04 Java Version openjdk version "1.8.0_191" OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-0ubuntu0.16.04.1-b12) OpenJDK 64-Bit Zero VM (build 25.191-b12, interpreted mode) Hadoop Version Hadoop 2.7.1 Subversion Unknown -r Unknown Compiled by test on 2019-01-29T09:09Z Compiled with protoc 2.5.0 From source with checksum 5e94a235f9a71834e2eb73fb36ee873f This command was run using /home/test/hadoop-release-2.7.1/hadoop-dist/target/hadoop-2.7.1/share/hadoop/common/hadoop-common-2.7.1.jar > Testcases failing with "org.apache.hadoop.fs.ChecksumException" error > - > > Key: SPARK-26796 > URL: https://issues.apache.org/jira/browse/SPARK-26796 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.3.2, 2.4.0 > Environment: Ubuntu 16.04 > Java Version > openjdk version "1.8.0_192" > OpenJDK Runtime Environment (build 1.8.0_192-b12_openj9) > Eclipse OpenJ9 VM (build openj9-0.11.0, JRE 1.8.0 Linux s390x-64-Bit > Compressed References 20181107_80 (JIT enabled, AOT enabled) > OpenJ9 - 090ff9dcd > OMR - ea548a66 > JCL - b5a3affe73 based on jdk8u192-b12) > > Hadoop Version > Hadoop 2.7.1 > Subversion Unknown -r Unknown > Compiled by test on 
2019-01-29T09:09Z > Compiled with protoc 2.5.0 > From source with checksum 5e94a235f9a71834e2eb73fb36ee873f > This command was run using > /home/test/hadoop-release-2.7.1/hadoop-dist/target/hadoop-2.7.1/share/hadoop/common/hadoop-common-2.7.1.jar > > > >Reporter: Anuja Jakhade >Priority: Major > > Observing test case failures due to Checksum error > Below is the error log > [ERROR] checkpointAndComputation(test.org.apache.spark.JavaAPISuite) Time > elapsed: 1.232 s <<< ERROR! > org.apache.spark.SparkException: > Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most > recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost, executor > driver): org.apache.hadoop.fs.ChecksumException: Checksum error: > file:/home/test/spark/core/target/tmp/1548319689411-0/fd0ba388-539c-49aa-bf76-e7d50aa2d1fc/rdd-0/part-0 > at 0 exp: 222499834 got: 1400184476 > at org.apache.hadoop.fs.FSInputChecker.verifySums(FSInputChecker.java:323) > at > org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:279) > at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:214) > at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:232) > at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:196) > at java.io.DataInputStream.read(DataInputStream.java:149) > at > java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2769) > at > java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2785) > at > java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3262) > at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:968) > at java.io.ObjectInputStream.(ObjectInputStream.java:390) > at > org.apache.spark.serializer.JavaDeserializationStream$$anon$1.(JavaSerializer.scala:63) > at > org.apache.spark.serializer.JavaDeserializationStream.(JavaSerializer.scala:63) > at > org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:122) > 
at > org.apache.spark.rdd.ReliableCheckpointRDD$.readCheckpointFile(ReliableCheckpointRDD.scala:300) > at > org.apache.spark.rdd.ReliableCheckpointRDD.compute(ReliableCheckpointRDD.scala:100) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:322) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at >
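The stack trace above fails inside FSInputChecker.verifySums, i.e. in client-side .crc verification, before any Spark deserialization runs. If the goal is only to establish whether the checksum machinery (e.g. a platform-specific JIT issue on this s390x/OpenJ9 setup) or the data itself is at fault, one hedged diagnostic, assuming the test's file: scheme checkpoints, is to bypass ChecksumFileSystem for a rerun:

```
# Diagnostic only -- disables .crc verification for the file:// scheme. If the
# test passes with this set, the data is intact and the failure lies in
# checksum generation/verification on this platform, not in the checkpoint
# files themselves. Do not leave this enabled in normal runs.
spark.hadoop.fs.file.impl  org.apache.hadoop.fs.RawLocalFileSystem
```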