Re: [R] discuss: removing lint-r checks for old branches
they do seem like real failures on branches 2.0 and 2.1.

regarding infrastructure, centos and ubuntu have lintr pinned to 1.0.1.9000, installed via:

devtools::install_github('jimhester/lintr@5431140')

builds on branches 2.2+ (and master) are passing R lint checks on both OSes as well. this includes PRB builds too. we're really close!

for once, i feel comfortable saying that i have the R ecosystem locked down, reproducible and working. :)

shane

On Sat, Aug 11, 2018 at 10:08 AM, Felix Cheung wrote:
> SGTM for old branches.
>
> I recall we need to upgrade to a newer lintr since it is missing some tests.
>
> Also, these seem like real test failures? Are they only happening in 2.1
> and 2.2?

--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu
[Structured Streaming SPARK-23966] Why is non-atomic rename a problem in the State Store?
Hi All,

I was going through this pull request about the new CheckpointFileManager abstraction in structured streaming, coming in 2.4:

https://issues.apache.org/jira/browse/SPARK-23966
https://github.com/apache/spark/pull/21048

I went through the code in detail and found that it introduces a very nice abstraction, which is much cleaner and more extensible for direct-write file systems like S3 (in addition to the current HDFS file system).

But I am unable to understand: does it really solve a problem in the existing State Store code in Spark 2.3?

My questions about these statements on the State Store code:

PR description: "Checkpoint files must be written atomically such that no partial files are generated."

QUESTION: When are partial files generated in the current code? I can see that data is first written to a temp-delta file and then renamed to the version.delta file. If something bad happens, the task will fail due to the thrown exception, and abort() will be called on the store to close and delete tempDeltaFileStream. That seems quite clean; in what case might partial files be generated?

PR description: "State Store behavior is incorrect - HDFS FileSystem implementation does not have atomic rename."

QUESTION: The HDFS filesystem rename operation itself is atomic. I think the line above refers to checking whether the destination file already exists and then taking the appropriate action, which together makes the renaming operation multi-step and hence non-atomic. But why is this behaviour incorrect? Even if multiple executors try to write the same version.delta file, only the first of them will succeed; the second one will see that the file exists and delete its own temp-delta file. Looks good. Am I missing anything here? Really curious to know which corner cases this new pull request is trying to solve.

Regards,
Chandan
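To make the question concrete, here is a minimal Python sketch of the write-to-temp-then-rename commit pattern the email describes (Spark's actual implementation is in Scala; the function name `commit_delta` and the file layout are made up for illustration). The point is that the exists-check and the rename are two separate operations: each may be atomic on its own, but the pair is not, so the outcome of two racing writers depends on interleaving.

```python
import os
import tempfile

def commit_delta(data: bytes, version_file: str) -> bool:
    """Commit a checkpoint delta: stream to a temp file, then
    rename it into place. Returns True if this writer won."""
    # Step 1: write everything to a temp file first, so a crash
    # mid-write never leaves a partial version file behind.
    dir_name = os.path.dirname(os.path.abspath(version_file))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # Steps 2 and 3 below are SEPARATE operations. Each may be
        # atomic on its own (HDFS rename is atomic), but another
        # writer can rename its own temp file in between them.
        if os.path.exists(version_file):   # step 2: check destination
            os.remove(tmp_path)            # lost the race; discard
            return False
        os.replace(tmp_path, version_file)  # step 3: rename into place
        return True
    except Exception:
        # Analogue of abort(): clean up the temp file on failure.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

On a direct-write store like S3 there is no rename at all: the "temp file" upload is itself a separate visible object, which is the kind of filesystem the CheckpointFileManager abstraction is meant to accommodate.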
Re: [R] discuss: removing lint-r checks for old branches
SGTM for old branches.

I recall we need to upgrade to a newer lintr since it is missing some tests.

Also, these seem like real test failures? Are they only happening in 2.1 and 2.2?

From: shane knapp
Sent: Friday, August 10, 2018 4:04 PM
To: Sean Owen
Cc: Shivaram Venkataraman; Reynold Xin; dev
Subject: Re: [R] discuss: removing lint-r checks for old branches

/agreemsg

On Fri, Aug 10, 2018 at 4:02 PM, Sean Owen wrote:
> Seems OK to proceed with shutting off lintr, as it was masking those.
>
> On Fri, Aug 10, 2018 at 6:01 PM shane knapp wrote:
>> ugh... R unit tests failed on both of these builds.
>> https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94583/artifact/R/target/
>> https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94584/artifact/R/target/
>>
>> On Fri, Aug 10, 2018 at 1:58 PM, Shivaram Venkataraman wrote:
>>> Sounds good to me as well. Thanks Shane.
>>>
>>> Shivaram
>>>
>>> On Fri, Aug 10, 2018 at 1:40 PM Reynold Xin wrote:
>>>> SGTM
>>>>
>>>> On Fri, Aug 10, 2018 at 1:39 PM shane knapp wrote:
>>>>> https://issues.apache.org/jira/browse/SPARK-25089
>>>>>
>>>>> basically since these branches are old, and there will be a greater than
>>>>> zero amount of work to get lint-r to pass (on the new ubuntu workers), sean
>>>>> and i are proposing to remove the lint-r checks for the builds.
>>>>>
>>>>> this is super not important for the 2.4 cut/code freeze, but i wanted to get
>>>>> this done before it gets pushed down my queue and before we revisit the
>>>>> ubuntu port.
>>>>>
>>>>> thanks in advance,
>>>>>
>>>>> shane
>>>>> --
>>>>> Shane Knapp
>>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>> https://rise.cs.berkeley.edu
Re: code freeze and branch cut for Apache Spark 2.4
Hi,

I made some changes to SPARK-24020 (https://github.com/apache/spark/pull/21109) and implemented spill-over to disk. I believe there are no objections left to the implementation and that this can now be merged. Please take a look.

Thanks,
Petar Zečević

Wenchen Fan @ 1970-01-01 01:00 CET:
> Some updates on the JIRA tickets that we want to resolve before Spark 2.4.
>
> green: merged
> orange: in progress
> red: likely to miss
>
> SPARK-24374: Support Barrier Execution Mode in Apache Spark
> The core functionality is finished, but we still need to add the Python API.
> Tracked by SPARK-24822
>
> SPARK-23899: Built-in SQL Function Improvement
> I think it's ready to go. Although there are still some functions in
> progress, the common ones are all merged.
>
> SPARK-14220: Build and test Spark against Scala 2.12
> It's close, just one last piece. Tracked by SPARK-25029
>
> SPARK-4502: Spark SQL reads unnecessary nested fields from Parquet
> Being reviewed.
>
> SPARK-24882: data source v2 API improvement
> PR is out, being reviewed.
>
> SPARK-24252: Add catalog support in Data Source V2
> Being reviewed.
>
> SPARK-24768: Have a built-in AVRO data source implementation
> It's close, just one last piece: decimal type support.
>
> SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect answers
> It turns out to be a very complicated issue; there is no consensus yet about
> what the right fix is. Likely to miss Spark 2.4 because it's a long-standing
> issue, not a regression.
>
> SPARK-24598: Datatype overflow conditions gives incorrect result
> We decided to keep the current behavior in Spark 2.4 and add some
> documentation (already done). We will reconsider this change in Spark 3.0.
>
> SPARK-24020: Sort-merge join inner range optimization
> There are some discussions about the design; I don't think we can reach a
> consensus within Spark 2.4.
>
> SPARK-24296: replicating large blocks over 2GB
> Being reviewed.
>
> SPARK-23874: upgrade to Apache Arrow 0.10.0
> Apache Arrow 0.10.0 has some critical bug fixes and is being voted on; we
> should wait a few days.
>
> According to the status, I think we should wait a few more days. Any
> objections?
>
> Thanks,
> Wenchen
>
> On Tue, Aug 7, 2018 at 3:39 AM Sean Owen wrote:
>> ... and we still have a few snags with Scala 2.12 support at
>> https://issues.apache.org/jira/browse/SPARK-25029
>>
>> There is some hope of resolving it on the order of a week, so for the
>> moment, it seems worth holding 2.4 for.
>>
>> On Mon, Aug 6, 2018 at 2:37 PM Bryan Cutler wrote:
>>> Hi All,
>>>
>>> I'd like to request a few days' extension to the code freeze to complete
>>> the upgrade to Apache Arrow 0.10.0, SPARK-23874. This upgrade includes
>>> several key improvements and bug fixes. The RC vote just passed this
>>> morning, and the code changes are complete in
>>> https://github.com/apache/spark/pull/21939. We just need some time for
>>> the release artifacts to be available. Thoughts?
>>>
>>> Thanks,
>>> Bryan

To unsubscribe e-mail: dev-unsubscr...@spark.apache.org