Re: [R] discuss: removing lint-r checks for old branches

2018-08-11 Thread shane knapp
they do seem like real failures on branches 2.0 and 2.1.

regarding infrastructure, centos and ubuntu have lintr pinned to
1.0.1.9000, and installed via:
devtools::install_github('jimhester/lintr@5431140')

builds on branches 2.2+ (and master) are passing R lint checks on both OSes
as well.  this includes PRB builds too.  we're really close!

for once, i feel comfortable saying that i have the R ecosystem locked
down, reproducible and working.  :)

shane


On Sat, Aug 11, 2018 at 10:08 AM, Felix Cheung  wrote:

> SGTM for old branches.
>
> I recall we need to upgrade to a newer lintr since it is missing some tests.
>
> Also, these seem like real test failures? Are they only happening in 2.1
> and 2.2?
>
>
> --
> *From:* shane knapp 
> *Sent:* Friday, August 10, 2018 4:04 PM
> *To:* Sean Owen
> *Cc:* Shivaram Venkataraman; Reynold Xin; dev
> *Subject:* Re: [R] discuss: removing lint-r checks for old branches
>
> /agreemsg
>
> On Fri, Aug 10, 2018 at 4:02 PM, Sean Owen  wrote:
>
>> Seems OK to proceed with shutting off lintr, as it was masking those.
>>
>> On Fri, Aug 10, 2018 at 6:01 PM shane knapp  wrote:
>>
>>> ugh...  R unit tests failed on both of these builds.
>>> https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94583/artifact/R/target/
>>> https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94584/artifact/R/target/
>>>
>>>
>>>
>>> On Fri, Aug 10, 2018 at 1:58 PM, Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:
>>>
 Sounds good to me as well. Thanks Shane.

 Shivaram
 On Fri, Aug 10, 2018 at 1:40 PM Reynold Xin  wrote:
 >
 > SGTM
 >
 > On Fri, Aug 10, 2018 at 1:39 PM shane knapp  wrote:
 >>
 >> https://issues.apache.org/jira/browse/SPARK-25089
 >>
 >> basically since these branches are old, and there will be a greater than zero amount of work to get lint-r to pass (on the new ubuntu workers), sean and i are proposing to remove the lint-r checks for the builds.
 >>
 >> this is super not important for the 2.4 cut/code freeze, but i wanted to get this done before it gets pushed down my queue and before we revisit the ubuntu port.
 >>
 >> thanks in advance,
 >>
 >> shane
 >> --
 >> Shane Knapp
 >> UC Berkeley EECS Research / RISELab Staff Technical Lead
 >> https://rise.cs.berkeley.edu

>>>
>>>
>>>
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>



-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


[Structured Streaming SPARK-23966] Why is non-atomic rename a problem in the State Store?

2018-08-11 Thread chandan prakash
Hi All,
I was going through this pull request about the new CheckpointFileManager
abstraction in Structured Streaming coming in 2.4:
https://issues.apache.org/jira/browse/SPARK-23966
https://github.com/apache/spark/pull/21048

I went through the code in detail and found it introduces a very nice
abstraction which is much cleaner and extensible for direct-write file
systems like S3 (in addition to the current HDFS file system).
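
(For context, here is a rough sketch of the shape such an abstraction could
take. This is illustrative only -- the trait and method names below are my
own guesses for discussion, not the actual API in the PR.)

    import org.apache.hadoop.fs.Path

    // Hypothetical sketch of a checkpoint file manager abstraction.
    // A stream opened via createAtomic() either publishes the complete
    // file on close() or leaves nothing behind on cancel(), so readers
    // never observe partial files.
    trait SketchCheckpointFileManager {
      def createAtomic(path: Path, overwriteIfPossible: Boolean): CancellableStream
      def exists(path: Path): Boolean
      def delete(path: Path): Unit
    }

    trait CancellableStream {
      def write(bytes: Array[Byte]): Unit
      def close(): Unit   // commit: make the complete file visible
      def cancel(): Unit  // abort: discard everything written so far
    }

    // An HDFS-backed implementation can realize createAtomic() as
    // write-to-temp-then-rename; a direct-write store like S3 can upload
    // the object and make it visible only when close() succeeds.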

*But I am unable to understand: is it really solving some problem in the
existing State Store code in Spark 2.3?*

 *My questions about the following statements from the PR:*
 *PR description*: "Checkpoint files must be written atomically such that *no
partial files are generated*."
*QUESTION*: When are partial files generated in the current code? I can see
that data is first written to a temp-delta file and then renamed to the
version.delta file. If something bad happens, the task will fail due to the
thrown exception and abort() will be called on the store to close and delete
tempDeltaFileStream. That seems quite clean, so in what case might partial
files be generated?
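
(To make the pattern I am describing concrete, here is a simplified sketch --
not the actual Spark 2.3 state store code, just the write-temp-then-rename
shape of it as I read it:)

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Simplified sketch of write-temp-then-rename with cleanup on failure.
    def commitDelta(fs: FileSystem, dir: Path, version: Long, data: Array[Byte]): Unit = {
      val tempFile  = new Path(dir, s"temp-$version.delta")
      val finalFile = new Path(dir, s"$version.delta")
      val out = fs.create(tempFile)
      try {
        out.write(data)
        out.close()
        fs.rename(tempFile, finalFile)   // publish the completed file
      } catch {
        case e: Throwable =>
          // abort(): close the stream and delete the temp file, so no
          // partial file is ever left at the final path
          try out.close() finally fs.delete(tempFile, false)
          throw e
      }
    }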

 *PR description*: "*State Store behavior is incorrect - HDFS FileSystem
implementation does not have atomic rename*"
*QUESTION*: The HDFS filesystem rename operation is atomic. I think the line
above refers to checking whether the destination file already exists and then
taking the appropriate action, which together makes the renaming operation
multi-step and hence non-atomic. But why is this behaviour incorrect?
Even if multiple executors try to write to the same version.delta file,
only the first of them will succeed; the second one will see the file exists
and will delete its temp-delta file. Looks good.
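
(For concreteness, here is the multi-step sequence I understand the PR
description to mean -- an illustrative sketch, not actual Spark code:)

    import org.apache.hadoop.fs.{FileSystem, Path}

    // The exists() check and the rename() are two separate filesystem
    // calls. Each call is atomic on HDFS, but another writer can create
    // dest in the window between them, so the combined operation is not.
    def renameIfNotExists(fs: FileSystem, temp: Path, dest: Path): Boolean = {
      if (fs.exists(dest)) {       // step 1: check
        fs.delete(temp, false)     // lost the race: discard our temp file
        false
      } else {
        fs.rename(temp, dest)      // step 2: rename (dest may appear in between)
      }
    }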

Is there anything I am missing here?
I am really curious to know which corner cases this new pull request is
trying to solve.

Regards,
Chandan


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-11 Thread Petar Zečević


Hi, I made some changes to SPARK-24020
(https://github.com/apache/spark/pull/21109) and implemented spill-over to
disk. I believe no objections to the implementation remain and that this can
now be merged.

Please take a look.

Thanks,

Petar Zečević


Wenchen Fan wrote:

> Some updates for the JIRA tickets that we want to resolve before Spark 2.4.
>
> green: merged
> orange: in progress
> red: likely to miss
>
> SPARK-24374: Support Barrier Execution Mode in Apache Spark
> The core functionality is finished, but we still need to add the Python API.
> Tracked by SPARK-24822.
>
> SPARK-23899: Built-in SQL Function Improvement
> I think it's ready to go. Although some functions are still in progress, the
> common ones are all merged.
>
> SPARK-14220: Build and test Spark against Scala 2.12
> It's close, just one last piece. Tracked by SPARK-25029
>
> SPARK-4502: Spark SQL reads unnecessary nested fields from Parquet
> Being reviewed.
>
> SPARK-24882: data source v2 API improvement
> PR is out, being reviewed.
>
> SPARK-24252: Add catalog support in Data Source V2
> Being reviewed.
>
> SPARK-24768: Have a built-in AVRO data source implementation
> It's close, just one last piece: the decimal type support
>
> SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect answers
> It turns out to be a very complicated issue; there is no consensus yet about
> the right fix. Likely to miss Spark 2.4, since it's a long-standing issue,
> not a regression.
>
> SPARK-24598: Datatype overflow conditions gives incorrect result
> We decided to keep the current behavior in Spark 2.4 and add some
> documentation (already done). We will reconsider this change in Spark 3.0.
>
> SPARK-24020: Sort-merge join inner range optimization
> There are some discussions about the design; I don't think we can reach a
> consensus within Spark 2.4.
>
> SPARK-24296: replicating large blocks over 2GB
> Being reviewed.
>
> SPARK-23874: upgrade to Apache Arrow 0.10.0
> Apache Arrow 0.10.0 has some critical bug fixes and is being voted on; we
> should wait a few days.
>
> Given the status above, I think we should wait a few more days. Any
> objections?
>
> Thanks,
> Wenchen
>
> On Tue, Aug 7, 2018 at 3:39 AM Sean Owen  wrote:
>
>  ... and we still have a few snags with Scala 2.12 support at 
> https://issues.apache.org/jira/browse/SPARK-25029 
>
>  There is some hope of resolving it on the order of a week, so for the 
> moment, seems worth holding 2.4 for.
>
>  On Mon, Aug 6, 2018 at 2:37 PM Bryan Cutler  wrote:
>
>  Hi All,
>
>  I'd like to request a few days' extension to the code freeze to complete the 
> upgrade to Apache Arrow 0.10.0, SPARK-23874. This upgrade includes several 
> key improvements and bug fixes.  The RC vote just passed this morning and code
>  changes are complete in https://github.com/apache/spark/pull/21939. We just 
> need some time for the release artifacts to be available. Thoughts?
>
>  Thanks,
>  Bryan

