Re: [VOTE] Release Spark 3.2.0 (RC1)

2021-08-31 Thread Gengliang Wang
Hi Chao & DB,

Actually, I cut RC2 yesterday, before you posted the Parquet issue:
https://github.com/apache/spark/tree/v3.2.0-rc2
It has been 11 days since RC1. I think we can have RC2 today so that the
community can test and find potential issues earlier.
As for the Parquet issue, we can treat it as a known blocker. If it takes
more than one week (which is not likely to happen), we will have to consider
reverting Parquet 1.12 and related features from branch-3.2.

Gengliang

On Wed, Sep 1, 2021 at 5:40 AM DB Tsai  wrote:

> Hello Xiao, there are multiple patches in Spark 3.2 that depend on Parquet
> 1.12, so it might be easier to wait for the fix in the Parquet community
> instead of reverting all the related changes. The fix in the Parquet
> community is very trivial, and we hope that it will not take too long. Thanks.
> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>
>
> On Tue, Aug 31, 2021 at 1:09 PM Chao Sun  wrote:
>
>> Hi Xiao, I'm still checking with the Parquet community on this. Since the
>> fix is already +1'd, I'm hoping this won't take long. The delta in
>> parquet-1.12.x branch is also small with just 2 commits so far.
>>
>> Chao
>>
>> On Tue, Aug 31, 2021 at 12:03 PM Xiao Li  wrote:
>>
>>> Hi, Chao,
>>>
>>> How long will it take? Normally, in the RC stage, we always revert the
>>> upgrade made in the current release. We have done this for the Parquet
>>> upgrade multiple times in previous releases to avoid major delays in our
>>> Spark release.
>>>
>>> Thanks,
>>>
>>> Xiao
>>>
>>>
>>> On Tue, Aug 31, 2021 at 11:03 AM Chao Sun  wrote:
>>>
 The Apache Parquet community found an issue [1] in 1.12.0 which could cause
 an incorrect file offset to be written and subsequent reads of the same file
 to fail. A fix has been proposed in the same JIRA, and we may have to wait
 until a new release is available so that we can upgrade Spark with the hotfix.

 [1]: https://issues.apache.org/jira/browse/PARQUET-2078

 On Fri, Aug 27, 2021 at 7:06 AM Sean Owen  wrote:

> Maybe, I'm just confused why it's needed at all. Other profiles that
> add a dependency seem OK, but something's different here.
>
> One thing we can/should change is to simply remove the
>  block in the profile. It should always be a direct
> dep in Scala 2.13 (which lets us take out the profiles in submodules, 
> which
> just repeat that)
> We can also update the version, by the by.
>
> I tried this and the resulting POM still doesn't look like what I
> expect though.
>
> (The binary release is OK, FWIW - it gets pulled in as a JAR as
> expected)
>
> On Thu, Aug 26, 2021 at 11:34 PM Stephen Coy 
> wrote:
>
>> Hi Sean,
>>
>> I think that maybe the https://www.mojohaus.org/flatten-maven-plugin/ 
>> will
>> help you out here.
>>
>> Cheers,
>>
>> Steve C
>>
>> On 27 Aug 2021, at 12:29 pm, Sean Owen  wrote:
>>
>> OK right, you would have seen a different error otherwise.
>>
>> Yes profiles are only a compile-time thing, but they should affect
>> the effective POM for the artifact. mvn -Pscala-2.13 help:effective-pom
>> shows scala-parallel-collections as a dependency in the POM as expected
>> (not in a profile). However I see what you see in the .pom in the release
>> repo, and in my local repo after building - it's just sitting there as a
>> profile as if it weren't activated or something.
>>
>> I'm confused then, that shouldn't be what happens. I'd say maybe
>> there is a problem with the release script, but seems to affect a simple
>> local build. Anyone else more expert in this see the problem, while I try
>> to debug more?
>> The binary distro may actually be fine, I'll check; it may even not
>> matter much for users who generally just treat Spark as a 
>> compile-time-only
>> dependency either. But I can see it would break exactly your case,
>> something like a self-contained test job.
>>
>> On Thu, Aug 26, 2021 at 8:41 PM Stephen Coy 
>> wrote:
>>
>>> I did indeed.
>>>
>>> The generated spark-core_2.13-3.2.0.pom that is created alongside
>>> the jar file in the local repo contains:
>>>
>>> <profile>
>>>   <id>scala-2.13</id>
>>>   <dependencies>
>>>     <dependency>
>>>       <groupId>org.scala-lang.modules</groupId>
>>>       <artifactId>scala-parallel-collections_${scala.binary.version}</artifactId>
>>>     </dependency>
>>>   </dependencies>
>>> </profile>
>>>
>>> which means this dependency will be missing for unit tests that
>>> create SparkSessions from library code only, a technique inspired by
>>> Spark’s own unit tests.
>>>
>>> Cheers,
>>>
>>> Steve C
>>>
>>> On 27 Aug 2021, at 11:33 am, Sean Owen  wrote:
>>>
>>> Did you run ./dev/change-scala-version.sh 2.13 ? that's required
>>> first to update POMs. It works fine for me.
>>>
>>> On Thu, Aug 26, 2021 at 8:33 PM Stephen Coy <
>>> 

Re: CRAN package SparkR

2021-08-31 Thread Felix Cheung
I think a few lines to add the prompt might be enough. This checks for
interactive()

https://github.com/apache/spark/blob/c6a2021fec5bab9069fbfba33f75d4415ea76e99/R/pkg/R/sparkR.R#L658
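
For illustration, a minimal sketch of such a prompt (the helper name is
hypothetical, this is not the actual sparkR.R code) could look like the
following; install.spark() would then proceed with the download only when this
returns TRUE in an interactive session:

# Illustrative sketch only: ask the user before downloading Spark, and refuse
# to auto-install in non-interactive sessions.
confirmSparkInstall <- function(version) {
  if (!interactive()) {
    stop("Spark ", version, " is not installed; please run the installation manually.")
  }
  answer <- readline(prompt = paste0("Download and install Spark ", version,
                                     " into the local cache directory? [y/N] "))
  tolower(trimws(answer)) %in% c("y", "yes")
}

# e.g. proceed with the download only when confirmSparkInstall("3.2.0") is TRUE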


On Tue, Aug 31, 2021 at 5:55 PM Hyukjin Kwon  wrote:

> Oh, I missed this. Yes, can we simply get the user's confirmation when we
> run install.spark?
> IIRC, the auto installation is only triggered by the interactive shell, so
> getting the user's confirmation should be fine.
>
> On Fri, Jun 18, 2021 at 2:54 AM, Felix Cheung wrote:
>
>> Any suggestion or comment on this? They are going to remove the package
>> by 6-28
>>
>> It seems to me that if we have a switch to opt in to the install (not on by
>> default), or prompt the user in an interactive session, that should be good
>> enough as user confirmation.
>>
>>
>>
>> On Sun, Jun 13, 2021 at 11:25 PM Felix Cheung 
>> wrote:
>>
>>> It looks like they would not allow caching the Spark
>>> Distribution.
>>>
>>> I’m not sure what can be done about this.
>>>
>>> If I recall, the package should remove this during tests. Or maybe make
>>> spark.install() optional (hence getting user confirmation?)
>>>
>>>
>>> -- Forwarded message -
>>> Date: Sun, Jun 13, 2021 at 10:19 PM
>>> Subject: CRAN package SparkR
>>> To: Felix Cheung 
>>> CC: 
>>>
>>>
>>> Dear maintainer,
>>>
>>> Checking this apparently creates the default directory as per
>>>
>>> #' @param localDir a local directory where Spark is installed. The directory
>>> #'                 contains version-specific folders of Spark packages.
>>> #'                 Default is path to the cache directory:
>>> #' \itemize{
>>> #'   \item Mac OS X: \file{~/Library/Caches/spark}
>>> #'   \item Unix: \env{$XDG_CACHE_HOME} if defined, otherwise \file{~/.cache/spark}
>>> #'   \item Windows: \file{\%LOCALAPPDATA\%\\Apache\\Spark\\Cache}.
>>> #' }
>>>
>>> However, the CRAN Policy says
>>>
>>>   - Packages should not write in the user’s home filespace (including
>>> clipboards), nor anywhere else on the file system apart from the R
>>> session’s temporary directory (or during installation in the
>>> location pointed to by TMPDIR: and such usage should be cleaned
>>> up). Installing into the system’s R installation (e.g., scripts to
>>> its bin directory) is not allowed.
>>>
>>> Limited exceptions may be allowed in interactive sessions if the
>>> package obtains confirmation from the user.
>>>
>>> For R version 4.0 or later (hence a version dependency is required
>>> or only conditional use is possible), packages may store
>>> user-specific data, configuration and cache files in their
>>> respective user directories obtained from tools::R_user_dir(),
>>> provided that by default sizes are kept as small as possible and the
>>> contents are actively managed (including removing outdated
>>> material).
>>>
>>> Can you pls fix as necessary?
>>>
>>> Please fix before 2021-06-28 to safely retain your package on CRAN.
>>>
>>> Best
>>> -k
>>>
>>


Re: CRAN package SparkR

2021-08-31 Thread Hyukjin Kwon
Oh, I missed this. Yes, can we simply get the user's confirmation when we
run install.spark?
IIRC, the auto installation is only triggered by the interactive shell, so
getting the user's confirmation should be fine.

On Fri, Jun 18, 2021 at 2:54 AM, Felix Cheung wrote:

> Any suggestion or comment on this? They are going to remove the package by
> 6-28
>
> It seems to me that if we have a switch to opt in to the install (not on by
> default), or prompt the user in an interactive session, that should be good
> enough as user confirmation.
>
>
>
> On Sun, Jun 13, 2021 at 11:25 PM Felix Cheung 
> wrote:
>
>> It looks like they would not allow caching the Spark
>> Distribution.
>>
>> I’m not sure what can be done about this.
>>
>> If I recall, the package should remove this during tests. Or maybe make
>> spark.install() optional (hence getting user confirmation?)
>>
>>
>> -- Forwarded message -
>> Date: Sun, Jun 13, 2021 at 10:19 PM
>> Subject: CRAN package SparkR
>> To: Felix Cheung 
>> CC: 
>>
>>
>> Dear maintainer,
>>
>> Checking this apparently creates the default directory as per
>>
>> #' @param localDir a local directory where Spark is installed. The directory
>> #'                 contains version-specific folders of Spark packages.
>> #'                 Default is path to the cache directory:
>> #' \itemize{
>> #'   \item Mac OS X: \file{~/Library/Caches/spark}
>> #'   \item Unix: \env{$XDG_CACHE_HOME} if defined, otherwise \file{~/.cache/spark}
>> #'   \item Windows: \file{\%LOCALAPPDATA\%\\Apache\\Spark\\Cache}.
>> #' }
>>
>> However, the CRAN Policy says
>>
>>   - Packages should not write in the user’s home filespace (including
>> clipboards), nor anywhere else on the file system apart from the R
>> session’s temporary directory (or during installation in the
>> location pointed to by TMPDIR: and such usage should be cleaned
>> up). Installing into the system’s R installation (e.g., scripts to
>> its bin directory) is not allowed.
>>
>> Limited exceptions may be allowed in interactive sessions if the
>> package obtains confirmation from the user.
>>
>> For R version 4.0 or later (hence a version dependency is required
>> or only conditional use is possible), packages may store
>> user-specific data, configuration and cache files in their
>> respective user directories obtained from tools::R_user_dir(),
>> provided that by default sizes are kept as small as possible and the
>> contents are actively managed (including removing outdated
>> material).
>>
>> Can you pls fix as necessary?
>>
>> Please fix before 2021-06-28 to safely retain your package on CRAN.
>>
>> Best
>> -k
>>
>


Re: [VOTE] Release Spark 3.2.0 (RC1)

2021-08-31 Thread DB Tsai
Hello Xiao, there are multiple patches in Spark 3.2 that depend on Parquet
1.12, so it might be easier to wait for the fix in the Parquet community
instead of reverting all the related changes. The fix in the Parquet
community is very trivial, and we hope that it will not take too long. Thanks.
DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1


On Tue, Aug 31, 2021 at 1:09 PM Chao Sun  wrote:

> Hi Xiao, I'm still checking with the Parquet community on this. Since the
> fix is already +1'd, I'm hoping this won't take long. The delta in
> parquet-1.12.x branch is also small with just 2 commits so far.
>
> Chao
>
> On Tue, Aug 31, 2021 at 12:03 PM Xiao Li  wrote:
>
>> Hi, Chao,
>>
>> How long will it take? Normally, in the RC stage, we always revert the
>> upgrade made in the current release. We have done this for the Parquet
>> upgrade multiple times in previous releases to avoid major delays in our
>> Spark release.
>>
>> Thanks,
>>
>> Xiao
>>
>>
>> On Tue, Aug 31, 2021 at 11:03 AM Chao Sun  wrote:
>>
>>> The Apache Parquet community found an issue [1] in 1.12.0 which could cause
>>> an incorrect file offset to be written and subsequent reads of the same file
>>> to fail. A fix has been proposed in the same JIRA, and we may have to wait
>>> until a new release is available so that we can upgrade Spark with the hotfix.
>>>
>>> [1]: https://issues.apache.org/jira/browse/PARQUET-2078
>>>
>>> On Fri, Aug 27, 2021 at 7:06 AM Sean Owen  wrote:
>>>
 Maybe, I'm just confused why it's needed at all. Other profiles that
 add a dependency seem OK, but something's different here.

 One thing we can/should change is to simply remove the
  block in the profile. It should always be a direct
 dep in Scala 2.13 (which lets us take out the profiles in submodules, which
 just repeat that)
 We can also update the version, by the by.

 I tried this and the resulting POM still doesn't look like what I
 expect though.

 (The binary release is OK, FWIW - it gets pulled in as a JAR as
 expected)

 On Thu, Aug 26, 2021 at 11:34 PM Stephen Coy 
 wrote:

> Hi Sean,
>
> I think that maybe the https://www.mojohaus.org/flatten-maven-plugin/ will
> help you out here.
>
> Cheers,
>
> Steve C
>
> On 27 Aug 2021, at 12:29 pm, Sean Owen  wrote:
>
> OK right, you would have seen a different error otherwise.
>
> Yes profiles are only a compile-time thing, but they should affect the
> effective POM for the artifact. mvn -Pscala-2.13 help:effective-pom shows
> scala-parallel-collections as a dependency in the POM as expected (not in 
> a
> profile). However I see what you see in the .pom in the release repo, and
> in my local repo after building - it's just sitting there as a profile as
> if it weren't activated or something.
>
> I'm confused then, that shouldn't be what happens. I'd say maybe there
> is a problem with the release script, but seems to affect a simple local
> build. Anyone else more expert in this see the problem, while I try to
> debug more?
> The binary distro may actually be fine, I'll check; it may even not
> matter much for users who generally just treat Spark as a 
> compile-time-only
> dependency either. But I can see it would break exactly your case,
> something like a self-contained test job.
>
> On Thu, Aug 26, 2021 at 8:41 PM Stephen Coy 
> wrote:
>
>> I did indeed.
>>
>> The generated spark-core_2.13-3.2.0.pom that is created alongside the
>> jar file in the local repo contains:
>>
>> <profile>
>>   <id>scala-2.13</id>
>>   <dependencies>
>>     <dependency>
>>       <groupId>org.scala-lang.modules</groupId>
>>       <artifactId>scala-parallel-collections_${scala.binary.version}</artifactId>
>>     </dependency>
>>   </dependencies>
>> </profile>
>>
>> which means this dependency will be missing for unit tests that
>> create SparkSessions from library code only, a technique inspired by
>> Spark’s own unit tests.
>>
>> Cheers,
>>
>> Steve C
>>
>> On 27 Aug 2021, at 11:33 am, Sean Owen  wrote:
>>
>> Did you run ./dev/change-scala-version.sh 2.13 ? that's required
>> first to update POMs. It works fine for me.
>>
>> On Thu, Aug 26, 2021 at 8:33 PM Stephen Coy <
>> s...@infomedia.com.au.invalid> wrote:
>>
>>> Hi all,
>>>
>>> Being adventurous I have built the RC1 code with:
>>>
>>> -Pyarn -Phadoop-3.2  -Pyarn -Phadoop-cloud -Phive-thriftserver
>>> -Phive-2.3 -Pscala-2.13 -Dhadoop.version=3.2.2
>>>
>>>
>>> And then attempted to build my Java based spark application.
>>>
>>> However, I found a number of our unit tests were failing with:
>>>
>>> java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport
>>>
>>> at
>>> org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1412)
>>> at
>>> 

Re: [VOTE] Release Spark 3.2.0 (RC1)

2021-08-31 Thread Chao Sun
Hi Xiao, I'm still checking with the Parquet community on this. Since the
fix is already +1'd, I'm hoping this won't take long. The delta in
parquet-1.12.x branch is also small with just 2 commits so far.

Chao

On Tue, Aug 31, 2021 at 12:03 PM Xiao Li  wrote:

> Hi, Chao,
>
> How long will it take? Normally, in the RC stage, we always revert the
> upgrade made in the current release. We have done this for the Parquet
> upgrade multiple times in previous releases to avoid major delays in our
> Spark release.
>
> Thanks,
>
> Xiao
>
>
> On Tue, Aug 31, 2021 at 11:03 AM Chao Sun  wrote:
>
>> The Apache Parquet community found an issue [1] in 1.12.0 which could cause
>> an incorrect file offset to be written and subsequent reads of the same file
>> to fail. A fix has been proposed in the same JIRA, and we may have to wait
>> until a new release is available so that we can upgrade Spark with the hotfix.
>>
>> [1]: https://issues.apache.org/jira/browse/PARQUET-2078
>>
>> On Fri, Aug 27, 2021 at 7:06 AM Sean Owen  wrote:
>>
>>> Maybe, I'm just confused why it's needed at all. Other profiles that add
>>> a dependency seem OK, but something's different here.
>>>
>>> One thing we can/should change is to simply remove the
>>>  block in the profile. It should always be a direct
>>> dep in Scala 2.13 (which lets us take out the profiles in submodules, which
>>> just repeat that)
>>> We can also update the version, by the by.
>>>
>>> I tried this and the resulting POM still doesn't look like what I expect
>>> though.
>>>
>>> (The binary release is OK, FWIW - it gets pulled in as a JAR as expected)
>>>
>>> On Thu, Aug 26, 2021 at 11:34 PM Stephen Coy 
>>> wrote:
>>>
 Hi Sean,

 I think that maybe the https://www.mojohaus.org/flatten-maven-plugin/ will
 help you out here.

 Cheers,

 Steve C

 On 27 Aug 2021, at 12:29 pm, Sean Owen  wrote:

 OK right, you would have seen a different error otherwise.

 Yes profiles are only a compile-time thing, but they should affect the
 effective POM for the artifact. mvn -Pscala-2.13 help:effective-pom shows
 scala-parallel-collections as a dependency in the POM as expected (not in a
 profile). However I see what you see in the .pom in the release repo, and
 in my local repo after building - it's just sitting there as a profile as
 if it weren't activated or something.

 I'm confused then, that shouldn't be what happens. I'd say maybe there
 is a problem with the release script, but seems to affect a simple local
 build. Anyone else more expert in this see the problem, while I try to
 debug more?
 The binary distro may actually be fine, I'll check; it may even not
 matter much for users who generally just treat Spark as a compile-time-only
 dependency either. But I can see it would break exactly your case,
 something like a self-contained test job.

 On Thu, Aug 26, 2021 at 8:41 PM Stephen Coy 
 wrote:

> I did indeed.
>
> The generated spark-core_2.13-3.2.0.pom that is created alongside the
> jar file in the local repo contains:
>
> <profile>
>   <id>scala-2.13</id>
>   <dependencies>
>     <dependency>
>       <groupId>org.scala-lang.modules</groupId>
>       <artifactId>scala-parallel-collections_${scala.binary.version}</artifactId>
>     </dependency>
>   </dependencies>
> </profile>
>
> which means this dependency will be missing for unit tests that create
> SparkSessions from library code only, a technique inspired by Spark’s own
> unit tests.
>
> Cheers,
>
> Steve C
>
> On 27 Aug 2021, at 11:33 am, Sean Owen  wrote:
>
> Did you run ./dev/change-scala-version.sh 2.13 ? that's required first
> to update POMs. It works fine for me.
>
> On Thu, Aug 26, 2021 at 8:33 PM Stephen Coy <
> s...@infomedia.com.au.invalid> wrote:
>
>> Hi all,
>>
>> Being adventurous I have built the RC1 code with:
>>
>> -Pyarn -Phadoop-3.2  -Pyarn -Phadoop-cloud -Phive-thriftserver
>> -Phive-2.3 -Pscala-2.13 -Dhadoop.version=3.2.2
>>
>>
>> And then attempted to build my Java based spark application.
>>
>> However, I found a number of our unit tests were failing with:
>>
>> java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport
>>
>> at
>> org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1412)
>> at
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>> at
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>> at org.apache.spark.SparkContext.withScope(SparkContext.scala:789)
>> at org.apache.spark.SparkContext.union(SparkContext.scala:1406)
>> at
>> org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:698)
>> at
>> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:184)
>> …
>>
>>
>> I tracked this down to a missing 

[build system] DNS outage @ uc berkeley, jenkins not available

2021-08-31 Thread shane knapp ☠
we're having some DNS issues here in the EECS department, and our
crack team is working on getting it resolved asap.  until then,
jenkins isn't visible to the outside world.

shane
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Spark 3.2.0 (RC1)

2021-08-31 Thread Xiao Li
Hi, Chao,

How long will it take? Normally, in the RC stage, we always revert the
upgrade made in the current release. We have done this for the Parquet
upgrade multiple times in previous releases to avoid major delays in our
Spark release.

Thanks,

Xiao


On Tue, Aug 31, 2021 at 11:03 AM Chao Sun  wrote:

> The Apache Parquet community found an issue [1] in 1.12.0 which could cause
> an incorrect file offset to be written and subsequent reads of the same file
> to fail. A fix has been proposed in the same JIRA, and we may have to wait
> until a new release is available so that we can upgrade Spark with the hotfix.
>
> [1]: https://issues.apache.org/jira/browse/PARQUET-2078
>
> On Fri, Aug 27, 2021 at 7:06 AM Sean Owen  wrote:
>
>> Maybe, I'm just confused why it's needed at all. Other profiles that add
>> a dependency seem OK, but something's different here.
>>
>> One thing we can/should change is to simply remove the
>>  block in the profile. It should always be a direct
>> dep in Scala 2.13 (which lets us take out the profiles in submodules, which
>> just repeat that)
>> We can also update the version, by the by.
>>
>> I tried this and the resulting POM still doesn't look like what I expect
>> though.
>>
>> (The binary release is OK, FWIW - it gets pulled in as a JAR as expected)
>>
>> On Thu, Aug 26, 2021 at 11:34 PM Stephen Coy 
>> wrote:
>>
>>> Hi Sean,
>>>
>>> I think that maybe the https://www.mojohaus.org/flatten-maven-plugin/ will
>>> help you out here.
>>>
>>> Cheers,
>>>
>>> Steve C
>>>
>>> On 27 Aug 2021, at 12:29 pm, Sean Owen  wrote:
>>>
>>> OK right, you would have seen a different error otherwise.
>>>
>>> Yes profiles are only a compile-time thing, but they should affect the
>>> effective POM for the artifact. mvn -Pscala-2.13 help:effective-pom shows
>>> scala-parallel-collections as a dependency in the POM as expected (not in a
>>> profile). However I see what you see in the .pom in the release repo, and
>>> in my local repo after building - it's just sitting there as a profile as
>>> if it weren't activated or something.
>>>
>>> I'm confused then, that shouldn't be what happens. I'd say maybe there
>>> is a problem with the release script, but seems to affect a simple local
>>> build. Anyone else more expert in this see the problem, while I try to
>>> debug more?
>>> The binary distro may actually be fine, I'll check; it may even not
>>> matter much for users who generally just treat Spark as a compile-time-only
>>> dependency either. But I can see it would break exactly your case,
>>> something like a self-contained test job.
>>>
>>> On Thu, Aug 26, 2021 at 8:41 PM Stephen Coy 
>>> wrote:
>>>
 I did indeed.

 The generated spark-core_2.13-3.2.0.pom that is created alongside the
 jar file in the local repo contains:

 
<profile>
  <id>scala-2.13</id>
  <dependencies>
    <dependency>
      <groupId>org.scala-lang.modules</groupId>
      <artifactId>scala-parallel-collections_${scala.binary.version}</artifactId>
    </dependency>
  </dependencies>
</profile>

 which means this dependency will be missing for unit tests that create
 SparkSessions from library code only, a technique inspired by Spark’s own
 unit tests.

 Cheers,

 Steve C

 On 27 Aug 2021, at 11:33 am, Sean Owen  wrote:

 Did you run ./dev/change-scala-version.sh 2.13 ? that's required first
 to update POMs. It works fine for me.

 On Thu, Aug 26, 2021 at 8:33 PM Stephen Coy <
 s...@infomedia.com.au.invalid> wrote:

> Hi all,
>
> Being adventurous I have built the RC1 code with:
>
> -Pyarn -Phadoop-3.2  -Pyarn -Phadoop-cloud -Phive-thriftserver
> -Phive-2.3 -Pscala-2.13 -Dhadoop.version=3.2.2
>
>
> And then attempted to build my Java based spark application.
>
> However, I found a number of our unit tests were failing with:
>
> java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport
>
> at
> org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1412)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> at org.apache.spark.SparkContext.withScope(SparkContext.scala:789)
> at org.apache.spark.SparkContext.union(SparkContext.scala:1406)
> at
> org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:698)
> at
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:184)
> …
>
>
> I tracked this down to a missing dependency:
>
> <dependency>
>   <groupId>org.scala-lang.modules</groupId>
>   <artifactId>scala-parallel-collections_${scala.binary.version}</artifactId>
> </dependency>
>
>
> which unfortunately appears only in a profile in the pom files
> associated with the various spark dependencies.
>
> As far as I know it is not possible to activate profiles in
> dependencies in maven builds.
>
> Therefore I suspect that 

Re: [VOTE] Release Spark 3.2.0 (RC1)

2021-08-31 Thread Chao Sun
The Apache Parquet community found an issue [1] in 1.12.0 which could cause
an incorrect file offset to be written and subsequent reads of the same file
to fail. A fix has been proposed in the same JIRA, and we may have to wait
until a new release is available so that we can upgrade Spark with the hotfix.

[1]: https://issues.apache.org/jira/browse/PARQUET-2078

On Fri, Aug 27, 2021 at 7:06 AM Sean Owen  wrote:

> Maybe, I'm just confused why it's needed at all. Other profiles that add a
> dependency seem OK, but something's different here.
>
> One thing we can/should change is to simply remove the
>  block in the profile. It should always be a direct
> dep in Scala 2.13 (which lets us take out the profiles in submodules, which
> just repeat that)
> We can also update the version, by the by.
>
> I tried this and the resulting POM still doesn't look like what I expect
> though.
>
> (The binary release is OK, FWIW - it gets pulled in as a JAR as expected)
>
> On Thu, Aug 26, 2021 at 11:34 PM Stephen Coy 
> wrote:
>
>> Hi Sean,
>>
>> I think that maybe the https://www.mojohaus.org/flatten-maven-plugin/ will
>> help you out here.
>>
>> Cheers,
>>
>> Steve C
>>
>> On 27 Aug 2021, at 12:29 pm, Sean Owen  wrote:
>>
>> OK right, you would have seen a different error otherwise.
>>
>> Yes profiles are only a compile-time thing, but they should affect the
>> effective POM for the artifact. mvn -Pscala-2.13 help:effective-pom shows
>> scala-parallel-collections as a dependency in the POM as expected (not in a
>> profile). However I see what you see in the .pom in the release repo, and
>> in my local repo after building - it's just sitting there as a profile as
>> if it weren't activated or something.
>>
>> I'm confused then, that shouldn't be what happens. I'd say maybe there is
>> a problem with the release script, but seems to affect a simple local
>> build. Anyone else more expert in this see the problem, while I try to
>> debug more?
>> The binary distro may actually be fine, I'll check; it may even not
>> matter much for users who generally just treat Spark as a compile-time-only
>> dependency either. But I can see it would break exactly your case,
>> something like a self-contained test job.
>>
>> On Thu, Aug 26, 2021 at 8:41 PM Stephen Coy 
>> wrote:
>>
>>> I did indeed.
>>>
>>> The generated spark-core_2.13-3.2.0.pom that is created alongside the
>>> jar file in the local repo contains:
>>>
>>> <profile>
>>>   <id>scala-2.13</id>
>>>   <dependencies>
>>>     <dependency>
>>>       <groupId>org.scala-lang.modules</groupId>
>>>       <artifactId>scala-parallel-collections_${scala.binary.version}</artifactId>
>>>     </dependency>
>>>   </dependencies>
>>> </profile>
>>>
>>> which means this dependency will be missing for unit tests that create
>>> SparkSessions from library code only, a technique inspired by Spark’s own
>>> unit tests.
>>>
>>> Cheers,
>>>
>>> Steve C
>>>
>>> On 27 Aug 2021, at 11:33 am, Sean Owen  wrote:
>>>
>>> Did you run ./dev/change-scala-version.sh 2.13 ? that's required first
>>> to update POMs. It works fine for me.
>>>
>>> On Thu, Aug 26, 2021 at 8:33 PM Stephen Coy <
>>> s...@infomedia.com.au.invalid> wrote:
>>>
 Hi all,

 Being adventurous I have built the RC1 code with:

 -Pyarn -Phadoop-3.2  -Pyarn -Phadoop-cloud -Phive-thriftserver
 -Phive-2.3 -Pscala-2.13 -Dhadoop.version=3.2.2


 And then attempted to build my Java based spark application.

 However, I found a number of our unit tests were failing with:

 java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport

 at
 org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1412)
 at
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
 at org.apache.spark.SparkContext.withScope(SparkContext.scala:789)
 at org.apache.spark.SparkContext.union(SparkContext.scala:1406)
 at
 org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:698)
 at
 org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:184)
 …


 I tracked this down to a missing dependency:

 
<dependency>
  <groupId>org.scala-lang.modules</groupId>
  <artifactId>scala-parallel-collections_${scala.binary.version}</artifactId>
</dependency>


 which unfortunately appears only in a profile in the pom files
 associated with the various spark dependencies.

 As far as I know it is not possible to activate profiles in
 dependencies in maven builds.

 Therefore I suspect that right now a Scala 2.13 migration is not quite
 as seamless as we would like.

 I stress that this is only an issue for developers that write unit
 tests for their applications, as the Spark runtime environment will always
 have the necessary dependencies available to it.

 (You might consider upgrading the
 org.scala-lang.modules:scala-parallel-collections_2.13 version from 0.2 to
 1.0.3 though!)

 

Discuss about current yarn client mode problem

2021-08-31 Thread angers zhu
Hi devs,

In the current yarn-client mode, we have several problems:

   1. When the AM loses its connection with the driver, it just finishes the
   application with a final status of SUCCESS. Then
   YarnClientSchedulerBackend.MonitorThread gets an application report with
   the SUCCESS final status and calls sc.stop(). The SparkContext stops and
   the program exits with exit code 0. Scheduler systems always use the exit
   code to judge whether the application succeeded, so neither the scheduler
   system nor the user knows that the job actually failed.
   2. In YarnClientSchedulerBackend.MonitorThread, even when it gets a YARN
   report with a FAILED or KILLED final status, it just calls sc.stop() and
   the program exits with code 0. When a user kills the wrong application,
   the real owner of the killed application still sees an incorrect SUCCESS
   status for their job. (A rough sketch of the handling I have in mind
   follows this list.)
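
To make this concrete, here is a simplified, illustrative Scala sketch (the
status type and names are stand-ins for illustration, not the real YARN API or
the actual MonitorThread code): propagate the YARN final status into the
driver's exit code instead of always exiting with 0, so external schedulers
can rely on the exit code again.

object YarnClientExitCodeSketch {
  // Simplified stand-in for YARN's final application status values.
  sealed trait FinalStatus
  case object Succeeded extends FinalStatus
  case object Failed extends FinalStatus
  case object Killed extends FinalStatus

  // Map the final status reported by YARN to a driver exit code, so that
  // schedulers can tell failed or killed runs apart from successful ones.
  def exitCodeFor(status: FinalStatus): Int = status match {
    case Succeeded => 0
    case Failed    => 1
    case Killed    => 2
  }

  def main(args: Array[String]): Unit = {
    // For example, after the monitor thread observes the final report and the
    // SparkContext has been stopped, exit with the mapped code instead of 0.
    val reportedStatus: FinalStatus = Failed
    sys.exit(exitCodeFor(reportedStatus))
  }
}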

There has been some earlier discussion of these two problems in SPARK-3627
and SPARK-1516, but those were very early discussions. Now Spark is widely
used by various companies, and a lot of Spark-related job scheduling systems
have been developed accordingly. These problems confuse users and make it
hard for them to manage their jobs.

I hope to get more feedback from the developers, or to hear whether there is
already a good way to avoid these problems.

Below are some of my related PRs for these two problems:
https://github.com/apache/spark/pull/33780
https://github.com/apache/spark/pull/33780

Best regards
Angers