Re: [VOTE] Spark 2.3.0 (RC3)

2018-02-12 Thread Sameer Agarwal
I'll start the vote with a +1.

As of today, all known release blockers and QA tasks have been resolved,
and the jenkins builds are healthy:
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/

On 12 February 2018 at 22:30, Sameer Agarwal  wrote:

> Now that all known blockers have once again been resolved, please vote on
> releasing the following candidate as Apache Spark version 2.3.0. The vote
> is open until Friday February 16, 2018 at 8:00:00 am UTC and passes if a
> majority of at least 3 PMC +1 votes are cast.
>
>
> [ ] +1 Release this package as Apache Spark 2.3.0
>
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v2.3.0-rc3: https://github.com/apache/
> spark/tree/v2.3.0-rc3 (89f6fcbafcfb0a7aeb897fba6036cb085bd35121)
>
> List of JIRA tickets resolved in this release can be found here:
> https://issues.apache.org/jira/projects/SPARK/versions/12339551
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1264/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc3-
> docs/_site/index.html
>
>
> FAQ
>
> ===
> What are the unresolved issues targeted for 2.3.0?
> ===
>
> Please see https://s.apache.org/oXKi. At the time of writing, there are
> currently no known release blockers.
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark, you can set up a virtual env, install the
> current RC, and see if anything important breaks. In Java/Scala, you can
> add the staging repository to your project's resolvers and test with the RC
> (make sure to clean up the artifact cache before/after so you don't end up
> building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.3.0?
> ===
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Please retarget everything else to 2.3.1 or 2.4.0 as
> appropriate.
>
> ===
> Why is my bug not fixed?
> ===
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.2.0. That being said, if
> there is something that is a regression from 2.2.0 and has not been
> correctly targeted, please ping me or a committer to help target the issue
> (you can see the open issues listed as impacting Spark 2.3.0 at
> https://s.apache.org/WmoI).
>
>
> Regards,
> Sameer
>


[VOTE] Spark 2.3.0 (RC3)

2018-02-12 Thread Sameer Agarwal
Now that all known blockers have once again been resolved, please vote on
releasing the following candidate as Apache Spark version 2.3.0. The vote
is open until Friday February 16, 2018 at 8:00:00 am UTC and passes if a
majority of at least 3 PMC +1 votes are cast.


[ ] +1 Release this package as Apache Spark 2.3.0

[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v2.3.0-rc3:
https://github.com/apache/spark/tree/v2.3.0-rc3
(89f6fcbafcfb0a7aeb897fba6036cb085bd35121)

List of JIRA tickets resolved in this release can be found here:
https://issues.apache.org/jira/projects/SPARK/versions/12339551

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc3-bin/

Release artifacts are signed with the following key:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1264/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc3-docs/_site/index.html


FAQ

===
What are the unresolved issues targeted for 2.3.0?
===

Please see https://s.apache.org/oXKi. At the time of writing, there are
currently no known release blockers.

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark, you can set up a virtual env, install the
current RC, and see if anything important breaks. In Java/Scala, you can
add the staging repository to your project's resolvers and test with the RC
(make sure to clean up the artifact cache before/after so you don't end up
building with an out-of-date RC going forward).
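
For the Java/Scala route, a minimal build.sbt sketch is below (the staging URL
is the one listed above; the spark-sql module, version, and "provided" scope
are just illustrative choices, not requirements):

// build.sbt -- sketch for testing the RC against the staging repository
resolvers += "Spark 2.3.0 RC3 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1264/"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0" % "provided"

Afterwards, purging the RC artifacts from the local cache (for sbt/Ivy that is
~/.ivy2/cache/org.apache.spark) avoids accidentally building against the RC
later on.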

===
What should happen to JIRA tickets still targeting 2.3.0?
===

Committers should look at those and triage. Extremely important bug fixes,
documentation, and API tweaks that impact compatibility should be worked on
immediately. Please retarget everything else to 2.3.1 or 2.4.0 as
appropriate.

===
Why is my bug not fixed?
===

In order to make timely releases, we will typically not hold the release
unless the bug in question is a regression from 2.2.0. That being said, if
there is something that is a regression from 2.2.0 and has not been
correctly targeted, please ping me or a committer to help target the issue
(you can see the open issues listed as impacting Spark 2.3.0 at
https://s.apache.org/WmoI).


Regards,
Sameer


Regarding NimbusDS JOSE JWT jar 3.9 security vulnerability

2018-02-12 Thread sujith71955
Hi Folks,
I observed that in Spark 2.2.x we are using version 3.9 of the NimbusDS JOSE JWT
jar, but a few vulnerabilities have been reported against this particular
version. Please refer to the details below:
https://nvd.nist.gov/vuln/detail/CVE-2017-12973,
https://www.cvedetails.com/cve/CVE-2017-12972/

According to these reports, the vulnerabilities affect jar versions prior to
4.39, so we are planning to upgrade this jar.
I just wanted to know whether there is any reason why this jar has not been
upgraded in the community release, given these vulnerabilities.
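
In the meantime, one possible user-side workaround is to pin a patched version
of the jar in the application build. A minimal sbt sketch, assuming a patched
4.x release is acceptable (the version shown is a placeholder; check Maven
Central for the release you actually want):

// build.sbt -- hypothetical override forcing a newer nimbus-jose-jwt onto the
// application classpath; "4.41.1" is a placeholder version
dependencyOverrides += "com.nimbusds" % "nimbus-jose-jwt" % "4.41.1"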

Appreciate your suggestions.

Thanks,
Sujith 









Re: Corrupt parquet file

2018-02-12 Thread Ryan Blue
I wouldn't say we have a primary failure mode that we deal with. What we
concluded was that all the schemes we came up with to avoid corruption
couldn't cover all cases. For example, what about when memory holding a
value is corrupted just before it is handed off to the writer?

That's why we track down the source of the corruption and remove it from
our clusters and let Amazon know to remove the instance from the hardware
pool. We also structure our ETL so we have some time to reprocess.

rb

On Mon, Feb 12, 2018 at 11:49 AM, Steve Loughran wrote:

>
>
> On 12 Feb 2018, at 19:35, Dong Jiang  wrote:
>
> I got no error messages from EMR. We write directly from dataframe to S3.
> There doesn't appear to be an issue with the S3 file; we can still download
> the parquet file and read most of the columns, just one column is corrupted
> in parquet.
> I suspect we need to write to HDFS first, make sure we can read back the
> entire data set, and then copy from HDFS to S3. Any other thoughts?
>
>
>
> The s3 object store clients mostly buffer to local temp fs before they
> write, at least all the ASF connectors do, so that data can be PUT/POSTed
> in 5+MB blocks, without requiring enough heap to buffer all data written by
> all threads. That's done to file://, not HDFS. Even if you do that copy up
> later from HDFS to S3, there's still going to be that local HDD buffering:
> it's not going to fix the problem, not if this really is corrupted local
> HDD data.
>



-- 
Ryan Blue
Software Engineer
Netflix


Re: Drop the Hadoop 2.6 profile?

2018-02-12 Thread Steve Loughran
I'd advocate 2.7 over 2.6, primarily due to Kerberos and JVM versions


2.6 is not even qualified for Java 7, let alone Java 8: you've got no 
guarantees that things work on the min Java version Spark requires.

Kerberos is always the failure point here, as well as various libraries (jetty) 
which get used more on the server.

Except Guava, which gets everywhere and whose Java version policy is only
slightly more stable than its binary compatibility.

If tests aren't seeing those problems, it may mean that Kerberos is avoided,
which is always nice to do, but it'll find you later.

See HADOOP-11287, HADOOP-12716 (2.8+ only, presumably backported to CDH and
HDP), and HADOOP-10786 (which is in 2.6.1).

-Steve



On 8 Feb 2018, at 22:30, Koert Kuipers wrote:

Wire compatibility is relevant if Hadoop is included in the Spark build.


For those of us that build Spark without Hadoop included, Hadoop (binary) API
compatibility matters. I wouldn't want to build against Hadoop 2.7 and deploy
on Hadoop 2.6, but I am OK the other way around. So, to get compatibility with
all the major distros and cloud providers, building against Hadoop 2.6 is
currently the way to go.


On Thu, Feb 8, 2018 at 5:09 PM, Marcelo Vanzin wrote:
I think it would make sense to drop one of them, but not necessarily 2.6.

It kinda depends on what wire compatibility guarantees the Hadoop
libraries have; can a 2.6 client talk to 2.7 (pretty certain it can)?
Is the opposite safe (not sure)?

If the answer to the latter question is "no", then keeping 2.6 and
dropping 2.7 makes more sense. Those who really want a
Hadoop-version-specific package can override the needed versions in
the command line, or use the "without hadoop" package.

But in the context of trying to support 3.0 it makes sense to drop one
of them, at least from jenkins.


On Thu, Feb 8, 2018 at 2:03 PM, Sean Owen wrote:
> That would still work with a Hadoop-2.7-based profile, as there isn't
> actually any code difference in Spark that treats the two versions
> differently (nor, really, much different between 2.6 and 2.7 to begin with).
> This practice of different profile builds was pretty unnecessary after 2.2;
> it's mostly vestigial now.
>
> On Thu, Feb 8, 2018 at 3:57 PM Koert Kuipers wrote:
>>
>> CDH 5 is still based on hadoop 2.6
>>
>> On Thu, Feb 8, 2018 at 2:03 PM, Sean Owen wrote:
>>>
>>> Mostly just shedding the extra build complexity, and builds. The primary
>>> little annoyance is it's 2x the number of flaky build failures to examine.
>>> I suppose it allows using a 2.7+-only feature, but outside of YARN, not
>>> sure there is anything compelling.
>>>
>>> It's something that probably gains us virtually nothing now, but isn't
>>> too painful either.
>>> I think it will not make sense to distinguish them once any Hadoop
>>> 3-related support comes into the picture, and maybe that will start soon;
>>> there were some more pings on related JIRAs this week. You could view it as
>>> early setup for that move.
>>>
>>>
>>> On Thu, Feb 8, 2018 at 12:57 PM Reynold Xin wrote:

 Does it gain us anything to drop 2.6?

 > On Feb 8, 2018, at 10:50 AM, Sean Owen wrote:
 >
 > At this point, with Hadoop 3 on deck, I think Hadoop 2.6 is both fairly
 > old and, actually, not different from 2.7 with respect to Spark. That is,
 > I don't know if we are actually maintaining anything here but a separate
 > profile and 2x the number of test builds.
 >
 > The cost is, by the same token, low. However, I'm floating the idea of
 > removing the 2.6 profile and just requiring 2.7+ as of Spark 2.4?
>>
>>
>



--
Marcelo




Re: Corrupt parquet file

2018-02-12 Thread Steve Loughran


On 12 Feb 2018, at 19:35, Dong Jiang wrote:

I got no error messages from EMR. We write directly from dataframe to S3. There
doesn't appear to be an issue with the S3 file; we can still download the
parquet file and read most of the columns, just one column is corrupted in
parquet.
I suspect we need to write to HDFS first, make sure we can read back the entire
data set, and then copy from HDFS to S3. Any other thoughts?


The s3 object store clients mostly buffer to local temp fs before they write,
at least all the ASF connectors do, so that data can be PUT/POSTed in 5+MB
blocks, without requiring enough heap to buffer all data written by all
threads. That's done to file://, not HDFS. Even if you do that copy up later
from HDFS to S3, there's still going to be that local HDD buffering: it's not
going to fix the problem, not if this really is corrupted local HDD data.
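
For reference, if the write goes through the ASF S3A connector (s3a://) rather
than EMR's EMRFS, that local buffering is controlled by standard S3A options; a
small sketch (the buffer directory path is illustrative, and the
fs.s3a.fast.upload.buffer option is a Hadoop 2.8+ setting):

// Sketch: where S3A stages data locally before uploading it to S3
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val hconf = spark.sparkContext.hadoopConfiguration
hconf.set("fs.s3a.fast.upload.buffer", "disk")   // buffer upload blocks on local disk
hconf.set("fs.s3a.buffer.dir", "/mnt/tmp/s3a")   // local staging directory (illustrative)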


Re: Corrupt parquet file

2018-02-12 Thread Dong Jiang
I got no error messages from EMR. We write directly from dataframe to S3. There
doesn't appear to be an issue with the S3 file; we can still download the
parquet file and read most of the columns, just one column is corrupted in
parquet.
I suspect we need to write to HDFS first, make sure we can read back the entire
data set, and then copy from HDFS to S3. Any other thoughts?
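
For what it's worth, a minimal sketch of such a read-back check before the copy
to S3 (the HDFS path, expected row count, and the null-count heuristic are
placeholders, not values from this thread):

// Read back what was just written to HDFS and run cheap sanity checks
// before copying the files to S3.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum, when}

val spark = SparkSession.builder().getOrCreate()
val written = spark.read.parquet("hdfs:///staging/my_table")   // placeholder path

val expectedRows = 1000000L                                    // placeholder count
require(written.count() == expectedRows, "row count mismatch after write")

// Per-column null counts: an unexpected spike in one column of a wide table
// is a cheap signal that something went wrong with that column.
val nullCounts = written
  .select(written.columns.map(c => sum(when(col(c).isNull, 1).otherwise(0)).alias(c)): _*)
  .first()
println(nullCounts)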

From: Steve Loughran
Date: Monday, February 12, 2018 at 2:27 PM
To: "rb...@netflix.com"
Cc: Dong Jiang, Apache Spark Dev
Subject: Re: Corrupt parquet file

What failure mode is likely here?

As the uploads are signed, the network payload is not corruptible from the
moment it's written into the HTTPS request, which places the corruption earlier:

* RAM corruption which ECC doesn't pick up. It'd be interesting to know what 
stats & health checks AWS run here, such as, say, low-intensity RAM checks when 
VM space is idle.
* Any temp files buffering the blocks to HDD are being corrupted, which could 
happen with faulty physical disk? Is that likely?
* S3 itself is in trouble.

I don't see any checksum verification of disk-buffered block data before it is
uploaded to S3: the files are just handed straight off to the AWS SDK. I could
certainly force that through the Hadoop CRC check sequence, but that
complicates retransmission as well as performance.

What could work would be to build the MD5 sum of each block as it is written
from Spark to the buffer, then verify that the returned ETag of that POST/PUT
matches the original value. That'd give end-to-end error checking from JVM RAM
all the way to S3, leaving VM ECC and S3 itself as the failure points.

-Steve

my old work on this: 
https://www.slideshare.net/steve_l/did-you-reallywantthatdata



On 5 Feb 2018, at 18:41, Ryan Blue wrote:

In that case, I'd recommend tracking down the node where the files were created 
and reporting it to EMR.

On Mon, Feb 5, 2018 at 10:38 AM, Dong Jiang wrote:
Thanks for the response, Ryan.
We have transient EMR clusters, and we do rerun the cluster whenever it fails.
However, in this particular case, the cluster succeeded and did not report any
errors. I was able to null out the corrupted column and recover the rest of the
133 columns. I do feel the issue is more than 1-2 occurrences a year. This is
the second time I am aware of the issue within a month, and we certainly don't
run as large a data infrastructure as Netflix.

I will keep an eye on this issue.

Thanks,

Dong

From: Ryan Blue
Reply-To: "rb...@netflix.com"
Date: Monday, February 5, 2018 at 1:34 PM

To: Dong Jiang
Cc: Spark Dev List
Subject: Re: Corrupt parquet file


We ensure the bad node is removed from our cluster and reprocess to replace the 
data. We only see this once or twice a year, so it isn't a significant problem.

We've discussed options for adding write-side validation, but it is expensive 
and still unreliable if you don't trust the hardware.

rb

On Mon, Feb 5, 2018 at 10:28 AM, Dong Jiang wrote:
Hi, Ryan,

Do you have any suggestions on how we could detect and prevent this issue?
This is the second time we have encountered this issue. We have a wide table,
with 134 columns in the file. The issue seems to impact only one column and is
very hard to detect. It seems you have encountered this issue before; what do
you do to prevent a recurrence?

Thanks,

Dong

From: Ryan Blue
Reply-To: "rb...@netflix.com"
Date: Monday, February 5, 2018 at 12:46 PM

To: Dong Jiang
Cc: Spark Dev List
Subject: Re: Corrupt parquet file

If you can still access the logs, then you should be able to find where the 
write task ran. Maybe you can get an instance ID and open a ticket with Amazon. 
Otherwise, it will probably start failing the HW checks when the instance 
hardware is reused, so I wouldn't worry about it.

The _SUCCESS file convention means that the job ran successfully, at least to 
the point where _SUCCESS is created. I wouldn't rely on _SUCCESS to indicate 
actual job success (you could do other tasks after that fail) and it carries no 
guarantee about the data that was written.

rb

On Mon, Feb 5, 2018 at 9:41 AM, Dong Jiang wrote:
Hi, Ryan,

Many thanks for your quick response.
We ran Spark on transient EMR clusters. Nothing in the log or EMR events
suggests any issues with the cluster or the nodes.

Re: Corrupt parquet file

2018-02-12 Thread Steve Loughran
What failure mode is likely here?

As the uploads are signed, the network payload is not corruptible from the
moment it's written into the HTTPS request, which places the corruption earlier:

* RAM corruption which ECC doesn't pick up. It'd be interesting to know what 
stats & health checks AWS run here, such as, say, low-intensity RAM checks when 
VM space is idle.
* Any temp files buffering the blocks to HDD are being corrupted, which could 
happen with faulty physical disk? Is that likely?
* S3 itself is in trouble.

I don't see any checksum verification of disk-buffered block data before it is
uploaded to S3: the files are just handed straight off to the AWS SDK. I could
certainly force that through the Hadoop CRC check sequence, but that
complicates retransmission as well as performance.

What could work would be to build the MD5 sum of each block as it is written
from Spark to the buffer, then verify that the returned ETag of that POST/PUT
matches the original value. That'd give end-to-end error checking from JVM RAM
all the way to S3, leaving VM ECC and S3 itself as the failure points.

-Steve

my old work on this: 
https://www.slideshare.net/steve_l/did-you-reallywantthatdata
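
A rough sketch of that block-level check, assuming the AWS SDK for Java (v1)
and commons-codec are available, with placeholder bucket/key/file names. Note
the ETag-equals-MD5 property only holds for single-part, non-KMS-encrypted
PUTs, so this does not cover multipart uploads:

import java.io.{File, FileInputStream}
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import org.apache.commons.codec.digest.DigestUtils

// Compute the MD5 of the buffered block locally, upload it, then compare
// against the ETag S3 returns for the PUT.
val block = new File("/tmp/block-0001.bin")
val localMd5 = {
  val in = new FileInputStream(block)
  try DigestUtils.md5Hex(in) finally in.close()
}

val s3 = AmazonS3ClientBuilder.defaultClient()
val result = s3.putObject("my-bucket", "staging/block-0001.bin", block)
require(result.getETag.equalsIgnoreCase(localMd5),
  s"ETag mismatch for ${block.getName}: possible corruption between buffer and S3")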


On 5 Feb 2018, at 18:41, Ryan Blue wrote:

In that case, I'd recommend tracking down the node where the files were created 
and reporting it to EMR.

On Mon, Feb 5, 2018 at 10:38 AM, Dong Jiang wrote:
Thanks for the response, Ryan.
We have transient EMR clusters, and we do rerun the cluster whenever it fails.
However, in this particular case, the cluster succeeded and did not report any
errors. I was able to null out the corrupted column and recover the rest of the
133 columns. I do feel the issue is more than 1-2 occurrences a year. This is
the second time I am aware of the issue within a month, and we certainly don't
run as large a data infrastructure as Netflix.

I will keep an eye on this issue.

Thanks,

Dong

From: Ryan Blue
Reply-To: "rb...@netflix.com"
Date: Monday, February 5, 2018 at 1:34 PM

To: Dong Jiang
Cc: Spark Dev List
Subject: Re: Corrupt parquet file


We ensure the bad node is removed from our cluster and reprocess to replace the 
data. We only see this once or twice a year, so it isn't a significant problem.

We've discussed options for adding write-side validation, but it is expensive 
and still unreliable if you don't trust the hardware.

rb

On Mon, Feb 5, 2018 at 10:28 AM, Dong Jiang wrote:
Hi, Ryan,

Do you have any suggestions on how we could detect and prevent this issue?
This is the second time we have encountered this issue. We have a wide table,
with 134 columns in the file. The issue seems to impact only one column and is
very hard to detect. It seems you have encountered this issue before; what do
you do to prevent a recurrence?

Thanks,

Dong

From: Ryan Blue
Reply-To: "rb...@netflix.com"
Date: Monday, February 5, 2018 at 12:46 PM

To: Dong Jiang
Cc: Spark Dev List
Subject: Re: Corrupt parquet file

If you can still access the logs, then you should be able to find where the 
write task ran. Maybe you can get an instance ID and open a ticket with Amazon. 
Otherwise, it will probably start failing the HW checks when the instance 
hardware is reused, so I wouldn't worry about it.

The _SUCCESS file convention means that the job ran successfully, at least to 
the point where _SUCCESS is created. I wouldn't rely on _SUCCESS to indicate 
actual job success (you could do other tasks after that fail) and it carries no 
guarantee about the data that was written.

rb

On Mon, Feb 5, 2018 at 9:41 AM, Dong Jiang wrote:
Hi, Ryan,

Many thanks for your quick response.
We ran Spark on transient EMR clusters. Nothing in the log or EMR events
suggests any issues with the cluster or the nodes. We also see the _SUCCESS
file on S3. If we see the _SUCCESS file, does that suggest all data is good?
How can we prevent a recurrence? Can you share your experience?

Thanks,

Dong

From: Ryan Blue
Reply-To: "rb...@netflix.com"
Date: Monday, February 5, 2018 at 12:38 PM
To: Dong Jiang
Cc: Spark Dev List
Subject: Re: Corrupt parquet file

Dong,

We see this from time to