Re: RepartitionByKey Behavior

2018-06-26 Thread Chawla,Sumit
Thanks everyone. As Nathan suggested, I ended up collecting the distinct
keys first and then assigning IDs to each key explicitly.
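
For the archives, here is a rough, untested sketch of what that can look like. It
assumes, as in Nathan's example quoted below, that each record exposes a getKey
method and that the number of distinct keys is small enough to collect to the driver:

  import org.apache.spark.Partitioner

  // Collect the distinct keys and assign each one an explicit id.
  // Assumption: the key set is small enough to collect and broadcast.
  val keyToId = rdd.map(_.getKey).distinct.collect.zipWithIndex.toMap
  val idByKey = rdd.sparkContext.broadcast(keyToId)

  // One partition per key, so no two keys ever share a partition.
  val partitioned = rdd
    .map(r => (idByKey.value(r.getKey), r))
    .partitionBy(new Partitioner {
      def numPartitions: Int = idByKey.value.size
      def getPartition(key: Any): Int = key.asInstanceOf[Int]
    })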

Regards
Sumit Chawla


On Fri, Jun 22, 2018 at 7:29 AM, Nathan Kronenfeld <nkronenfeld@uncharted.software> wrote:

> On Thu, Jun 21, 2018 at 4:51 PM, Chawla,Sumit wrote:

> Hi
>
>  I have been trying to do this simple operation.  I want to land all
> values with one key in the same partition, and not have any different key in
> the same partition.  Is this possible?  I am getting b and c always
> mixed up in the same partition.
>
>
>
> I think you could do something approximately like:
>
>  val keys = rdd.map(_.getKey).distinct.zipWithIndex
>  val numKeys = keys.count.toInt
>  rdd.map(r => (r.getKey, r))
>    .join(keys)                                   // (key, (record, index))
>    .map { case (_, (record, index)) => (index, record) }
>    .partitionBy(new Partitioner {
>      def numPartitions = numKeys
>      def getPartition(key: Any) = key.asInstanceOf[Long].toInt
>    })
>
> i.e., key by a unique number, count that, and repartition by key to the
> exact count.  This presumes, of course, that the number of keys is small
> enough to use as a sensible partition count.
> Also, I haven't tested this code, so don't take it as anything more than
> an approximate idea, please :-)
>
>  -Nathan Kronenfeld
>


[VOTE] Spark 2.1.3 (RC2)

2018-06-26 Thread Marcelo Vanzin
Please vote on releasing the following candidate as Apache Spark version 2.1.3.

The vote is open until Fri, June 29th @ 9PM UTC (2PM PDT) and passes if a
majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.1.3
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.1.3-rc2 (commit b7eac07b):
https://github.com/apache/spark/tree/v2.1.3-rc2

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1275/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-docs/

The list of bug fixes going into 2.1.3 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12341660

Notes:

- RC1 was not sent for a vote. I had trouble building it, and by the time I got
  things fixed, there was a blocker bug filed. It was already tagged in git
  at that time.

- If testing the source package, I recommend using Java 8, even though 2.1
  supports Java 7 (and the RC was built with JDK 7). This is because Maven
  Central has updated some configuration that makes the default Java 7 SSL
  config not work.

- There are Maven artifacts published for Scala 2.10, but binary releases are
  only available for Scala 2.11. This matches the previous release (2.1.2), but
  if there's a need / desire to have pre-built distributions for Scala 2.10,
  I can probably amend the RC without having to create a new one.

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running it on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks; in Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
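
For example, an sbt project can pull the RC from the staging repository with
something like the following (just a sketch; it assumes the staged artifacts
are published under the plain 2.1.3 version number):

  // build.sbt: resolve the RC from the staging repository listed above
  resolvers += ("Apache Spark 2.1.3 RC2 staging"
    at "https://repository.apache.org/content/repositories/orgapachespark-1275/")

  // or whichever Spark modules your project actually uses
  libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.3" % "provided"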

===
What should happen to JIRA tickets still targeting 2.1.3?
===

The current list of open tickets targeted at 2.1.3 can be found at:
https://s.apache.org/spark-2.1.3

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Please retarget everything else to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Spark 2.1.3 (RC2)

2018-06-26 Thread Marcelo Vanzin
Starting with my own +1.

On Tue, Jun 26, 2018 at 1:25 PM, Marcelo Vanzin wrote:
> Please vote on releasing the following candidate as Apache Spark version 2.1.3.
>
> The vote is open until Fri, June 29th @ 9PM UTC (2PM PDT) and passes if a
> majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.1.3
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.1.3-rc2 (commit b7eac07b):
> https://github.com/apache/spark/tree/v2.1.3-rc2
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1275/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-docs/
>
> The list of bug fixes going into 2.1.3 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12341660
>
> Notes:
>
> - RC1 was not sent for a vote. I had trouble building it, and by the time I got
>   things fixed, there was a blocker bug filed. It was already tagged in git
>   at that time.
>
> - If testing the source package, I recommend using Java 8, even though 2.1
>   supports Java 7 (and the RC was built with JDK 7). This is because Maven
>   Central has updated some configuration that makes the default Java 7 SSL
>   config not work.
>
> - There are Maven artifacts published for Scala 2.10, but binary releases are
>   only available for Scala 2.11. This matches the previous release (2.1.2), but
>   if there's a need / desire to have pre-built distributions for Scala 2.10,
>   I can probably amend the RC without having to create a new one.
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running it on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks; in Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.1.3?
> ===
>
> The current list of open tickets targeted at 2.1.3 can be found at:
> https://s.apache.org/spark-2.1.3
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Please retarget everything else to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
> --
> Marcelo



-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



hadoop-aws versions (was Re: [VOTE] Spark 2.3.1 (RC4))

2018-06-26 Thread Steve Loughran
following up after a reference to this in
https://issues.apache.org/jira/browse/HADOOP-15559

the AWS SDK is a very fast-moving project, with a release cycle of ~2 weeks,
but it's in the state Fred Brooks described: "the number of bugs is constant,
they just move around". Bumping up an AWS SDK release is always fun (see
https://issues.apache.org/jira/browse/HADOOP-14596), and usually results in one
or more issues being raised with the AWS SDK project, us doing a workaround, and
then a later release fixing that while adding something new.


On 2 Jun 2018, at 02:51, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:


pyspark --packages org.apache.hadoop:hadoop-aws:2.7.3 didn’t work for me either 
(even building with -Phadoop-2.7). I guess I’ve been relying on an unsupported 
pattern and will need to figure something else out going forward in order to 
use s3a://.


Ideally the ASF releases should be done with the -Phadoop-cloud option, just to
get the relevant spark-hadoop-cloud module into the ASF repo, at which point
you could just depend on it and get not just the things you need but also none
of the things you don't, which is always the second half of the
working-with-transitive-dependencies problem.

For Hadoop 2.8.x & Spark 2.3, you'll need to do the following (a rough sbt sketch follows the list):

* have hadoop-* consistent
* use the aws-sdk for that version: http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.8.2
* revert out the httpclient updates of SPARK-22919
* exclude any declared jackson dependencies of the hadoop-aws/aws sdk modules (HADOOP-13692)
* make sure joda-time >= 2.8.1 is on the classpath, else you can't authenticate with AWS on a JVM >= 8u51
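
Here is a rough sbt sketch of the dependency side of that list (versions and
exclusions are illustrative only; reverting the SPARK-22919 httpclient change
still has to happen in the Spark build itself):

  // build.sbt: hadoop-aws pulls in the matching aws-java-sdk transitively;
  // keep all hadoop-* artifacts on the same version.
  libraryDependencies ++= Seq(
    ("org.apache.hadoop" % "hadoop-aws" % "2.8.2")
      // drop its declared jackson deps (HADOOP-13692)
      .excludeAll(ExclusionRule(organization = "com.fasterxml.jackson.core")),
    // joda-time >= 2.8.1 so AWS authentication works on JVM >= 8u51
    "joda-time" % "joda-time" % "2.9.9"
  )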

this is why Hadoop 2.9+ has moved to a (very fat) shaded AWS SDK JAR; you only
need to keep the hadoop-* and aws-sdk-bundle JARs in sync, at least provided the
shaded JAR doesn't actually declare things
(see https://issues.apache.org/jira/browse/HADOOP-15264). We feel that pain too.

Anyway, sorry to hear of your suffering.

Nicholas, ping me direct if you are trying to debug things here

-steve


On Fri, Jun 1, 2018 at 9:09 PM Marcelo Vanzin <van...@cloudera.com> wrote:
I have personally never tried to include hadoop-aws that way. But at
the very least, I'd try to use the same version of Hadoop as the Spark
build (2.7.3 IIRC). I don't really expect a different version to work,
and if it did in the past it definitely was not by design.

On Fri, Jun 1, 2018 at 5:50 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
> Building with -Phadoop-2.7 didn’t help, and if I remember correctly,
> building with -Phadoop-2.8 worked with hadoop-aws in the 2.3.0 release, so
> it appears something has changed since then.
>
> I wasn’t familiar with -Phadoop-cloud, but I can try that.
>
> My goal here is simply to confirm that this release of Spark works with
> hadoop-aws like past releases did, particularly for Flintrock users who use
> Spark with S3A.
>
> We currently provide -hadoop2.6, -hadoop2.7, and -without-hadoop builds with
> every Spark release. If the -hadoop2.7 release build won’t work with
> hadoop-aws anymore, are there plans to provide a new build type that will?


>
> Apologies if the question is poorly formed. I’m batting a bit outside my
> league here. Again, my goal is simply to confirm that I/my users still have
> a way to use s3a://. In the past, that way was simply to call pyspark
> --packages org.apache.hadoop:hadoop-aws:2.8.4 or something very similar. If
> that will no longer work, I’m trying to confirm that the change of behavior
> is intentional or acceptable (as a review for the Spark project) and figure
> out what I need to change (as due diligence for Flintrock’s users).
>
> Nick
>
>
> On Fri, Jun 1, 2018 at 8:21 PM Marcelo Vanzin <van...@cloudera.com> wrote:
>>
>> Using the hadoop-aws package is probably going to be a little more
>> complicated than that. The best bet is to use a custom build of Spark
>> that includes it (use -Phadoop-cloud). Otherwise you're probably
>> looking at some nasty dependency issues, especially if you end up
>> mixing different versions of Hadoop.
>>
>> On Fri, Jun 1, 2018 at 4:01 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>> > I was able to successfully launch a Spark cluster on EC2 at 2.3.1 RC4
>> > using
>> > Flintrock. However, trying to load the hadoop-aws package gave me some
>> > errors.
>> >
>> > $ pyspark --packages org.apache.hadoop:hadoop-aws:2.8.4
>> >
>> > 
>> >
>> > :: problems summary ::
>> >  WARNINGS
>> > [NOT FOUND  ]
>> > com.sun.jersey#jersey-json;1.9!jersey-json.jar(bundle) (2ms)
>> >  local-m2-cache: tried
>> >
>> >
>> > file:/home/ec2-user/.m2/repository/com/sun/jersey/jersey-json/1.9/jersey-json-1.9.jar
>> > [NOT FOUND  ]
>> > com.sun.jersey#jersey-server;1.9!jersey-server.jar(bundle) (0ms)
>> >  local-m2-cache: tried
>> >
>> >
>> > 

Spark model serving

2018-06-26 Thread Saikat Kanjilal
HoldenK and interested folks,
I am just following up on the Spark model serving discussions, as this is highly
relevant to what I’m embarking on at work. Is there a concrete list of next
steps, or can someone summarize what was discussed at the summit? I would love
to have a Seattle version of this discussion with some folks.

Look forward to hearing back and driving this.

Regards 

Sent from my iPhone
-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org