Re: RepartitionByKey Behavior
Thanks everyone. As Nathan suggested, I ended up collecting the distinct keys first and then assigning IDs to each key explicitly.

Regards
Sumit Chawla

On Fri, Jun 22, 2018 at 7:29 AM, Nathan Kronenfeld <nkronenfeld@uncharted.software> wrote:

> On Thu, Jun 21, 2018 at 4:51 PM, Chawla,Sumit wrote:
>> Hi
>>
>> I have been trying to do this simple operation. I want to land all values
>> with one key in the same partition, and not have any different key in the
>> same partition. Is this possible? I am getting b and c always getting
>> mixed up in the same partition.
>
> I think you could do something approximately like:
>
>     val keys = rdd.map(_.getKey).distinct.zipWithIndex
>     val numKeys = keys.count
>     rdd.map(r => (r.getKey, r)).join(keys)
>        .map { case (_, (r, index)) => (index, r) }
>        .partitionBy(new Partitioner {
>          def numPartitions = numKeys.toInt
>          def getPartition(key: Any) = key.asInstanceOf[Long].toInt
>        })
>
> i.e., key by a unique number, count that, and repartition by key to the
> exact count. This presumes, of course, that the number of keys is small
> enough that one partition per key is reasonable. Also, I haven't tested
> this code, so don't take it as anything more than an approximate idea,
> please :-)
>
> -Nathan Kronenfeld
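The idea in the thread (give each distinct key a stable index, then use that index as the partition number) can be sketched outside Spark so the logic is easy to check. This is a plain-Python illustration, not Spark API code; in a real job the partition function would live in a custom `Partitioner`, and the helper names here are made up:

```python
# Sketch of one-partition-per-key assignment, without Spark.
# Helper names are illustrative only.

def index_keys(records):
    """Mirror rdd.map(_.getKey).distinct.zipWithIndex: give each
    distinct key a stable integer index, in first-seen order."""
    index = {}
    for key in records:
        if key not in index:
            index[key] = len(index)
    return index

def partition_of(key, key_index):
    """Mirror Partitioner.getPartition: the key's index *is* its
    partition, so no two distinct keys share a partition."""
    return key_index[key]

keys = index_keys(["a", "b", "a", "c"])
# "a" -> partition 0, "b" -> partition 1, "c" -> partition 2;
# both "a" records land together, and never with "b" or "c".
```

The number of partitions equals the number of distinct keys, which is why this only makes sense when that count is modest.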
[VOTE] Spark 2.1.3 (RC2)
Please vote on releasing the following candidate as Apache Spark version 2.1.3.

The vote is open until Fri, June 29th @ 9PM UTC (2PM PDT) and passes if a majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.1.3
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.1.3-rc2 (commit b7eac07b):
https://github.com/apache/spark/tree/v2.1.3-rc2

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1275/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-docs/

The list of bug fixes going into 2.1.3 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12341660

Notes:

- RC1 was not sent for a vote. I had trouble building it, and by the time I got things fixed, there was a blocker bug filed. It was already tagged in git at that time.

- If testing the source package, I recommend using Java 8, even though 2.1 supports Java 7 (and the RC was built with JDK 7). This is because Maven Central has updated some configuration that makes the default Java 7 SSL config not work.

- There are Maven artifacts published for Scala 2.10, but binary releases are only available for Scala 2.11. This matches the previous release (2.1.2), but if there's a need / desire to have pre-built distributions for Scala 2.10, I can probably amend the RC without having to create a new one.

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking an existing Spark workload and running it on this release candidate, then reporting any regressions.

If you're working in PySpark, you can set up a virtual env and install the current RC to see if anything important breaks. In Java/Scala, you can add the staging repository to your project's resolvers and test with the RC (make sure to clean up the artifact cache before/after so you don't end up building with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 2.1.3?
===

The current list of open tickets targeted at 2.1.3 can be found at:
https://s.apache.org/spark-2.1.3

Committers should look at those and triage. Extremely important bug fixes, documentation, and API tweaks that impact compatibility should be worked on immediately. Everything else, please retarget to an appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the release unless the bug in question is a regression from the previous release. That being said, if there is something which is a regression that has not been correctly targeted, please ping me or a committer to help target the issue.

--
Marcelo

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
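A first step when testing any RC is checking that the downloaded artifacts match their published digests (the signatures in the KEYS file are verified separately with gpg). The mechanics of the digest check can be sketched without a network download; the artifact bytes below are a stand-in, not real release bits:

```python
# Sketch of the digest-check step when testing a release candidate.
# A real run would download an artifact and its digest file from
# https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-bin/ and
# check the .asc signature against the KEYS file with gpg; here a
# fabricated payload stands in so only the checksum logic is shown.
import hashlib

def sha512_hex(data: bytes) -> str:
    """Hex SHA-512 of raw artifact bytes."""
    return hashlib.sha512(data).hexdigest()

def digest_matches(data: bytes, published_digest: str) -> bool:
    """Recompute the artifact's SHA-512 and compare it to the
    digest published alongside the release files."""
    return sha512_hex(data) == published_digest.strip().lower()

# Stand-in for downloaded release bits and their published digest.
artifact = b"pretend release bits"
published = sha512_hex(artifact)  # what the digest file would contain
```

Any mismatch means a corrupted or tampered download and is worth reporting on the vote thread.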
Re: [VOTE] Spark 2.1.3 (RC2)
Starting with my own +1.

On Tue, Jun 26, 2018 at 1:25 PM, Marcelo Vanzin wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 2.1.3.
> [...]

--
Marcelo

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
hadoop-aws versions (was Re: [VOTE] Spark 2.3.1 (RC4))
Following up after a ref to this in https://issues.apache.org/jira/browse/HADOOP-15559

The AWS SDK is a very fast-moving project, with a release cycle of ~2 weeks, but it's in the state Fred Brooks described: "the number of bugs is constant, they just move around". Bumping up an AWS release is always fun (https://issues.apache.org/jira/browse/HADOOP-14596), and usually results in 1+ issue being raised with the AWS SDK project, us doing a workaround, and then a later release fixing that while adding something new.

On 2 Jun 2018, at 02:51, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> pyspark --packages org.apache.hadoop:hadoop-aws:2.7.3 didn't work for me
> either (even building with -Phadoop-2.7). I guess I've been relying on an
> unsupported pattern and will need to figure something else out going
> forward in order to use s3a://.

Ideally the ASF releases should be done with the -Phadoop-cloud option, just to get the relevant spark-hadoop-cloud module into the ASF repo, at which point you could just depend on it and get not just the things you need, but none of the things you don't, which is always the second half of the working-with-transitive-dependencies problem.

For Hadoop 2.8.x & Spark 2.3, you'll need to:

* have hadoop-* consistent
* use the aws-sdk for that version: http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.8.2
* revert out the httpclient updates of SPARK-22919
* exclude any declared jackson dependencies of the hadoop-aws/aws-sdk modules (HADOOP-13692)
* make sure joda-time >= 2.8.1 is on the classpath, else you can't authenticate with AWS on a JVM >= 8u51

This is why Hadoop 2.9+ has moved to a (very fat) shaded AWS SDK JAR; you only need to keep the hadoop-* and aws-sdk-bundle JARs in sync, at least provided the shaded JAR doesn't actually declare things (https://issues.apache.org/jira/browse/HADOOP-15264). We feel that pain too.

Anyway, sorry to hear of your suffering.
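The first bullet above, "have hadoop-* consistent", is the check that catches most of the nasty classpath problems described in this thread. A toy sketch of that sanity check (coordinates and versions below are illustrative, not a recommendation):

```python
# Sketch of the "keep hadoop-* consistent" check: given
# (group:artifact, version) pairs on a classpath, collect the distinct
# versions used by org.apache.hadoop artifacts. More than one entry
# means a mixed (and likely broken) Hadoop classpath.

def hadoop_versions(deps):
    """Distinct versions of org.apache.hadoop artifacts in deps."""
    return {v for (ga, v) in deps if ga.startswith("org.apache.hadoop:")}

# Illustrative dependency list; the aws-java-sdk version that pairs
# with a given hadoop-aws release must be looked up, not guessed.
deps = [
    ("org.apache.hadoop:hadoop-common", "2.8.2"),
    ("org.apache.hadoop:hadoop-aws", "2.8.2"),
    ("com.amazonaws:aws-java-sdk-s3", "1.10.6"),
]
# hadoop_versions(deps) has one element here, so the hadoop-* line-up
# is consistent; adding any hadoop artifact at 2.7.x would flag it.
```

The same idea extends to the other bullets (matching aws-sdk, excluded jackson, joda-time floor), but those are per-version facts best taken from the hadoop-aws POM for the release in question.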
Nicholas, ping me direct if you are trying to debug things here.

-steve

On Fri, Jun 1, 2018 at 9:09 PM Marcelo Vanzin <van...@cloudera.com> wrote:

I have personally never tried to include hadoop-aws that way. But at the very least, I'd try to use the same version of Hadoop as the Spark build (2.7.3 IIRC). I don't really expect a different version to work, and if it did in the past it definitely was not by design.

On Fri, Jun 1, 2018 at 5:50 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
> Building with -Phadoop-2.7 didn't help, and if I remember correctly,
> building with -Phadoop-2.8 worked with hadoop-aws in the 2.3.0 release, so
> it appears something has changed since then.
>
> I wasn't familiar with -Phadoop-cloud, but I can try that.
>
> My goal here is simply to confirm that this release of Spark works with
> hadoop-aws like past releases did, particularly for Flintrock users who use
> Spark with S3A.
>
> We currently provide -hadoop2.6, -hadoop2.7, and -without-hadoop builds with
> every Spark release. If the -hadoop2.7 release build won't work with
> hadoop-aws anymore, are there plans to provide a new build type that will?
>
> Apologies if the question is poorly formed. I'm batting a bit outside my
> league here. Again, my goal is simply to confirm that I/my users still have
> a way to use s3a://. In the past, that way was simply to call pyspark
> --packages org.apache.hadoop:hadoop-aws:2.8.4 or something very similar. If
> that will no longer work, I'm trying to confirm that the change of behavior
> is intentional or acceptable (as a review for the Spark project) and figure
> out what I need to change (as due diligence for Flintrock's users).
>
> Nick
>
> On Fri, Jun 1, 2018 at 8:21 PM Marcelo Vanzin <van...@cloudera.com> wrote:
>>
>> Using the hadoop-aws package is probably going to be a little more
>> complicated than that. The best bet is to use a custom build of Spark
>> that includes it (use -Phadoop-cloud).
>> Otherwise you're probably
>> looking at some nasty dependency issues, especially if you end up
>> mixing different versions of Hadoop.
>>
>> On Fri, Jun 1, 2018 at 4:01 PM, Nicholas Chammas
>> <nicholas.cham...@gmail.com> wrote:
>> > I was able to successfully launch a Spark cluster on EC2 at 2.3.1 RC4
>> > using Flintrock. However, trying to load the hadoop-aws package gave me
>> > some errors.
>> >
>> > $ pyspark --packages org.apache.hadoop:hadoop-aws:2.8.4
>> >
>> > :: problems summary ::
>> > WARNINGS
>> > [NOT FOUND ] com.sun.jersey#jersey-json;1.9!jersey-json.jar(bundle) (2ms)
>> > local-m2-cache: tried
>> > file:/home/ec2-user/.m2/repository/com/sun/jersey/jersey-json/1.9/jersey-json-1.9.jar
>> > [NOT FOUND ] com.sun.jersey#jersey-server;1.9!jersey-server.jar(bundle) (0ms)
>> > local-m2-cache: tried
>> >
Spark model serving
HoldenK and interested folks,

I'm just following up on the Spark model serving discussions, as this is highly relevant to what I'm embarking on at work. Is there a concrete list of next steps, or can someone summarize what was discussed at the summit? I would love to have a Seattle version of this discussion with some folks.

Looking forward to hearing back and driving this.

Regards

Sent from my iPhone

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org