[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15214576#comment-15214576 ]

Steve Loughran commented on SPARK-7481:

I've created a pull request for this, which has
# a new module (tentative name, `spark-cloud`) with transitive dependencies on the Hadoop and Amazon/Microsoft JARs
# a dependency in the Spark assembly on that module and the Hadoop JARs, *excluding the Amazon/Microsoft JARs*.

This reinstates s3n and adds s3a, swift and (Hadoop 2.7+) wasb support. For s3a and wasb, you will need to add the external JAR during job submission.

> Add Hadoop 2.6+ profile to pull in object store FS accessors
>
> Key: SPARK-7481
> URL: https://issues.apache.org/jira/browse/SPARK-7481
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 1.3.1
> Reporter: Steve Loughran
>
> To keep the s3n classpath right, to add s3a, swift & azure, the dependencies
> of spark in a 2.6+ profile need to add the relevant object store packages
> (hadoop-aws, hadoop-openstack, hadoop-azure)
> this adds more stuff to the client bundle, but will mean a single spark
> package can talk to all of the stores.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15214574#comment-15214574 ]

Apache Spark commented on SPARK-7481:

User 'steveloughran' has created a pull request for this issue: https://github.com/apache/spark/pull/12004
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197451#comment-15197451 ]

Nicholas Chammas commented on SPARK-7481:

(Sorry Steve; can't comment on your proposal since I don't know much about these kinds of build decisions.)

Just to add some more evidence to the record that this problem appears to affect many people, take a look at this: http://stackoverflow.com/search?q=%5Bapache-spark%5D+S3+Hadoop+2.6

Lots of confusion about how to access S3, with the recommended solution as before being to [use Spark built against Hadoop 2.4|http://stackoverflow.com/a/30852341/877069].
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197165#comment-15197165 ]

Steve Loughran commented on SPARK-7481:

...thinking some more about this.

How about
# adding a {{spark-cloud}} module which, initially, does nothing but declare the dependencies on {{hadoop-aws}}, {{hadoop-openstack}}, and, on 2.7+, {{hadoop-azure}}.
# having the spark assembly declare a dependency on this module, but explicitly excluding all dependencies other than the hadoop ones (i.e. no amazon libs, no extra httpclient ones for openstack (if there are any), nothing that azure wants).

If someone wants to add the relevant amazon libs, they need to add them explicitly with the {{--jars}} option.

Doing it this way means that if a project depends on {{spark-cloud}} it gets all the cloud dependencies that version of spark+hadoop needs. It also provides a placeholder for explicit cloud support, specifically
- output committers that don't try to rename, or assume that directory delete is atomic and O(1)
- some optional tests/examples to read/write data. The tests would be good not just for spark, but for catching regressions in hadoop/aws/azure code.

If people think this is good, assign it to me and I'll look at it in April.
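The two-step proposal in the comment above might look roughly like the following Maven fragments. This is only a sketch under stated assumptions: the artifact ID {{spark-cloud_2.11}}, the {{${hadoop.version}}} and {{${project.version}}} properties, and the choice of which Amazon artifact to exclude are illustrative and are not taken from the pull request.

```xml
<!-- Fragment 1: dependencies of a hypothetical spark-cloud/pom.xml.
     The module does nothing except declare the Hadoop object store modules. -->
<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aws</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-openstack</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <!-- hadoop-azure would be added here only when building against Hadoop 2.7+ -->
</dependencies>

<!-- Fragment 2: dependency declared by the assembly POM (artifact ID is hypothetical).
     The Hadoop classes come along; the third-party Amazon SDK is excluded. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-cloud_2.11</artifactId>
  <version>${project.version}</version>
  <exclusions>
    <exclusion>
      <groupId>com.amazonaws</groupId>
      <artifactId>aws-java-sdk</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

With a layout like this, a project that depends on the module directly gets the full transitive set, while the assembly stays small and users supply the vendor JAR themselves with {{--jars}}.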
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177922#comment-15177922 ]

Steve Loughran commented on SPARK-7481:

For comparison, the full AWS SDK is 13MB; the s3 SDK is 570K, so it is something that could possibly be added. But adding it does set up an implicit commitment to keep it there, and would lead to discussion about why not azure, google gfs, and so on. Saying "add the aws-s3-sdk JAR if you want it" avoids making any such commitment.
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177918#comment-15177918 ]

Steve Loughran commented on SPARK-7481:

Longer term, having spark_home/lib/*.jar is the best, general purpose solution.

For now, {{hadoop-aws}} can be added to the existing 2.6 profile, explicitly excluding the full amazon SDK jar. This would give s3n back to the code. Jets3t is still in the spark-assembly JAR today.

If built with 2.6.x, you'd get s3n and, if you added the full aws-SDK JAR with --addjars, S3a support.

If you built with 2.7.x (e.g. {{-Dhadoop.version=2.7.2}}) you'd get s3n, s3a and, implicitly, the (much smaller) {{amazon-s3-sdk}} JAR needed to talk with S3. Users wouldn't need to add the amazon-aws-sdk.jar to the submission (it would cause link problems if they tried).

..Or, to keep the assembly JAR small, {{amazon-s3-sdk}} could also be excluded. This would add the ASF classes, but you'd always need to add the right JAR for the hadoop version you compiled against (Amazon changed a parameter from an int to a long in a method, see)
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177623#comment-15177623 ]

Steve Loughran commented on SPARK-7481:

Hadoop 2.6 added S3a, which we put into a new hadoop-tools/hadoop-aws JAR, along with a dependency on Amazon's {{aws-java-sdk}}. Someone other than myself went and moved the existing S3n classes into the same JAR. If I'd seen that, I'd have -1'd it, but I didn't notice until 2.6 shipped.

As stated, I wouldn't use S3a in Hadoop 2.6.x. HADOOP-11571 contains the reasons. It wasn't until Hadoop 2.7 that it became ready for serious use.

Both come in hadoop-aws; s3a needs an amazon JAR, which must be matched precisely with the version used in the hadoop library.
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176559#comment-15176559 ]

Nicholas Chammas commented on SPARK-7481:

I'm not comfortable working with Maven so I can't comment on the details of the approach we should take, but I will appreciate any progress towards making Spark built against Hadoop 2.6+ work with S3 out of the box, or as close to out of the box as possible.

Given Spark's close relation to S3 and EC2 (as far as Spark's user base is concerned), a good out of the box experience here is critical. Many people just expect it.
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176551#comment-15176551 ]

Nicholas Chammas commented on SPARK-7481:

{quote}
One issue here is that hadoop 2.6's hadoop-aws pulls in the whole AWS toolkit, which is pretty weighty, for s3a ... which isn't something I'd use in 2.6 anyway.
{quote}

Did you mean something other than s3a here?
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175664#comment-15175664 ]

Steve Loughran commented on SPARK-7481:

One issue here is that hadoop 2.6's hadoop-aws pulls in the whole AWS toolkit, which is pretty weighty, for s3a ... which isn't something I'd use in 2.6 anyway. Hadoop 2.7 moved to the (link-time-incompatible) amazon-s3 JAR, and also adds hadoop-azure with some wasb JAR. And from Hadoop 2.7 onwards, s3a is the one I would use in preference to s3n.

What might work is a hadoop 2.6 profile which explicitly adds hadoop-aws, then excludes the amazon sdk
{code}
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk</artifactId>
<scope>compile</scope>
{code}

This would automatically pick up the {{aws-java-sdk-s3}} JAR on a 2.7+ build, because it's not excluded by name. Though then there's fun if you try to add the {{aws-java-sdk-s3}} JAR needed for Hadoop 2.6 to the classpath, as it won't link. Which makes me think that excluding {{aws-java-sdk-s3}} would be safer. The hadoop code to talk to s3a and s3n would be there, s3n would work as well/badly as it always does, and for s3a you'd need to add the right aws JAR for your hadoop version.
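For readers unfamiliar with Maven exclusions, the "add hadoop-aws, exclude the amazon sdk" idea in the comment above could be written along these lines. It is a hedged sketch, not the change that was eventually made: the {{${hadoop.version}}} property is an assumption, and both Amazon artifacts are excluded here to follow the "safer" variant described at the end of the comment.

```xml
<!-- Sketch only: pull in the Hadoop S3 filesystem classes but leave out
     the Amazon SDK JARs, so users add the one matching their Hadoop version. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-aws</artifactId>
  <version>${hadoop.version}</version>
  <scope>compile</scope>
  <exclusions>
    <!-- the full SDK named in the comment for Hadoop 2.6 -->
    <exclusion>
      <groupId>com.amazonaws</groupId>
      <artifactId>aws-java-sdk</artifactId>
    </exclusion>
    <!-- the smaller S3-only SDK; excluding it too avoids picking it up by accident -->
    <exclusion>
      <groupId>com.amazonaws</groupId>
      <artifactId>aws-java-sdk-s3</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

Excluding by name means each exclusion applies only to that exact artifact, which is why the comment notes that excluding {{aws-java-sdk}} alone would still let {{aws-java-sdk-s3}} through on a newer build.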
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174442#comment-15174442 ]

Peng Cheng commented on SPARK-7481:

+1 Me four
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174438#comment-15174438 ]

Nicholas Chammas commented on SPARK-7481:

Many people seem to be downgrading to use Spark built against Hadoop 2.4 because the Spark / Hadoop 2.6 package doesn't work against S3 out of the box.
* [Example 1|https://issues.apache.org/jira/browse/SPARK-7442?focusedCommentId=14582965=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14582965]
* [Example 2|https://issues.apache.org/jira/browse/SPARK-7442?focusedCommentId=14903750=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14903750]
* [Example 3|https://github.com/nchammas/flintrock/issues/88#issuecomment-190905262]

If this proposal eliminates that bit of friction for users without being too burdensome on the team, then I'm for it. Ideally, we want people using Spark built against the latest version of Hadoop anyway, right? This proposal would nudge people in that direction.
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15156101#comment-15156101 ]

Yardena commented on SPARK-7481:

+1, please add this. lib/* approach would be great, or a profile like initially suggested (which is what we do manually right now). Thanks.
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126175#comment-15126175 ]

Steve Loughran commented on SPARK-7481:

Having a lib/* would be fantastic, as it stops spark having to worry about the details, or explain to users what they have to do. Whoever wants to use WASB, google fs or s3a would have to put in the relevant JARs, both hadoop ones and third party, but they could either do that themselves or spark/bigtop could add the profile.
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15125566#comment-15125566 ]

Marcelo Vanzin commented on SPARK-7481:

It doesn't necessarily affect this proposal. It would make it easier to have people add these separately - just drop the jars in Spark's "lib" directory and suddenly they're part of Spark. But if you don't add the dependency explicitly in Spark's build, they'll not be included in Spark's packaging, so there would still be a manual step to add support for those backends.
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15125143#comment-15125143 ]

Josh Rosen commented on SPARK-7481:

How does this proposal change if we just remove the assembly and ship a folder of JARs, as has been proposed elsewhere by [~vanzin]? Does that render this proposal moot?
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085477#comment-15085477 ]

Steve Loughran commented on SPARK-7481:

Josh, there is a 2.6 profile, but all it currently does is bump up the dependencies of other things (jets3t, curator, etc). It doesn't pull in hadoop-aws, which is where the s3a, s3n stuff lives, or the amazon JAR which is needed for s3a to work (the fact that s3n moved to the new JAR was something somebody else did; I'd have probably vetoed it if I'd noticed).

The amazon JAR in Hadoop 2.6, `aws-java-sdk`, is huge, and not something you'd want in the spark assembly. Hadoop 2.7+ has switched to the leaner aws-java-sdk-s3; HADOOP-12269 has shown how that's been a bit brittle over versions.

Pulling all the amazon SDK bits into the assembly jar is something that could be done if targeting Hadoop 2.7+, but you'd need care to make sure that the exact amazon lib that Hadoop was built against is used.

It'd be easier if
# `bin/spark-class` (and transitively, things like the yarn launcher) grabbed *.jar from the Spark lib dir, so all people would need to do is drop in the appropriate aws JAR (or for azure, the MSFT azure JAR)
# the 2.6 profile added hadoop-aws to the dependencies of the spark assembly (and hadoop-openstack)
# a 2.7 profile added hadoop-azure

That is: the hadoop code is used (all fairly thin), but the third party JARs are left out.

This would mean the assembly had all the Hadoop stuff, and all people needed to do was drop the external JARs into the lib directory.

What do you think?
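The profile split suggested in the comment above (object store modules from 2.6, {{hadoop-azure}} only from 2.7) could be laid out roughly as below. Treat it as a sketch: the profile IDs {{hadoop-2.6}} and {{hadoop-2.7}} and the {{${hadoop.version}}} property are assumptions, and the exclusions for the third-party JARs are left as a comment because the Microsoft JAR's coordinates are not given in this thread.

```xml
<profiles>
  <!-- Assumed profile ID; adds only the thin Hadoop-side filesystem clients. -->
  <profile>
    <id>hadoop-2.6</id>
    <dependencies>
      <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-aws</artifactId>
        <version>${hadoop.version}</version>
        <!-- exclusions for the Amazon SDK would go here -->
      </dependency>
      <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-openstack</artifactId>
        <version>${hadoop.version}</version>
      </dependency>
    </dependencies>
  </profile>

  <!-- Assumed profile ID; hadoop-azure does not exist before Hadoop 2.7. -->
  <profile>
    <id>hadoop-2.7</id>
    <dependencies>
      <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-azure</artifactId>
        <version>${hadoop.version}</version>
        <!-- exclusions for the Microsoft storage JAR would go here -->
      </dependency>
    </dependencies>
  </profile>
</profiles>
```

Keeping {{hadoop-azure}} in its own profile avoids a build failure against Hadoop 2.6, where that artifact cannot be resolved.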
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083646#comment-15083646 ]

Josh Rosen commented on SPARK-7481:

Hey, is this task done? I see that we have a {{hadoop2.6}} profile now.
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14703025#comment-14703025 ]

Thomas Demoor commented on SPARK-7481:

[~srowen] and [~pc...@uowmail.edu.au]: with HADOOP-12269 merged in, s3a only needs aws-sdk-core, aws-sdk-kms and aws-sdk-s3, with a combined size of ~1.4MB (down from 11.5MB), updated dependencies, no Kinesis, ...
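As an illustration of what a user-side build might declare once s3a only needs the three smaller SDK components named above, here is a hedged Maven sketch. The {{aws.sdk.version}} property is a hypothetical name, and the value 1.10.6 is simply the SDK version mentioned elsewhere in this thread for HADOOP-12269; whatever version is used has to match the one the Hadoop release was built against.

```xml
<properties>
  <!-- Hypothetical property; must match the SDK version of the Hadoop build in use. -->
  <aws.sdk.version>1.10.6</aws.sdk.version>
</properties>

<dependencies>
  <!-- The S3 client itself. -->
  <dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk-s3</artifactId>
    <version>${aws.sdk.version}</version>
  </dependency>
  <!-- Listed explicitly for clarity; they would normally also arrive
       transitively through aws-java-sdk-s3. -->
  <dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk-kms</artifactId>
    <version>${aws.sdk.version}</version>
  </dependency>
  <dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk-core</artifactId>
    <version>${aws.sdk.version}</version>
  </dependency>
</dependencies>
```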
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14703585#comment-14703585 ]

Peng Cheng commented on SPARK-7481:

Thanks a lot! A long run to the end.
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647406#comment-14647406 ]

Thomas Demoor commented on SPARK-7481:

Pulled the aws-upgrade out of HADOOP-11684 to a separate issue HADOOP-12269. Only uses aws-sdk-s3-1.10.6 instead of the entire sdk.
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14563033#comment-14563033 ]

Thomas Demoor commented on SPARK-7481:

[~srowen] and [~ste...@apache.org], for s3a, things should improve in future Hadoop versions. I have a first patch set up for [HADOOP-11684] that also bumps the aws-sdk version to a recent version. From 1.9 onwards, you can pull in individual components separately. For s3a, we only need s3 (and evidently the core lib) which solves both the large size and the kinesis issue.
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557549#comment-14557549 ]
Peng Cheng commented on SPARK-7481:
I've tried to do it but I get a lot of headaches, as the aws toolkit is using an outdated jackson library. Still, this feature is indeed blocking me from upgrading to hadoop 2.6, so I guess it's important.
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534168#comment-14534168 ]
Sean Owen commented on SPARK-7481:
Yikes, that seems like a load of stuff to pull in. Can't this / shouldn't this be added by the end user if desired?
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534174#comment-14534174 ]
Steve Loughran commented on SPARK-7481:
This doesn't contain any endorsement of the use of s3a in Hadoop 2.6; see HADOOP-11571.
I'm not planning to add any tests for this, but it's something to consider for regression testing all the object stores. The tests just need to:
* be skipped if there are no credentials
* make a best effort to stop anyone accidentally checking in their credentials
* work on desktop/jenkins rather than just in the cloud
* not run up massive bills
* not take forever
AWS publishes some free-to-read datasets, such as [this one|http://datasets.elasticmapreduce.s3.amazonaws.com/], which won't need credentials, work remotely and don't ring up bills for the read part of the process, but would take a long time to complete on a single executor.
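The first two test requirements in the comment above (skip when there are no credentials, and avoid leaking them) could be sketched roughly as follows. This is only an illustration in Python, not anything from the Spark or Hadoop test suites: the helper names `have_credentials` and `redact` are hypothetical, and reading the keys from the standard AWS environment variables is an assumption of the sketch.

```python
import os
import unittest

# Standard AWS environment variable names. Using them as the only
# credential source for the tests is an assumption of this sketch.
REQUIRED_ENV = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def have_credentials(env=None):
    """Return True only if every required variable is set and non-empty."""
    env = os.environ if env is None else env
    return all(env.get(name) for name in REQUIRED_ENV)


def redact(secret):
    """Mask a secret so it never appears in logs or failure messages."""
    return "****" if secret else "<unset>"


class ObjectStoreSmokeTest(unittest.TestCase):
    # Skipped (not failed) on desktops and CI hosts without credentials.
    @unittest.skipUnless(have_credentials(), "no object store credentials set")
    def test_credentials_present(self):
        secret = os.environ["AWS_SECRET_ACCESS_KEY"]
        # Only the redacted form is ever placed in an assertion message.
        self.assertTrue(secret, "secret is " + redact(secret))
```

Keeping the keys in the environment rather than in a checked-in configuration file is one way to make an accidental commit of credentials less likely; it does not address the cost or run-time requirements.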
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534210#comment-14534210 ]
Sean Owen commented on SPARK-7481:
Maybe I'd be less frightened if I knew the size of these deps and their dependencies was small, and the licenses were all OK, etc. This would need some checking; I know we had a license problem and so forth with Kinesis, and have had jets3t problems, etc. I am maybe needlessly wary of doing this several times over to add more niche FS clients to the main build for everyone.
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534622#comment-14534622 ]
Steve Loughran commented on SPARK-7481:
* hadoop-openstack: 100K, plus httpclient (400K)
* hadoop-aws: 85K, plus jets3t (500K)
* s3a needs the aws toolkit @ 11.5MB, so it's the big one
* azure is 500K
To retain s3n in spark, the hadoop-aws and jets3t dependencies need to go in; s3a is a fairly large addition.
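To put the figures quoted above side by side, here is a small Python sketch that totals them. The sizes are the ones given in the comment; treating 11.5MB as 11.5 * 1024 KB, reading "jetset" as the jets3t library, and the dictionary labels are assumptions of the sketch.

```python
# Sizes as quoted in the comment, in kilobytes.
# 11.5 MB is taken as 11.5 * 1024 KB (an assumption about the unit).
SIZES_KB = {
    "hadoop-openstack": 100,
    "httpclient": 400,
    "hadoop-aws": 85,
    "jets3t": 500,
    "aws toolkit (s3a)": 11.5 * 1024,
    "azure": 500,
}


def total_kb(names):
    """Sum the quoted sizes for the given dependency names."""
    return sum(SIZES_KB[name] for name in names)


# What is needed just to keep s3n working.
s3n_only_kb = total_kb(["hadoop-aws", "jets3t"])

# Everything listed in the comment.
everything_kb = total_kb(SIZES_KB)

# Fraction of the total contributed by the aws toolkit alone.
aws_share = SIZES_KB["aws toolkit (s3a)"] / everything_kb

print(f"s3n only: {s3n_only_kb} KB")
print(f"all stores: {everything_kb / 1024:.1f} MB")
print(f"aws toolkit share: {aws_share:.0%}")
```

On these numbers, keeping s3n costs 585K, while the aws toolkit accounts for most of the roughly 13MB total, which is the point the comment makes about s3a.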