A confusing ClassNotFoundException error
Hi all, I'm hitting an error on Spark 1.4.0 and have reduced it to the example below. Both snippets run fine in spark-shell, but the second one fails when run via spark-submit. The only difference is that the second snippet uses a function literal inside map(), while the first uses a named function. Could anyone tell me why this error happens? Thanks.

== the OK code ==

val sparkConf = new SparkConf().setAppName("Mode Example")
val sc = new SparkContext(sparkConf)
val hive = new HiveContext(sc)
val rdd = sc.parallelize(Range(0, 1000))
def fff = (input: Int) => input + 1
val cnt = rdd.map(i => Array(1, 2, 3)).map(arr => { arr.map(fff) }).count()
print(cnt)

== the not-OK code ==

val sparkConf = new SparkConf().setAppName("Mode Example")
val sc = new SparkContext(sparkConf)
val hive = new HiveContext(sc)
val rdd = sc.parallelize(Range(0, 1000))
// def fff = (input: Int) => input + 1
val cnt = rdd.map(i => Array(1, 2, 3)).map(arr => { arr.map(_ + 1) }).count()
print(cnt)

Submit command:

spark-submit --class com.yhd.ycache.magic.Model --jars ./SSExample-0.0.2-SNAPSHOT-jar-with-dependencies.jar --master local ./SSExample-0.0.2-SNAPSHOT-jar-with-dependencies.jar

The error info:

Exception in thread "main" java.lang.ClassNotFoundException: com.yhd.ycache.magic.Model$$anonfun$10$$anonfun$apply$1
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:270)
    at org.apache.spark.util.InnerClosureFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:455)
    at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown Source)
    at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown Source)
    at org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:101)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:197)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:294)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:293)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
    at org.apache.spark.rdd.RDD.map(RDD.scala:293)
    at com.yhd.ycache.magic.Model$.main(SSExample.scala:238)
    at com.yhd.ycache.magic.Model.main(SSExample.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
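For anyone trying to reproduce this: one quick sanity check (only a diagnostic suggestion, not a known fix) is whether the anonymous-function class named in the exception actually ended up in the submitted jar, e.g. with jar tf SSExample-0.0.2-SNAPSHOT-jar-with-dependencies.jar | grep 'Model\$\$anonfun'. If the class is missing there, it would point at how the jar is built rather than at Spark's closure cleaner.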
Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?
Scala KafkaRDD uses a trait to handle this problem, but it is not so easy and straightforward in Python, where we would need a specific API to handle this. I'm not sure whether there is any simple workaround; maybe we should think carefully about it.

2015-06-12 13:59 GMT+08:00 Amit Ramesh a...@yelp.com:

Thanks, Jerry. That's what I suspected based on the code I looked at. Any pointers on what is needed to build in this support would be great. This is critical to the project we are currently working on. Thanks!

On Thu, Jun 11, 2015 at 10:54 PM, Saisai Shao sai.sai.s...@gmail.com wrote:

OK, I get it. I think the Python-based Kafka direct API currently does not provide such an equivalent to the Scala one; maybe we should figure out how to add this to the Python API as well.

2015-06-12 13:48 GMT+08:00 Amit Ramesh a...@yelp.com:

Hi Jerry, Take a look at this example: https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2 The offsets are needed because as RDDs get generated within Spark, the offsets move further along. With direct Kafka mode the current offsets are no longer persisted in Zookeeper but rather within Spark itself. If you want to be able to use Zookeeper-based monitoring tools to keep track of progress, then this is needed. In my specific case we need to persist Kafka offsets externally so that we can continue from where we left off after a code deployment. In other words, we need exactly-once processing guarantees across code deployments. Spark does not support any state persistence across deployments, so this is something we need to handle on our own. Hope that helps. Let me know if not. Thanks! Amit

On Thu, Jun 11, 2015 at 10:02 PM, Saisai Shao sai.sai.s...@gmail.com wrote:

Hi, What do you mean by getting the offsets from the RDD? From my understanding, the offsetRange is a parameter you supplied to KafkaRDD, so why do you still want to get back the one you set previously? Thanks, Jerry

2015-06-12 12:36 GMT+08:00 Amit Ramesh a...@yelp.com:

Congratulations on the release of 1.4! I have been trying out the direct Kafka support in Python but haven't been able to figure out how to get the offsets from the RDD. Looks like the documentation is yet to be updated to include Python examples (https://spark.apache.org/docs/latest/streaming-kafka-integration.html). I am specifically looking for the equivalent of https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2. I tried digging through the Python code but could not find anything related. Any pointers would be greatly appreciated. Thanks! Amit
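For reference, the Scala pattern being discussed (the tab_scala_2 example linked above) looks roughly like the sketch below; the broker address, topic name, and batch interval are placeholders, and the SparkConf/StreamingContext setup is only there to keep it self-contained:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val ssc = new StreamingContext(new SparkConf().setAppName("OffsetsSketch"), Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // placeholder broker
val topics = Set("mytopic")                                     // placeholder topic

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.foreachRDD { rdd =>
  // Each batch RDD from the direct stream carries the Kafka offset ranges it was built from.
  val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsets.foreach(o => println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}"))
}

It is this HasOffsetRanges cast that currently has no counterpart in the Python API.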
Re: Contributing to pyspark
1. Yes, because the issues are in JIRA. 2. Nope (at least as far as MLlib is concerned), because most of it consists of wrappers around the underlying Scala functions or methods and is not implemented in pure Python. 3. I'm not sure about this. It seems to work fine for me! HTH

On Fri, Jun 12, 2015 at 10:41 AM, Usman Ehtesham Gul uehtesha...@gmail.com wrote:

Hello Manoj, First of all thank you for the quick reply. Just a couple more things. I have started reading the link you provided; I will definitely filter JIRA with the PySpark label. Can you verify: 1) We fork from GitHub, right? I ask because on GitHub I see it's mirrored and there is no issues section; I assume that is because issues are handled in JIRA. 2) To contribute to PySpark, we will have to clone the whole project. But if our changes/contributions are only specific to PySpark, we can make those without touching core Spark and the other client libraries, right? 3) I think the email u...@spark.apache.org is broken. I am getting email from mailer-dae...@apache.org that mail could not be sent to this address. Can you check this? Thank you again. Hope to hear from you soon. Usman

On Jun 12, 2015, at 12:57 AM, Manoj Kumar manojkumarsivaraj...@gmail.com wrote:

Hi, Thanks for your interest in PySpark. The first thing is to have a look at the how-to-contribute guide https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark and filter the JIRAs using the label PySpark. If you have your own improvement in mind, you can file a JIRA, discuss it, and then send a pull request. HTH Regards.

On Fri, Jun 12, 2015 at 9:36 AM, Usman Ehtesham uehtesha...@gmail.com wrote:

Hello, I am currently taking a course in Apache Spark via EdX (https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x) and at the same time I am trying to look at the code for PySpark too. I wanted to ask: if I would ideally like to contribute to PySpark specifically, how can I do that? I do not intend to contribute to core Apache Spark any time soon (mainly because I do not know Scala), but I am very comfortable in Python. Any tips on how to contribute specifically to PySpark without being affected by other parts of Spark would be greatly appreciated. P.S.: I ask this because there is a small change/improvement I would like to propose. Also, since I just started learning Spark, I would like to read and understand the PySpark code as I learn about Spark. :) Hope to hear from you soon. Usman Ehtesham Gul https://github.com/ueg1990

-- Godspeed, Manoj Kumar, http://manojbits.wordpress.com http://github.com/MechCoder

-- Godspeed, Manoj Kumar, http://manojbits.wordpress.com http://github.com/MechCoder
Re: How to support dependency jars and files on HDFS in standalone cluster mode?
Would you mind filing a JIRA for this? Thanks! Cheng

On 6/11/15 2:40 PM, Dong Lei wrote:

I think in standalone cluster mode, Spark is supposed to:
1. Download jars and files to the driver
2. Set the driver's classpath
3. Have the driver set up an HTTP file server to distribute these files
4. Have workers download from the driver and set up their classpath
Right? But somehow, the first step fails. Even if I can make the first step work (using option 1), it seems that the classpath in the driver is still not set correctly. Thanks Dong Lei

From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: Thursday, June 11, 2015 2:32 PM To: Dong Lei Cc: Dianfei (Keith) Han; dev@spark.apache.org Subject: Re: How to support dependency jars and files on HDFS in standalone cluster mode?

Oh sorry, I mistook --jars for --files. Yeah, for jars we need to add them to the classpath, which is different from regular files. Cheng

On 6/11/15 2:18 PM, Dong Lei wrote:

Thanks Cheng, If I do not use --jars, how can I tell Spark to look for the jars (and files) on HDFS? Do you mean the driver will not need to set up an HTTP file server for this scenario, and the workers will fetch the jars and files from HDFS? Thanks Dong Lei

From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: Thursday, June 11, 2015 12:50 PM To: Dong Lei; dev@spark.apache.org Cc: Dianfei (Keith) Han Subject: Re: How to support dependency jars and files on HDFS in standalone cluster mode?

Since the jars are already on HDFS, you can access them directly in your Spark application without using --jars. Cheng

On 6/11/15 11:04 AM, Dong Lei wrote:

Hi spark-dev: I cannot use an HDFS location for the "--jars" or "--files" option when doing a spark-submit in standalone cluster mode. For example: spark-submit … --jars hdfs://ip/1.jar …. hdfs://ip/app.jar (standalone cluster mode) will not download 1.jar to the driver's HTTP file server (but app.jar will be downloaded to the driver's directory). I figured out that the reason Spark does not download the jars is that when adding them to the HTTP file server via sc.addJar, the function called is Files.copy, which does not support a remote location. And I think that even if Spark could download the jars and add them to the HTTP file server, the classpath would still not be set correctly, because it contains the remote locations. So I'm trying to make this work and have come up with two options, but neither of them seems elegant, and I want to hear your advice: Option 1: Modify HTTPFileServer.addFileToDir to recognize an "hdfs" prefix. This is not good because I think it breaks the scope of the HTTP file server. Option 2: Modify DriverRunner.downloadUserJar to download all the "--jars" and "--files" along with the application jar. This sounds more reasonable than option 1 for downloading files. But this way I need to read "spark.jars" and "spark.files" in downloadUserJar or DriverRunner.start and replace them with local paths. How can I do that? Do you have a more elegant solution, or is there a plan to support this in the future? Thanks Dong Lei
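For what it's worth, the direct-access route Cheng mentions would look something like the sketch below (hdfs://ip/1.jar is just the placeholder path from this thread). Note that sc.addJar ships the jar to executors for task execution but, as far as I can tell, does not change the driver's own classpath, which seems to be the part that is failing here:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("HdfsJarsSketch"))
// sc.addJar accepts HDFS URIs; the jar is then made available to tasks on the executors.
sc.addJar("hdfs://ip/1.jar")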
Re: Remove Hadoop 1 support (Hadoop 2.2) for Spark 1.5?
+1 for 2.2+ Not only are the APIs in Hadoop 2 better, there are more people testing Spark on Hadoop 2.x, and bugs in Hadoop itself are being fixed. (usual disclaimers, I work off branch-2.7 snapshots I build nightly, etc)

On 12 Jun 2015, at 11:09, Sean Owen so...@cloudera.com wrote:

How does the idea of removing support for Hadoop 1.x for Spark 1.5 strike everyone? Really, I mean, Hadoop 2.2, as 2.2 seems to me more consistent with the modern 2.x line than 2.1 or 2.0. The arguments against are simply, well, someone out there might be using these versions. The arguments for are just simplification -- fewer gotchas in trying to keep supporting older Hadoop, of which we've seen several lately. We get to chop out a little bit of shim code and update to use some non-deprecated APIs. Along with removing support for Java 6, it might be a reasonable time to also draw a line under older Hadoop too. I'm just gauging feeling now: for, against, indifferent? I favor it, but would not push hard on it if there are objections.
Re: When to expect UTF8String?
1. Custom aggregators that do map-side combine.

This is something I'd been hoping to add in Spark 1.5.

2. UDFs with more than 22 arguments, which is not supported by ScalaUdf, and to avoid wrapping a Java function interface in one of 22 different Scala function interfaces depending on the number of parameters.

I'm super open to suggestions here. Mind possibly opening a JIRA with a proposed interface?
RE: When to expect UTF8String?
We are using Expression for two things. 1. Custom aggregators that do map-side combine. 2. UDFs with more than 22 arguments, which is not supported by ScalaUdf, and to avoid wrapping a Java function interface in one of 22 different Scala function interfaces depending on the number of parameters. Are there methods we can use to convert to/from the internal representation in these cases?

From: Michael Armbrust [mich...@databricks.com] Sent: Thursday, June 11, 2015 9:05 PM To: Zack Sampson Cc: dev@spark.apache.org Subject: Re: When to expect UTF8String?

Through the DataFrame API, users should never see UTF8String. Expression (and any class in the catalyst package) is considered internal and so uses the internal representation of various types. Which type we use here is not stable across releases. Is there a reason you aren't defining a UDF instead?

On Thu, Jun 11, 2015 at 8:08 PM, zsampson zsamp...@palantir.com wrote:

I'm hoping for some clarity about when to expect String vs UTF8String when using the Java DataFrames API. In upgrading to Spark 1.4, I'm dealing with a lot of errors where what was once a String is now a UTF8String. The comments in the file and the related commit message indicate that maybe it should be internal to Spark SQL's implementation. However, when I add a column containing a custom subclass of Expression, the row passed to the eval method contains instances of UTF8String. Ditto for AggregateFunction.update. Is this expected? If so, when should I generally know to deal with UTF8String objects?

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/When-to-expect-UTF8String-tp12710.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
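To illustrate the alternative Michael is suggesting: a UDF defined through the public API receives ordinary Scala/Java types rather than the internal representation, so no UTF8String handling is needed. A minimal sketch under the Spark 1.4 DataFrame API (the DataFrame df and the column name "name" are hypothetical):

import org.apache.spark.sql.functions.udf

// The UDF body sees a plain String; Spark SQL converts from its internal UTF8String for us.
val normalize = udf((s: String) => if (s == null) null else s.trim.toLowerCase)

// df is assumed to be an existing DataFrame with a string column called "name".
val withNormalized = df.withColumn("name_normalized", normalize(df("name")))

This does not cover the two cases above (map-side-combine aggregators and >22-argument UDFs), which is presumably why a JIRA with a proposed interface was requested.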
Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?
Hi, If you want I would be happy to work in this. I have worked with KafkaUtils.createDirectStream before, in a pull request that wasn't accepted https://github.com/apache/spark/pull/5367. I'm fluent with Python and I'm starting to feel comfortable with Scala, so if someone opens a JIRA I can take it. Greetings, Juan Rodriguez 2015-06-12 15:59 GMT+02:00 Cody Koeninger c...@koeninger.org: The scala api has 2 ways of calling createDirectStream. One of them allows you to pass a message handler that gets full access to the kafka MessageAndMetadata, including offset. I don't know why the python api was developed with only one way to call createDirectStream, but the first thing I'd look at would be adding that functionality back in. If someone wants help creating a patch for that, just let me know. Dealing with offsets on a per-message basis may not be as efficient as dealing with them on a batch basis using the HasOffsetRanges interface... but if efficiency was a primary concern, you probably wouldn't be using Python anyway. On Fri, Jun 12, 2015 at 1:05 AM, Saisai Shao sai.sai.s...@gmail.com wrote: Scala KafkaRDD uses a trait to handle this problem, but it is not so easy and straightforward in Python, where we need to have a specific API to handle this, I'm not sure is there any simple workaround to fix this, maybe we should think carefully about it. 2015-06-12 13:59 GMT+08:00 Amit Ramesh a...@yelp.com: Thanks, Jerry. That's what I suspected based on the code I looked at. Any pointers on what is needed to build in this support would be great. This is critical to the project we are currently working on. Thanks! On Thu, Jun 11, 2015 at 10:54 PM, Saisai Shao sai.sai.s...@gmail.com wrote: OK, I get it, I think currently Python based Kafka direct API do not provide such equivalence like Scala, maybe we should figure out to add this into Python API also. 2015-06-12 13:48 GMT+08:00 Amit Ramesh a...@yelp.com: Hi Jerry, Take a look at this example: https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2 The offsets are needed because as RDDs get generated within spark the offsets move further along. With direct Kafka mode the current offsets are no more persisted in Zookeeper but rather within Spark itself. If you want to be able to use zookeeper based monitoring tools to keep track of progress, then this is needed. In my specific case we need to persist Kafka offsets externally so that we can continue from where we left off after a code deployment. In other words, we need exactly-once processing guarantees across code deployments. Spark does not support any state persistence across deployments so this is something we need to handle on our own. Hope that helps. Let me know if not. Thanks! Amit On Thu, Jun 11, 2015 at 10:02 PM, Saisai Shao sai.sai.s...@gmail.com wrote: Hi, What is your meaning of getting the offsets from the RDD, from my understanding, the offsetRange is a parameter you offered to KafkaRDD, why do you still want to get the one previous you set into? Thanks Jerry 2015-06-12 12:36 GMT+08:00 Amit Ramesh a...@yelp.com: Congratulations on the release of 1.4! I have been trying out the direct Kafka support in python but haven't been able to figure out how to get the offsets from the RDD. Looks like the documentation is yet to be updated to include Python examples ( https://spark.apache.org/docs/latest/streaming-kafka-integration.html). I am specifically looking for the equivalent of https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2. 
I tried digging through the python code but could not find anything related. Any pointers would be greatly appreciated. Thanks! Amit
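For context, the message-handler variant of the Scala createDirectStream that Cody describes looks roughly like the sketch below; the topic name, partition, and starting offsets are placeholders, and ssc/kafkaParams are assumed to be set up as in the earlier sketch. Each record's offset is available on the MessageAndMetadata passed to the handler:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Placeholder starting offsets; in practice these come from wherever they were persisted.
val fromOffsets = Map(TopicAndPartition("mytopic", 0) -> 0L)

// The handler sees the full MessageAndMetadata for every record, including its offset.
val messageHandler = (mmd: MessageAndMetadata[String, String]) =>
  (mmd.topic, mmd.partition, mmd.offset, mmd.message())

val stream = KafkaUtils.createDirectStream[
  String, String, StringDecoder, StringDecoder, (String, Int, Long, String)](
  ssc, kafkaParams, fromOffsets, messageHandler)

It is this overload that currently has no Python counterpart.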
Contribution
Hi everyone, I am interested in contributing new algorithms and optimizing existing algorithms in the area of graph algorithms and machine learning. Please give me some ideas on where to start. Is it possible for me to introduce the notion of neural networks into Apache Spark?

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Contribution-tp12739.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?
Hi Juan, I have created a ticket for this: https://issues.apache.org/jira/browse/SPARK-8337 Thanks! Amit On Fri, Jun 12, 2015 at 3:17 PM, Juan Rodríguez Hortalá juan.rodriguez.hort...@gmail.com wrote: Hi, If you want I would be happy to work in this. I have worked with KafkaUtils.createDirectStream before, in a pull request that wasn't accepted https://github.com/apache/spark/pull/5367. I'm fluent with Python and I'm starting to feel comfortable with Scala, so if someone opens a JIRA I can take it. Greetings, Juan Rodriguez 2015-06-12 15:59 GMT+02:00 Cody Koeninger c...@koeninger.org: The scala api has 2 ways of calling createDirectStream. One of them allows you to pass a message handler that gets full access to the kafka MessageAndMetadata, including offset. I don't know why the python api was developed with only one way to call createDirectStream, but the first thing I'd look at would be adding that functionality back in. If someone wants help creating a patch for that, just let me know. Dealing with offsets on a per-message basis may not be as efficient as dealing with them on a batch basis using the HasOffsetRanges interface... but if efficiency was a primary concern, you probably wouldn't be using Python anyway. On Fri, Jun 12, 2015 at 1:05 AM, Saisai Shao sai.sai.s...@gmail.com wrote: Scala KafkaRDD uses a trait to handle this problem, but it is not so easy and straightforward in Python, where we need to have a specific API to handle this, I'm not sure is there any simple workaround to fix this, maybe we should think carefully about it. 2015-06-12 13:59 GMT+08:00 Amit Ramesh a...@yelp.com: Thanks, Jerry. That's what I suspected based on the code I looked at. Any pointers on what is needed to build in this support would be great. This is critical to the project we are currently working on. Thanks! On Thu, Jun 11, 2015 at 10:54 PM, Saisai Shao sai.sai.s...@gmail.com wrote: OK, I get it, I think currently Python based Kafka direct API do not provide such equivalence like Scala, maybe we should figure out to add this into Python API also. 2015-06-12 13:48 GMT+08:00 Amit Ramesh a...@yelp.com: Hi Jerry, Take a look at this example: https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2 The offsets are needed because as RDDs get generated within spark the offsets move further along. With direct Kafka mode the current offsets are no more persisted in Zookeeper but rather within Spark itself. If you want to be able to use zookeeper based monitoring tools to keep track of progress, then this is needed. In my specific case we need to persist Kafka offsets externally so that we can continue from where we left off after a code deployment. In other words, we need exactly-once processing guarantees across code deployments. Spark does not support any state persistence across deployments so this is something we need to handle on our own. Hope that helps. Let me know if not. Thanks! Amit On Thu, Jun 11, 2015 at 10:02 PM, Saisai Shao sai.sai.s...@gmail.com wrote: Hi, What is your meaning of getting the offsets from the RDD, from my understanding, the offsetRange is a parameter you offered to KafkaRDD, why do you still want to get the one previous you set into? Thanks Jerry 2015-06-12 12:36 GMT+08:00 Amit Ramesh a...@yelp.com: Congratulations on the release of 1.4! I have been trying out the direct Kafka support in python but haven't been able to figure out how to get the offsets from the RDD. 
Looks like the documentation is yet to be updated to include Python examples ( https://spark.apache.org/docs/latest/streaming-kafka-integration.html). I am specifically looking for the equivalent of https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2. I tried digging through the python code but could not find anything related. Any pointers would be greatly appreciated. Thanks! Amit
Re: Remove Hadoop 1 support (Hadoop 2.2) for Spark 1.5?
I feel this is quite different from the Java 6 decision and personally I don't see sufficient cause to do it. I would like to understand though Sean - what is the proposal exactly? Hadoop 2 itself supports all of the Hadoop 1 API's, so things like removing the Hadoop 1 variant of sc.hadoopFile, etc, I don't think that makes much sense since so many libraries still use those API's. For YARN support, we already don't support Hadoop 1. So I'll assume what you mean is to prevent or stop supporting from linking against the Hadoop 1 filesystem binaries at runtime (is that right?). The main reason I'd push back is that I do think there are still people running the older versions. For instance at Databricks we use the FileSystem library for talking to S3... every time we've tried to upgrade to Hadoop 2.X there have been significant regressions in performance and we've had to downgrade. That's purely anecdotal, but I think you have people out there using the Hadoop 1 bindings for whom upgrade would be a pain. In terms of our maintenance cost, to me the much bigger cost for us IMO is dealing with differences between e.g. 2.2, 2.4, and 2.6 where major new API's were added. In comparison the Hadoop 1 vs 2 seems fairly low with just a few bugs cropping up here and there. So unlike Java 6 where you have a critical mass of maintenance issues, security issues, etc, I just don't see as compelling a cost here. To me the framework for deciding about these upgrades is the maintenance cost vs the inconvenience for users. - Patrick On Fri, Jun 12, 2015 at 8:45 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: I'm personally in favor, but I don't have a sense of how many people still rely on Hadoop 1. Nick 2015년 6월 12일 (금) 오전 9:13, Steve Loughran ste...@hortonworks.com님이 작성: +1 for 2.2+ Not only are the APis in Hadoop 2 better, there's more people testing Hadoop 2.x spark, and bugs in Hadoop itself being fixed. (usual disclaimers, I work off branch-2.7 snapshots I build nightly, etc) On 12 Jun 2015, at 11:09, Sean Owen so...@cloudera.com wrote: How does the idea of removing support for Hadoop 1.x for Spark 1.5 strike everyone? Really, I mean, Hadoop 2.2, as 2.2 seems to me more consistent with the modern 2.x line than 2.1 or 2.0. The arguments against are simply, well, someone out there might be using these versions. The arguments for are just simplification -- fewer gotchas in trying to keep supporting older Hadoop, of which we've seen several lately. We get to chop out a little bit of shim code and update to use some non-deprecated APIs. Along with removing support for Java 6, it might be a reasonable time to also draw a line under older Hadoop too. I'm just gauging feeling now: for, against, indifferent? I favor it, but would not push hard on it if there are objections. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Remove Hadoop 1 support (Hadoop 2.2) for Spark 1.5?
+1 for Hadoop 2.2+ On Fri, Jun 12, 2015 at 8:45 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: I'm personally in favor, but I don't have a sense of how many people still rely on Hadoop 1. Nick 2015년 6월 12일 (금) 오전 9:13, Steve Loughran ste...@hortonworks.com님이 작성: +1 for 2.2+ Not only are the APis in Hadoop 2 better, there's more people testing Hadoop 2.x spark, and bugs in Hadoop itself being fixed. (usual disclaimers, I work off branch-2.7 snapshots I build nightly, etc) On 12 Jun 2015, at 11:09, Sean Owen so...@cloudera.com wrote: How does the idea of removing support for Hadoop 1.x for Spark 1.5 strike everyone? Really, I mean, Hadoop 2.2, as 2.2 seems to me more consistent with the modern 2.x line than 2.1 or 2.0. The arguments against are simply, well, someone out there might be using these versions. The arguments for are just simplification -- fewer gotchas in trying to keep supporting older Hadoop, of which we've seen several lately. We get to chop out a little bit of shim code and update to use some non-deprecated APIs. Along with removing support for Java 6, it might be a reasonable time to also draw a line under older Hadoop too. I'm just gauging feeling now: for, against, indifferent? I favor it, but would not push hard on it if there are objections. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Remove Hadoop 1 support (Hadoop 2.2) for Spark 1.5?
My 2 cents: The biggest reason from my view for keeping Hadoop 1 support was that our EC2 scripts which launch an environment for benchmarking / testing / research only supported Hadoop 1 variants till very recently. We did add Hadoop 2.4 support a few weeks back but that it is still not the default option. My concern is that people have higher level projects which are linked to Hadoop 1.0.4 + Spark, because that is the default environment on EC2, and that users will be surprised when these applications stop working in Spark 1.5. I guess we could announce more widely and write transition guides, but if the cost of supporting Hadoop1 is low enough, I'd vote to keeping it. Thanks Shivaram On Fri, Jun 12, 2015 at 9:11 AM, Ram Sriharsha sriharsha@gmail.com wrote: +1 for Hadoop 2.2+ On Fri, Jun 12, 2015 at 8:45 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: I'm personally in favor, but I don't have a sense of how many people still rely on Hadoop 1. Nick 2015년 6월 12일 (금) 오전 9:13, Steve Loughran ste...@hortonworks.com님이 작성: +1 for 2.2+ Not only are the APis in Hadoop 2 better, there's more people testing Hadoop 2.x spark, and bugs in Hadoop itself being fixed. (usual disclaimers, I work off branch-2.7 snapshots I build nightly, etc) On 12 Jun 2015, at 11:09, Sean Owen so...@cloudera.com wrote: How does the idea of removing support for Hadoop 1.x for Spark 1.5 strike everyone? Really, I mean, Hadoop 2.2, as 2.2 seems to me more consistent with the modern 2.x line than 2.1 or 2.0. The arguments against are simply, well, someone out there might be using these versions. The arguments for are just simplification -- fewer gotchas in trying to keep supporting older Hadoop, of which we've seen several lately. We get to chop out a little bit of shim code and update to use some non-deprecated APIs. Along with removing support for Java 6, it might be a reasonable time to also draw a line under older Hadoop too. I'm just gauging feeling now: for, against, indifferent? I favor it, but would not push hard on it if there are objections. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Remove Hadoop 1 support (Hadoop 2.2) for Spark 1.5?
How does the idea of removing support for Hadoop 1.x for Spark 1.5 strike everyone? Really, I mean, Hadoop 2.2, as 2.2 seems to me more consistent with the modern 2.x line than 2.1 or 2.0. The arguments against are simply, well, someone out there might be using these versions. The arguments for are just simplification -- fewer gotchas in trying to keep supporting older Hadoop, of which we've seen several lately. We get to chop out a little bit of shim code and update to use some non-deprecated APIs. Along with removing support for Java 6, it might be a reasonable time to also draw a line under older Hadoop too. I'm just gauging feeling now: for, against, indifferent? I favor it, but would not push hard on it if there are objections. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?
The scala api has 2 ways of calling createDirectStream. One of them allows you to pass a message handler that gets full access to the kafka MessageAndMetadata, including offset. I don't know why the python api was developed with only one way to call createDirectStream, but the first thing I'd look at would be adding that functionality back in. If someone wants help creating a patch for that, just let me know. Dealing with offsets on a per-message basis may not be as efficient as dealing with them on a batch basis using the HasOffsetRanges interface... but if efficiency was a primary concern, you probably wouldn't be using Python anyway. On Fri, Jun 12, 2015 at 1:05 AM, Saisai Shao sai.sai.s...@gmail.com wrote: Scala KafkaRDD uses a trait to handle this problem, but it is not so easy and straightforward in Python, where we need to have a specific API to handle this, I'm not sure is there any simple workaround to fix this, maybe we should think carefully about it. 2015-06-12 13:59 GMT+08:00 Amit Ramesh a...@yelp.com: Thanks, Jerry. That's what I suspected based on the code I looked at. Any pointers on what is needed to build in this support would be great. This is critical to the project we are currently working on. Thanks! On Thu, Jun 11, 2015 at 10:54 PM, Saisai Shao sai.sai.s...@gmail.com wrote: OK, I get it, I think currently Python based Kafka direct API do not provide such equivalence like Scala, maybe we should figure out to add this into Python API also. 2015-06-12 13:48 GMT+08:00 Amit Ramesh a...@yelp.com: Hi Jerry, Take a look at this example: https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2 The offsets are needed because as RDDs get generated within spark the offsets move further along. With direct Kafka mode the current offsets are no more persisted in Zookeeper but rather within Spark itself. If you want to be able to use zookeeper based monitoring tools to keep track of progress, then this is needed. In my specific case we need to persist Kafka offsets externally so that we can continue from where we left off after a code deployment. In other words, we need exactly-once processing guarantees across code deployments. Spark does not support any state persistence across deployments so this is something we need to handle on our own. Hope that helps. Let me know if not. Thanks! Amit On Thu, Jun 11, 2015 at 10:02 PM, Saisai Shao sai.sai.s...@gmail.com wrote: Hi, What is your meaning of getting the offsets from the RDD, from my understanding, the offsetRange is a parameter you offered to KafkaRDD, why do you still want to get the one previous you set into? Thanks Jerry 2015-06-12 12:36 GMT+08:00 Amit Ramesh a...@yelp.com: Congratulations on the release of 1.4! I have been trying out the direct Kafka support in python but haven't been able to figure out how to get the offsets from the RDD. Looks like the documentation is yet to be updated to include Python examples ( https://spark.apache.org/docs/latest/streaming-kafka-integration.html). I am specifically looking for the equivalent of https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2. I tried digging through the python code but could not find anything related. Any pointers would be greatly appreciated. Thanks! Amit
Re: Remove Hadoop 1 support (Hadoop 2.2) for Spark 1.5?
I'm personally in favor, but I don't have a sense of how many people still rely on Hadoop 1. Nick 2015년 6월 12일 (금) 오전 9:13, Steve Loughran ste...@hortonworks.com님이 작성: +1 for 2.2+ Not only are the APis in Hadoop 2 better, there's more people testing Hadoop 2.x spark, and bugs in Hadoop itself being fixed. (usual disclaimers, I work off branch-2.7 snapshots I build nightly, etc) On 12 Jun 2015, at 11:09, Sean Owen so...@cloudera.com wrote: How does the idea of removing support for Hadoop 1.x for Spark 1.5 strike everyone? Really, I mean, Hadoop 2.2, as 2.2 seems to me more consistent with the modern 2.x line than 2.1 or 2.0. The arguments against are simply, well, someone out there might be using these versions. The arguments for are just simplification -- fewer gotchas in trying to keep supporting older Hadoop, of which we've seen several lately. We get to chop out a little bit of shim code and update to use some non-deprecated APIs. Along with removing support for Java 6, it might be a reasonable time to also draw a line under older Hadoop too. I'm just gauging feeling now: for, against, indifferent? I favor it, but would not push hard on it if there are objections. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Remove Hadoop 1 support (Hadoop 2.2) for Spark 1.5?
On Fri, Jun 12, 2015 at 5:12 PM, Patrick Wendell pwend...@gmail.com wrote: I would like to understand though Sean - what is the proposal exactly? Hadoop 2 itself supports all of the Hadoop 1 API's, so things like removing the Hadoop 1 variant of sc.hadoopFile, etc, I don't think Not entirely; you can see some binary incompatibilities that have bitten recently. A Hadoop 1 program does not in general work on Hadoop 2 because of this. Part of my thinking is that I'm not clear Hadoop 1.x, and 2.0.x, fully works anymore anyway. See for example SPARK-8057 recently. I recall similar problems with Hadoop 2.0.x-era releases and the Spark build for that which is basically the 'cdh4' build. So one benefit is skipping whatever work would be needed to continue to fix this up, and, the argument is there may be less loss of functionality than it seems. The other is being able to use later APIs. This much is a little minor. The main reason I'd push back is that I do think there are still people running the older versions. For instance at Databricks we use the FileSystem library for talking to S3... every time we've tried to upgrade to Hadoop 2.X there have been significant regressions in performance and we've had to downgrade. That's purely anecdotal, but I think you have people out there using the Hadoop 1 bindings for whom upgrade would be a pain. Yeah, that's the question. Is anyone out there using 1.x? More anecdotes wanted. That might be the most interesting question. No CDH customers would have been for a long while now, for example. (Still a small number of CDH 4 customers out there though, and that's 2.0.x or so, but that's a gray area.) Is the S3 library thing really related to Hadoop 1.x? that comes from jets3t and that's independent. In terms of our maintenance cost, to me the much bigger cost for us IMO is dealing with differences between e.g. 2.2, 2.4, and 2.6 where major new API's were added. In comparison the Hadoop 1 vs 2 seems Really? I'd say the opposite. No APIs that are only in 2.2, let alone only in a later version, can be in use now, right? 1.x wouldn't work at all then. I don't know of any binary incompatibilities of the type between 1.x and 2.x, which we have had to shim to make work. In both cases dependencies have to be harmonized here and there, yes. That won't change. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Remove Hadoop 1 support (Hadoop 2.2) for Spark 1.5?
I don't imagine that can be guaranteed to be supported anyway... the 0.x branch has never necessarily worked with Spark, even if it might happen to. Is this really something you would veto for everyone because of your deployment? On Fri, Jun 12, 2015 at 7:18 PM, Thomas Dudziak tom...@gmail.com wrote: -1 to this, we use it with an old Hadoop version (well, a fork of an old version, 0.23). That being said, if there were a nice developer api that separates Spark from Hadoop (or rather, two APIs, one for scheduling and one for HDFS), then we'd be happy to maintain our own plugins for those. cheers, Tom - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Remove Hadoop 1 support (Hadoop 2.2) for Spark 1.5?
I don't like the idea of removing Hadoop 1 unless it becomes a significant maintenance burden, which I don't think it is. You'll always be surprised how many people use old software, even though various companies may no longer support them. With Hadoop 2 in particular, I may be misremembering, but I believe that the experience on Windows is considerably worse because it requires these shell scripts to set permissions that it won't find if you just download Spark. That would be one reason to keep Hadoop 1 in the default build. But I could be wrong, it's been a while since I tried Windows. Matei On Jun 12, 2015, at 11:21 AM, Sean Owen so...@cloudera.com wrote: I don't imagine that can be guaranteed to be supported anyway... the 0.x branch has never necessarily worked with Spark, even if it might happen to. Is this really something you would veto for everyone because of your deployment? On Fri, Jun 12, 2015 at 7:18 PM, Thomas Dudziak tom...@gmail.com wrote: -1 to this, we use it with an old Hadoop version (well, a fork of an old version, 0.23). That being said, if there were a nice developer api that separates Spark from Hadoop (or rather, two APIs, one for scheduling and one for HDFS), then we'd be happy to maintain our own plugins for those. cheers, Tom - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Remove Hadoop 1 support (Hadoop 2.2) for Spark 1.5?
0.23 (and hive 0.12) code base in Spark works well from our perspective, so not sure what you are referring to. As I said, I'm happy to maintain my own plugins but as it stands there is no sane way to do so in Spark because there is no clear separation/developer APIs for these. cheers, Tom On Fri, Jun 12, 2015 at 11:21 AM, Sean Owen so...@cloudera.com wrote: I don't imagine that can be guaranteed to be supported anyway... the 0.x branch has never necessarily worked with Spark, even if it might happen to. Is this really something you would veto for everyone because of your deployment? On Fri, Jun 12, 2015 at 7:18 PM, Thomas Dudziak tom...@gmail.com wrote: -1 to this, we use it with an old Hadoop version (well, a fork of an old version, 0.23). That being said, if there were a nice developer api that separates Spark from Hadoop (or rather, two APIs, one for scheduling and one for HDFS), then we'd be happy to maintain our own plugins for those. cheers, Tom
Re: Remove Hadoop 1 support (Hadoop 2.2) for Spark 1.5?
-1 to this, we use it with an old Hadoop version (well, a fork of an old version, 0.23). That being said, if there were a nice developer api that separates Spark from Hadoop (or rather, two APIs, one for scheduling and one for HDFS), then we'd be happy to maintain our own plugins for those. cheers, Tom On Fri, Jun 12, 2015 at 9:42 AM, Sean Owen so...@cloudera.com wrote: On Fri, Jun 12, 2015 at 5:12 PM, Patrick Wendell pwend...@gmail.com wrote: I would like to understand though Sean - what is the proposal exactly? Hadoop 2 itself supports all of the Hadoop 1 API's, so things like removing the Hadoop 1 variant of sc.hadoopFile, etc, I don't think Not entirely; you can see some binary incompatibilities that have bitten recently. A Hadoop 1 program does not in general work on Hadoop 2 because of this. Part of my thinking is that I'm not clear Hadoop 1.x, and 2.0.x, fully works anymore anyway. See for example SPARK-8057 recently. I recall similar problems with Hadoop 2.0.x-era releases and the Spark build for that which is basically the 'cdh4' build. So one benefit is skipping whatever work would be needed to continue to fix this up, and, the argument is there may be less loss of functionality than it seems. The other is being able to use later APIs. This much is a little minor. The main reason I'd push back is that I do think there are still people running the older versions. For instance at Databricks we use the FileSystem library for talking to S3... every time we've tried to upgrade to Hadoop 2.X there have been significant regressions in performance and we've had to downgrade. That's purely anecdotal, but I think you have people out there using the Hadoop 1 bindings for whom upgrade would be a pain. Yeah, that's the question. Is anyone out there using 1.x? More anecdotes wanted. That might be the most interesting question. No CDH customers would have been for a long while now, for example. (Still a small number of CDH 4 customers out there though, and that's 2.0.x or so, but that's a gray area.) Is the S3 library thing really related to Hadoop 1.x? that comes from jets3t and that's independent. In terms of our maintenance cost, to me the much bigger cost for us IMO is dealing with differences between e.g. 2.2, 2.4, and 2.6 where major new API's were added. In comparison the Hadoop 1 vs 2 seems Really? I'd say the opposite. No APIs that are only in 2.2, let alone only in a later version, can be in use now, right? 1.x wouldn't work at all then. I don't know of any binary incompatibilities of the type between 1.x and 2.x, which we have had to shim to make work. In both cases dependencies have to be harmonized here and there, yes. That won't change. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org