A confusing ClassNotFoundException error

2015-06-12 Thread Zhiwei Chan
Hi all, I encountered an error in Spark 1.4.0 and put together an example that reproduces it, as follows. Both snippets run fine in spark-shell, but the second one fails when run with spark-submit. The only difference is that the second snippet uses a function literal in map(), whereas the first uses a

Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-12 Thread Saisai Shao
The Scala KafkaRDD uses a trait to handle this problem, but that is not as easy or straightforward in Python, where we would need a specific API for it. I'm not sure whether there is a simple workaround; maybe we should think carefully about it. 2015-06-12 13:59 GMT+08:00 Amit Ramesh

Re: Contributing to pyspark

2015-06-12 Thread Manoj Kumar
1. Yes, because the issues are in JIRA. 2. Nope (at least as far as MLlib is concerned), because most of it is just wrappers around the underlying Scala functions or methods and is not implemented in pure Python. 3. I'm not sure about this. It seems to work fine for me! HTH On Fri, Jun 12, 2015 at

Re: How to support dependency jars and files on HDFS in standalone cluster mode?

2015-06-12 Thread Cheng Lian
Would you mind filing a JIRA for this? Thanks! Cheng On 6/11/15 2:40 PM, Dong Lei wrote: I think in standalone cluster mode, Spark is supposed to: 1. Download jars and files to the driver 2. Set the driver’s class path 3. Have the driver set up an HTTP file server to distribute these files 4. Worker

Re: Remove Hadoop 1 support (Hadoop < 2.2) for Spark 1.5?

2015-06-12 Thread Steve Loughran
+1 for 2.2+. Not only are the APIs in Hadoop 2 better, there are more people testing Spark on Hadoop 2.x, and bugs in Hadoop itself are being fixed. (Usual disclaimers: I work off branch-2.7 snapshots I build nightly, etc.) On 12 Jun 2015, at 11:09, Sean Owen so...@cloudera.com wrote: How does the

Re: When to expect UTF8String?

2015-06-12 Thread Michael Armbrust
1. Custom aggregators that do map-side combine. This is something I'm hoping to add in Spark 1.5. 2. UDFs with more than 22 arguments, which are not supported by ScalaUdf, and to avoid wrapping a Java function interface in one of 22 different Scala function interfaces depending on the number

RE: When to expect UTF8String?

2015-06-12 Thread Zack Sampson
We are using Expression for two things: 1. Custom aggregators that do map-side combine. 2. UDFs with more than 22 arguments, which are not supported by ScalaUdf, and to avoid wrapping a Java function interface in one of 22 different Scala function interfaces depending on the number of

Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-12 Thread Juan Rodríguez Hortalá
Hi, if you want, I would be happy to work on this. I have worked with KafkaUtils.createDirectStream before, in a pull request that wasn't accepted (https://github.com/apache/spark/pull/5367). I'm fluent in Python and I'm starting to feel comfortable with Scala, so if someone opens a JIRA I can

Contribution

2015-06-12 Thread srinivasraghavansr71
Hi everyone, I am interested in contributing new algorithms and optimizing existing ones in the areas of graph algorithms and machine learning. Please give me some ideas on where to start. Is it possible for me to introduce the notion of neural networks in Apache Spark?

Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-12 Thread Amit Ramesh
Hi Juan, I have created a ticket for this: https://issues.apache.org/jira/browse/SPARK-8337 Thanks! Amit On Fri, Jun 12, 2015 at 3:17 PM, Juan Rodríguez Hortalá juan.rodriguez.hort...@gmail.com wrote: Hi, if you want, I would be happy to work on this. I have worked with

Re: Remove Hadoop 1 support (Hadoop < 2.2) for Spark 1.5?

2015-06-12 Thread Patrick Wendell
I feel this is quite different from the Java 6 decision, and personally I don't see sufficient cause to do it. I would like to understand, though, Sean: what is the proposal exactly? Hadoop 2 itself supports all of the Hadoop 1 APIs, so things like removing the Hadoop 1 variant of sc.hadoopFile,

Re: Remove Hadoop 1 support (Hadoop < 2.2) for Spark 1.5?

2015-06-12 Thread Ram Sriharsha
+1 for Hadoop 2.2+ On Fri, Jun 12, 2015 at 8:45 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: I'm personally in favor, but I don't have a sense of how many people still rely on Hadoop 1. Nick On Fri, Jun 12, 2015 at 9:13 AM, Steve Loughran ste...@hortonworks.com wrote: +1 for 2.2+

Re: Remove Hadoop 1 support (Hadoop < 2.2) for Spark 1.5?

2015-06-12 Thread Shivaram Venkataraman
My 2 cents: The biggest reason, in my view, for keeping Hadoop 1 support was that our EC2 scripts, which launch an environment for benchmarking / testing / research, only supported Hadoop 1 variants until very recently. We did add Hadoop 2.4 support a few weeks back, but it is still not the

Remove Hadoop 1 support (Hadoop < 2.2) for Spark 1.5?

2015-06-12 Thread Sean Owen
How does the idea of removing support for Hadoop 1.x for Spark 1.5 strike everyone? Really, I mean Hadoop < 2.2, as 2.2 seems to me more consistent with the modern 2.x line than 2.1 or 2.0. The arguments against are simply, well, someone out there might be using these versions. The arguments for

Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-12 Thread Cody Koeninger
The Scala API has two ways of calling createDirectStream. One of them allows you to pass a message handler that gets full access to the Kafka MessageAndMetadata, including the offset. I don't know why the Python API was developed with only one way to call createDirectStream, but the first thing I'd
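
The message-handler idea mentioned in this reply can be illustrated without Spark or Kafka. What follows is only a minimal sketch in plain Python, not the PySpark API: the `MessageAndMetadata` tuple here is a hypothetical stand-in for Kafka's per-record metadata, and `with_offset` and `apply_handler` are names invented for illustration. It shows how a handler that receives the full record can carry the offset through to the output.

```python
from typing import Callable, Iterable, List, NamedTuple, Tuple


class MessageAndMetadata(NamedTuple):
    """Hypothetical stand-in for Kafka's per-record metadata (not a PySpark class)."""
    topic: str
    partition: int
    offset: int
    key: bytes
    message: bytes


# Shape of the handler's output: (topic, partition, offset, decoded payload).
Handled = Tuple[str, int, int, str]
Handler = Callable[[MessageAndMetadata], Handled]


def with_offset(mmd: MessageAndMetadata) -> Handled:
    """Keep the topic, partition, and offset alongside the decoded payload."""
    return (mmd.topic, mmd.partition, mmd.offset, mmd.message.decode("utf-8"))


def apply_handler(records: Iterable[MessageAndMetadata], handler: Handler) -> List[Handled]:
    """Hypothetical helper: apply the handler to every record in order."""
    return [handler(record) for record in records]


# Two made-up records on the same topic and partition.
records = [
    MessageAndMetadata("events", 0, 41, b"k1", b"hello"),
    MessageAndMetadata("events", 0, 42, b"k2", b"world"),
]
handled = apply_handler(records, with_offset)
```

Because the handler sees the whole record rather than only the key and value, downstream code can read the offset from each output tuple, for example to record the last offset processed.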

Re: Remove Hadoop 1 support (Hadoop < 2.2) for Spark 1.5?

2015-06-12 Thread Nicholas Chammas
I'm personally in favor, but I don't have a sense of how many people still rely on Hadoop 1. Nick On Fri, Jun 12, 2015 at 9:13 AM, Steve Loughran ste...@hortonworks.com wrote: +1 for 2.2+. Not only are the APIs in Hadoop 2 better, there are more people testing Spark on Hadoop 2.x, and bugs in Hadoop

Re: Remove Hadoop 1 support (Hadoop < 2.2) for Spark 1.5?

2015-06-12 Thread Sean Owen
On Fri, Jun 12, 2015 at 5:12 PM, Patrick Wendell pwend...@gmail.com wrote: I would like to understand, though, Sean: what is the proposal exactly? Hadoop 2 itself supports all of the Hadoop 1 APIs, so things like removing the Hadoop 1 variant of sc.hadoopFile, etc, I don't think Not entirely;

Re: Remove Hadoop 1 support (Hadoop < 2.2) for Spark 1.5?

2015-06-12 Thread Sean Owen
I don't imagine that can be guaranteed to be supported anyway... the 0.x branch has never necessarily worked with Spark, even if it might happen to. Is this really something you would veto for everyone because of your deployment? On Fri, Jun 12, 2015 at 7:18 PM, Thomas Dudziak tom...@gmail.com

Re: Remove Hadoop 1 support (Hadoop < 2.2) for Spark 1.5?

2015-06-12 Thread Matei Zaharia
I don't like the idea of removing Hadoop 1 unless it becomes a significant maintenance burden, which I don't think it is. You'll always be surprised how many people use old software, even though various companies may no longer support it. With Hadoop 2 in particular, I may be misremembering,

Re: Remove Hadoop 1 support (Hadoop < 2.2) for Spark 1.5?

2015-06-12 Thread Thomas Dudziak
The 0.23 (and Hive 0.12) code base in Spark works well from our perspective, so I'm not sure what you are referring to. As I said, I'm happy to maintain my own plugins, but as it stands there is no sane way to do so in Spark because there is no clear separation or developer API for these. Cheers, Tom On

Re: Remove Hadoop 1 support (Hadoop < 2.2) for Spark 1.5?

2015-06-12 Thread Thomas Dudziak
-1 to this; we use it with an old Hadoop version (well, a fork of an old version, 0.23). That being said, if there were a nice developer API that separates Spark from Hadoop (or rather, two APIs, one for scheduling and one for HDFS), then we'd be happy to maintain our own plugins for those.