A confusing ClassNotFoundException error

2015-06-12 Thread Zhiwei Chan
Hi all,

I ran into an error on Spark 1.4.0 and reduced it to the example below. Both
snippets run fine in spark-shell, but the second one fails when submitted with
spark-submit. The only difference is that the second snippet uses a function
literal in map(), while the first uses a named function.

Could anyone tell me why this error happens? Thanks.

== the OK code ==
val sparkConf = new SparkConf().setAppName("Mode Example")
val sc = new SparkContext(sparkConf)
val hive = new HiveContext(sc)

val rdd = sc.parallelize(Range(0, 1000))
def fff = (input: Int) => input + 1
val cnt = rdd.map(i => Array(1, 2, 3)).map(arr => {
  arr.map(fff)
}).count()
print(cnt)

== the not-OK code ==
val sparkConf = new SparkConf().setAppName("Mode Example")
val sc = new SparkContext(sparkConf)
val hive = new HiveContext(sc)

val rdd = sc.parallelize(Range(0, 1000))
// def fff = (input: Int) => input + 1
val cnt = rdd.map(i => Array(1, 2, 3)).map(arr => {
  arr.map(_ + 1)
}).count()
print(cnt)

Submit command: spark-submit --class com.yhd.ycache.magic.Model --jars
./SSExample-0.0.2-SNAPSHOT-jar-with-dependencies.jar  --master local
./SSExample-0.0.2-SNAPSHOT-jar-with-dependencies.jar

the error info:
Exception in thread main java.lang.ClassNotFoundException:
com.yhd.ycache.magic.Model$$anonfun$10$$anonfun$apply$1
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at
org.apache.spark.util.InnerClosureFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:455)
at
com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown
Source)
at
com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown
Source)
at
org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:101)
at
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:197)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:294)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:293)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
at org.apache.spark.rdd.RDD.map(RDD.scala:293)
at com.yhd.ycache.magic.Model$.main(SSExample.scala:238)
at com.yhd.ycache.magic.Model.main(SSExample.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-12 Thread Saisai Shao
The Scala KafkaRDD uses a trait to handle this problem, but that is not so easy
and straightforward in Python, where we would need a specific API to handle it.
I'm not sure whether there is any simple workaround; maybe we should think
carefully about it.

2015-06-12 13:59 GMT+08:00 Amit Ramesh a...@yelp.com:


 Thanks, Jerry. That's what I suspected based on the code I looked at. Any
 pointers on what is needed to build in this support would be great. This is
 critical to the project we are currently working on.

 Thanks!


 On Thu, Jun 11, 2015 at 10:54 PM, Saisai Shao sai.sai.s...@gmail.com
 wrote:

 OK, I get it, I think currently Python based Kafka direct API do not
 provide such equivalence like Scala, maybe we should figure out to add this
 into Python API also.

 2015-06-12 13:48 GMT+08:00 Amit Ramesh a...@yelp.com:


 Hi Jerry,

 Take a look at this example:
 https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2

 The offsets are needed because as RDDs get generated within spark the
 offsets move further along. With direct Kafka mode the current offsets are
 no more persisted in Zookeeper but rather within Spark itself. If you want
 to be able to use zookeeper based monitoring tools to keep track of
 progress, then this is needed.

 In my specific case we need to persist Kafka offsets externally so that
 we can continue from where we left off after a code deployment. In other
 words, we need exactly-once processing guarantees across code deployments.
 Spark does not support any state persistence across deployments so this is
 something we need to handle on our own.

 Hope that helps. Let me know if not.

 Thanks!
 Amit


 On Thu, Jun 11, 2015 at 10:02 PM, Saisai Shao sai.sai.s...@gmail.com
 wrote:

 Hi,

 What is your meaning of getting the offsets from the RDD, from my
 understanding, the offsetRange is a parameter you offered to KafkaRDD, why
 do you still want to get the one previous you set into?

 Thanks
 Jerry

 2015-06-12 12:36 GMT+08:00 Amit Ramesh a...@yelp.com:


 Congratulations on the release of 1.4!

 I have been trying out the direct Kafka support in python but haven't
 been able to figure out how to get the offsets from the RDD. Looks like 
 the
 documentation is yet to be updated to include Python examples (
 https://spark.apache.org/docs/latest/streaming-kafka-integration.html).
 I am specifically looking for the equivalent of
 https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2.
 I tried digging through the python code but could not find anything
 related. Any pointers would be greatly appreciated.

 Thanks!
 Amit








Re: Contributing to pyspark

2015-06-12 Thread Manoj Kumar
1. Yes, because the issues are tracked in JIRA.
2. Nope (at least as far as MLlib is concerned), because most of it consists of
thin wrappers around the underlying Scala functions and methods and is not
implemented in pure Python.
3. I'm not sure about this. It seems to work fine for me!

HTH

On Fri, Jun 12, 2015 at 10:41 AM, Usman Ehtesham Gul uehtesha...@gmail.com
wrote:

 Hello Manoj,

 First of all, thank you for the quick reply. Just a couple more things. I
 have started reading the link you provided; I will definitely filter JIRA
 with the PySpark label.

 Can you verify:

 1) We fork from GitHub, right? I ask because on GitHub I see it's mirrored
 and there is no issues section. I am assuming that is because issues are
 tracked in JIRA.
 2) To contribute to PySpark, we will have to clone the whole project. But if
 our changes/contributions are specific to PySpark only, we can make those
 without touching core Spark and the other client libraries, right?
 3) I think the email u...@spark.apache.org is broken. I am getting mail from
 mailer-dae...@apache.org saying that email could not be sent to this address.
 Can you check this?

 Thank you again. Hope to hear from you soon.

 Usman

 On Jun 12, 2015, at 12:57 AM, Manoj Kumar manojkumarsivaraj...@gmail.com
 wrote:

 Hi,

 Thanks for your interest in PySpark.

 The first thing is to have a look at the how-to-contribute guide
 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
 and filter the JIRAs using the label PySpark.

 If you have your own improvement in mind, you can file a JIRA, discuss it,
 and then send a pull request.

 HTH

 Regards.

 On Fri, Jun 12, 2015 at 9:36 AM, Usman Ehtesham uehtesha...@gmail.com
 wrote:

 Hello,

 I am currently taking a course on Apache Spark via edX (
 https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x)
 and at the same time I am trying to look at the code for PySpark too. I wanted
 to ask: if I would ideally like to contribute to PySpark specifically, how can
 I do that? I do not intend to contribute to core Apache Spark any time soon
 (mainly because I do not know Scala), but I am very comfortable in Python.

 Any tips on how to contribute specifically to pyspark without being
 affected by other parts of Spark would be greatly appreciated.

 P.S.: I ask this because there is a small change/improvement I would like
 to propose. Also since I just started learning Spark, I would like to also
 read and understand the pyspark code as I learn about Spark. :)

 Hope to hear from you soon.

 Usman Ehtesham Gul
 https://github.com/ueg1990




 --
 Godspeed,
 Manoj Kumar,
 http://manojbits.wordpress.com
 http://github.com/MechCoder





-- 
Godspeed,
Manoj Kumar,
http://manojbits.wordpress.com
http://github.com/MechCoder


Re: How to support dependency jars and files on HDFS in standalone cluster mode?

2015-06-12 Thread Cheng Lian

Would you mind to file a JIRA for this? Thanks!

Cheng

On 6/11/15 2:40 PM, Dong Lei wrote:


I think in standalone cluster mode, Spark is supposed to do the following:

1. Download the jars and files to the driver

2. Set the driver's classpath

3. Have the driver set up an HTTP file server to distribute these files

4. Have the workers download from the driver and set up their classpath

Right?

But somehow, the first step fails.

Even if I can make the first step work (using option 1), it seems that
the classpath on the driver is not correctly set.


Thanks

Dong Lei

From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: Thursday, June 11, 2015 2:32 PM
To: Dong Lei
Cc: Dianfei (Keith) Han; dev@spark.apache.org
Subject: Re: How to support dependency jars and files on HDFS in
standalone cluster mode?


Oh sorry, I mistook --jars for --files. Yeah, for jars we need to add 
them to classpath, which is different from regular files.


Cheng

On 6/11/15 2:18 PM, Dong Lei wrote:

Thanks Cheng,

If I do not use --jars, how can I tell Spark to look for the jars (and
files) on HDFS?

Do you mean the driver will not need to set up an HTTP file server
for this scenario, and the workers will fetch the jars and files
from HDFS?

Thanks

Dong Lei

From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: Thursday, June 11, 2015 12:50 PM
To: Dong Lei; dev@spark.apache.org
Cc: Dianfei (Keith) Han
Subject: Re: How to support dependency jars and files on HDFS in
standalone cluster mode?

Since the jars are already on HDFS, you can access them directly
in your Spark application without using --jars

Cheng

On 6/11/15 11:04 AM, Dong Lei wrote:

Hi spark-dev:

I cannot use an HDFS location for the “--jars” or “--files”
option when doing a spark-submit in standalone cluster mode.
For example:

Spark-submit  …  --jars hdfs://ip/1.jar  ….
 hdfs://ip/app.jar (standalone cluster mode)

will not download 1.jar to the driver’s HTTP file server (though the
app.jar will be downloaded to the driver’s directory).

I figured out that the reason Spark does not download the jars is
that when adding a jar to the HTTP file server via sc.addJar, the
function called is Files.copy, which does not support a remote location.

And I think even if Spark could download the jars and add them to the
HTTP file server, the classpath would still not be set correctly,
because the classpath contains the remote location.

So I’m trying to make it work and have come up with two options,
but neither of them seems elegant, and I would like to hear
your advice:

Option 1:

Modify HTTPFileServer.addFileToDir so that it recognizes an “hdfs”
prefix.

This is not ideal because I think it goes beyond the scope of the
HTTP file server.

Option 2:

Modify DriverRunner.downloadUserJar so that it downloads all the
“--jars” and “--files” along with the application jar.

This sounds more reasonable than option 1 for downloading
files. But then I need to read “spark.jars” and “spark.files”
in downloadUserJar or DriverRunner.start and replace them with
local paths. How can I do that?

Do you have a more elegant solution, or is there a plan to
support this in the future?

Thanks

Dong Lei





Re: Remove Hadoop 1 support (Hadoop <2.2) for Spark 1.5?

2015-06-12 Thread Steve Loughran
+1 for 2.2+

Not only are the APIs in Hadoop 2 better, there are more people testing Hadoop
2.x & Spark, and bugs in Hadoop itself being fixed.

(usual disclaimers, I work off branch-2.7 snapshots I build nightly, etc)

 On 12 Jun 2015, at 11:09, Sean Owen so...@cloudera.com wrote:
 
 How does the idea of removing support for Hadoop 1.x for Spark 1.5
 strike everyone? Really, I mean, Hadoop < 2.2, as 2.2 seems to me more
 consistent with the modern 2.x line than 2.1 or 2.0.
 
 The arguments against are simply, well, someone out there might be
 using these versions.
 
 The arguments for are just simplification -- fewer gotchas in trying
 to keep supporting older Hadoop, of which we've seen several lately.
 We get to chop out a little bit of shim code and update to use some
 non-deprecated APIs. Along with removing support for Java 6, it might
 be a reasonable time to also draw a line under older Hadoop too.
 
 I'm just gauging feeling now: for, against, indifferent?
 I favor it, but would not push hard on it if there are objections.
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org
 


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: When to expect UTF8String?

2015-06-12 Thread Michael Armbrust

 1. Custom aggregators that do map-side combine.


This is something I'm hoping to add in Spark 1.5.


 2. UDFs with more than 22 arguments which is not supported by ScalaUdf,
 and to avoid wrapping a Java function interface in one of 22 different
 Scala function interfaces depending on the number of parameters.


I'm super open to suggestions here.  Mind possibly opening a JIRA with a
proposed interface?


RE: When to expect UTF8String?

2015-06-12 Thread Zack Sampson
We are using Expression for two things.

1. Custom aggregators that do map-side combine.

2. UDFs with more than 22 arguments which is not supported by ScalaUdf, and to 
avoid wrapping a Java function interface in one of 22 different Scala function 
interfaces depending on the number of parameters.

Are there methods we can use to convert to/from the internal representation in 
these cases?

From: Michael Armbrust [mich...@databricks.com]
Sent: Thursday, June 11, 2015 9:05 PM
To: Zack Sampson
Cc: dev@spark.apache.org
Subject: Re: When to expect UTF8String?

Through the DataFrame API, users should never see UTF8String.

Expression (and any class in the catalyst package) is considered internal and 
so uses the internal representation of various types.  Which type we use here 
is not stable across releases.

Is there a reason you aren't defining a UDF instead?
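
For reference, a registered UDF stays on the public API and never touches the
internal representation. Here is a minimal Scala sketch (the UDF name strLen,
the sample data, and the column names are made up for illustration, and an
existing SparkContext sc is assumed, as in spark-shell):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Register a plain Scala function as a UDF; at this level the function
// receives ordinary Strings, not UTF8String.
sqlContext.udf.register("strLen", (s: String) => s.length)

// Build a small DataFrame and apply the UDF through a SQL expression.
val df = sqlContext.createDataFrame(Seq(("foo", 1), ("barbaz", 2)))
  .toDF("name", "id")
df.selectExpr("name", "strLen(name) AS len").show()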

On Thu, Jun 11, 2015 at 8:08 PM, zsampson zsamp...@palantir.com wrote:
I'm hoping for some clarity about when to expect String vs UTF8String when
using the Java DataFrames API.

In upgrading to Spark 1.4, I'm dealing with a lot of errors where what was
once a String is now a UTF8String. The comments in the file and the related
commit message indicate that maybe it should be internal to SparkSQL's
implementation.

However, when I add a column containing a custom subclass of Expression, the
row passed to the eval method contains instances of UTF8String. Ditto for
AggregateFunction.update. Is this expected? If so, when should I generally
know to deal with UTF8String objects?



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/When-to-expect-UTF8String-tp12710.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-12 Thread Juan Rodríguez Hortalá
Hi,

If you want, I would be happy to work on this. I have worked with
KafkaUtils.createDirectStream before, in a pull request that wasn't
accepted (https://github.com/apache/spark/pull/5367). I'm fluent in Python
and I'm starting to feel comfortable with Scala, so if someone opens a JIRA
I can take it.

Greetings,

Juan Rodriguez


2015-06-12 15:59 GMT+02:00 Cody Koeninger c...@koeninger.org:

 The scala api has 2 ways of calling createDirectStream.  One of them
 allows you to pass a message handler that gets full access to the kafka
 MessageAndMetadata, including offset.

 I don't know why the python api was developed with only one way to call
 createDirectStream, but the first thing I'd look at would be adding that
 functionality back in.  If someone wants help creating a patch for that,
 just let me know.

 Dealing with offsets on a per-message basis may not be as efficient as
 dealing with them on a batch basis using the HasOffsetRanges interface...
 but if efficiency was a primary concern, you probably wouldn't be using
 Python anyway.

 On Fri, Jun 12, 2015 at 1:05 AM, Saisai Shao sai.sai.s...@gmail.com
 wrote:

 The Scala KafkaRDD uses a trait to handle this problem, but that is not so easy
 and straightforward in Python, where we would need a specific API to handle it.
 I'm not sure whether there is any simple workaround; maybe we should think
 carefully about it.

 2015-06-12 13:59 GMT+08:00 Amit Ramesh a...@yelp.com:


 Thanks, Jerry. That's what I suspected based on the code I looked at.
 Any pointers on what is needed to build in this support would be great.
 This is critical to the project we are currently working on.

 Thanks!


 On Thu, Jun 11, 2015 at 10:54 PM, Saisai Shao sai.sai.s...@gmail.com
 wrote:

 OK, I get it, I think currently Python based Kafka direct API do not
 provide such equivalence like Scala, maybe we should figure out to add this
 into Python API also.

 2015-06-12 13:48 GMT+08:00 Amit Ramesh a...@yelp.com:


 Hi Jerry,

 Take a look at this example:
 https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2

 The offsets are needed because as RDDs get generated within spark the
 offsets move further along. With direct Kafka mode the current offsets are
 no more persisted in Zookeeper but rather within Spark itself. If you want
 to be able to use zookeeper based monitoring tools to keep track of
 progress, then this is needed.

 In my specific case we need to persist Kafka offsets externally so
 that we can continue from where we left off after a code deployment. In
 other words, we need exactly-once processing guarantees across code
 deployments. Spark does not support any state persistence across
 deployments so this is something we need to handle on our own.

 Hope that helps. Let me know if not.

 Thanks!
 Amit


 On Thu, Jun 11, 2015 at 10:02 PM, Saisai Shao sai.sai.s...@gmail.com
 wrote:

 Hi,

 What is your meaning of getting the offsets from the RDD, from my
 understanding, the offsetRange is a parameter you offered to KafkaRDD, 
 why
 do you still want to get the one previous you set into?

 Thanks
 Jerry

 2015-06-12 12:36 GMT+08:00 Amit Ramesh a...@yelp.com:


 Congratulations on the release of 1.4!

 I have been trying out the direct Kafka support in python but
 haven't been able to figure out how to get the offsets from the RDD. 
 Looks
 like the documentation is yet to be updated to include Python examples (
 https://spark.apache.org/docs/latest/streaming-kafka-integration.html).
 I am specifically looking for the equivalent of
 https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2.
 I tried digging through the python code but could not find anything
 related. Any pointers would be greatly appreciated.

 Thanks!
 Amit










Contribution

2015-06-12 Thread srinivasraghavansr71
Hi everyone,
 I am interested in contributing new algorithms and optimizing
existing ones in the areas of graph algorithms and machine learning.
Please give me some ideas on where to start. Is it possible for me to
introduce the notion of neural networks into Apache Spark?



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Contribution-tp12739.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-12 Thread Amit Ramesh
Hi Juan,

I have created a ticket for this:
https://issues.apache.org/jira/browse/SPARK-8337

Thanks!
Amit


On Fri, Jun 12, 2015 at 3:17 PM, Juan Rodríguez Hortalá 
juan.rodriguez.hort...@gmail.com wrote:

 Hi,

 If you want, I would be happy to work on this. I have worked with
 KafkaUtils.createDirectStream before, in a pull request that wasn't
 accepted (https://github.com/apache/spark/pull/5367). I'm fluent in
 Python and I'm starting to feel comfortable with Scala, so if someone opens
 a JIRA I can take it.

 Greetings,

 Juan Rodriguez


 2015-06-12 15:59 GMT+02:00 Cody Koeninger c...@koeninger.org:

 The scala api has 2 ways of calling createDirectStream.  One of them
 allows you to pass a message handler that gets full access to the kafka
 MessageAndMetadata, including offset.

 I don't know why the python api was developed with only one way to call
 createDirectStream, but the first thing I'd look at would be adding that
 functionality back in.  If someone wants help creating a patch for that,
 just let me know.

 Dealing with offsets on a per-message basis may not be as efficient as
 dealing with them on a batch basis using the HasOffsetRanges interface...
 but if efficiency was a primary concern, you probably wouldn't be using
 Python anyway.

 On Fri, Jun 12, 2015 at 1:05 AM, Saisai Shao sai.sai.s...@gmail.com
 wrote:

 The Scala KafkaRDD uses a trait to handle this problem, but that is not so
 easy and straightforward in Python, where we would need a specific API to
 handle it. I'm not sure whether there is any simple workaround; maybe we
 should think carefully about it.

 2015-06-12 13:59 GMT+08:00 Amit Ramesh a...@yelp.com:


 Thanks, Jerry. That's what I suspected based on the code I looked at.
 Any pointers on what is needed to build in this support would be great.
 This is critical to the project we are currently working on.

 Thanks!


 On Thu, Jun 11, 2015 at 10:54 PM, Saisai Shao sai.sai.s...@gmail.com
 wrote:

 OK, I get it, I think currently Python based Kafka direct API do not
 provide such equivalence like Scala, maybe we should figure out to add 
 this
 into Python API also.

 2015-06-12 13:48 GMT+08:00 Amit Ramesh a...@yelp.com:


 Hi Jerry,

 Take a look at this example:
 https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2

 The offsets are needed because as RDDs get generated within spark the
 offsets move further along. With direct Kafka mode the current offsets 
 are
 no more persisted in Zookeeper but rather within Spark itself. If you 
 want
 to be able to use zookeeper based monitoring tools to keep track of
 progress, then this is needed.

 In my specific case we need to persist Kafka offsets externally so
 that we can continue from where we left off after a code deployment. In
 other words, we need exactly-once processing guarantees across code
 deployments. Spark does not support any state persistence across
 deployments so this is something we need to handle on our own.

 Hope that helps. Let me know if not.

 Thanks!
 Amit


 On Thu, Jun 11, 2015 at 10:02 PM, Saisai Shao sai.sai.s...@gmail.com
  wrote:

 Hi,

 What is your meaning of getting the offsets from the RDD, from my
 understanding, the offsetRange is a parameter you offered to KafkaRDD, 
 why
 do you still want to get the one previous you set into?

 Thanks
 Jerry

 2015-06-12 12:36 GMT+08:00 Amit Ramesh a...@yelp.com:


 Congratulations on the release of 1.4!

 I have been trying out the direct Kafka support in python but
 haven't been able to figure out how to get the offsets from the RDD. 
 Looks
 like the documentation is yet to be updated to include Python examples 
 (
 https://spark.apache.org/docs/latest/streaming-kafka-integration.html).
 I am specifically looking for the equivalent of
 https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2.
 I tried digging through the python code but could not find anything
 related. Any pointers would be greatly appreciated.

 Thanks!
 Amit











Re: Remove Hadoop 1 support (Hadoop <2.2) for Spark 1.5?

2015-06-12 Thread Patrick Wendell
I feel this is quite different from the Java 6 decision and personally
I don't see sufficient cause to do it.

I would like to understand though Sean - what is the proposal exactly?
Hadoop 2 itself supports all of the Hadoop 1 API's, so things like
removing the Hadoop 1 variant of sc.hadoopFile, etc, I don't think
that makes much sense since so many libraries still use those API's.
For YARN support, we already don't support Hadoop 1. So I'll assume
what you mean is to stop supporting linking against the Hadoop 1
filesystem binaries at runtime (is that right?).

The main reason I'd push back is that I do think there are still
people running the older versions. For instance at Databricks we use
the FileSystem library for talking to S3... every time we've tried to
upgrade to Hadoop 2.X there have been significant regressions in
performance and we've had to downgrade. That's purely anecdotal, but I
think you have people out there using the Hadoop 1 bindings for whom
upgrade would be a pain.

In terms of our maintenance cost, to me the much bigger cost for us
IMO is dealing with differences between e.g. 2.2, 2.4, and 2.6 where
major new API's were added. In comparison the Hadoop 1 vs 2 seems
fairly low with just a few bugs cropping up here and there. So unlike
Java 6 where you have a critical mass of maintenance issues, security
issues, etc, I just don't see as compelling a cost here.

To me the framework for deciding about these upgrades is the
maintenance cost vs the inconvenience for users.

- Patrick

On Fri, Jun 12, 2015 at 8:45 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
 I'm personally in favor, but I don't have a sense of how many people still
 rely on Hadoop 1.

 Nick

 On Fri, Jun 12, 2015 at 9:13 AM, Steve Loughran
 ste...@hortonworks.com wrote:

 +1 for 2.2+

 Not only are the APIs in Hadoop 2 better, there are more people testing
 Hadoop 2.x & Spark, and bugs in Hadoop itself being fixed.

 (usual disclaimers, I work off branch-2.7 snapshots I build nightly, etc)

  On 12 Jun 2015, at 11:09, Sean Owen so...@cloudera.com wrote:
 
  How does the idea of removing support for Hadoop 1.x for Spark 1.5
  strike everyone? Really, I mean, Hadoop < 2.2, as 2.2 seems to me more
  consistent with the modern 2.x line than 2.1 or 2.0.
 
  The arguments against are simply, well, someone out there might be
  using these versions.
 
  The arguments for are just simplification -- fewer gotchas in trying
  to keep supporting older Hadoop, of which we've seen several lately.
  We get to chop out a little bit of shim code and update to use some
  non-deprecated APIs. Along with removing support for Java 6, it might
  be a reasonable time to also draw a line under older Hadoop too.
 
  I'm just gauging feeling now: for, against, indifferent?
  I favor it, but would not push hard on it if there are objections.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 


 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Remove Hadoop 1 support (Hadoop <2.2) for Spark 1.5?

2015-06-12 Thread Ram Sriharsha
+1 for Hadoop 2.2+

On Fri, Jun 12, 2015 at 8:45 AM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 I'm personally in favor, but I don't have a sense of how many people still
 rely on Hadoop 1.

 Nick

 On Fri, Jun 12, 2015 at 9:13 AM, Steve Loughran
 ste...@hortonworks.com wrote:

 +1 for 2.2+

 Not only are the APIs in Hadoop 2 better, there are more people testing
 Hadoop 2.x & Spark, and bugs in Hadoop itself being fixed.

 (usual disclaimers, I work off branch-2.7 snapshots I build nightly, etc)

  On 12 Jun 2015, at 11:09, Sean Owen so...@cloudera.com wrote:
 
  How does the idea of removing support for Hadoop 1.x for Spark 1.5
  strike everyone? Really, I mean, Hadoop < 2.2, as 2.2 seems to me more
  consistent with the modern 2.x line than 2.1 or 2.0.
 
  The arguments against are simply, well, someone out there might be
  using these versions.
 
  The arguments for are just simplification -- fewer gotchas in trying
  to keep supporting older Hadoop, of which we've seen several lately.
  We get to chop out a little bit of shim code and update to use some
  non-deprecated APIs. Along with removing support for Java 6, it might
  be a reasonable time to also draw a line under older Hadoop too.
 
  I'm just gauging feeling now: for, against, indifferent?
  I favor it, but would not push hard on it if there are objections.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 


 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Remove Hadoop 1 support (Hadoop <2.2) for Spark 1.5?

2015-06-12 Thread Shivaram Venkataraman
My 2 cents: The biggest reason from my view for keeping Hadoop 1 support
was that our EC2 scripts which launch an environment for benchmarking /
testing / research only supported Hadoop 1 variants till very recently.  We
did add Hadoop 2.4 support a few weeks back but that it is still not the
default option.

My concern is that people have higher level projects which are linked to
Hadoop 1.0.4 + Spark, because that is the default environment on EC2, and
that users will be surprised when these applications stop working in Spark
1.5. I guess we could announce more widely and write transition guides, but
if the cost of supporting Hadoop 1 is low enough, I'd vote to keep it.

Thanks
Shivaram

On Fri, Jun 12, 2015 at 9:11 AM, Ram Sriharsha sriharsha@gmail.com
wrote:

 +1 for Hadoop 2.2+

 On Fri, Jun 12, 2015 at 8:45 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 I'm personally in favor, but I don't have a sense of how many people
 still rely on Hadoop 1.

 Nick

  On Fri, Jun 12, 2015 at 9:13 AM, Steve Loughran
  ste...@hortonworks.com wrote:

 +1 for 2.2+

 Not only are the APIs in Hadoop 2 better, there are more people testing
 Hadoop 2.x & Spark, and bugs in Hadoop itself being fixed.

 (usual disclaimers, I work off branch-2.7 snapshots I build nightly, etc)

  On 12 Jun 2015, at 11:09, Sean Owen so...@cloudera.com wrote:
 
  How does the idea of removing support for Hadoop 1.x for Spark 1.5
  strike everyone? Really, I mean, Hadoop < 2.2, as 2.2 seems to me more
  consistent with the modern 2.x line than 2.1 or 2.0.
 
  The arguments against are simply, well, someone out there might be
  using these versions.
 
  The arguments for are just simplification -- fewer gotchas in trying
  to keep supporting older Hadoop, of which we've seen several lately.
  We get to chop out a little bit of shim code and update to use some
  non-deprecated APIs. Along with removing support for Java 6, it might
  be a reasonable time to also draw a line under older Hadoop too.
 
  I'm just gauging feeling now: for, against, indifferent?
  I favor it, but would not push hard on it if there are objections.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 


 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org





Remove Hadoop 1 support (Hadoop <2.2) for Spark 1.5?

2015-06-12 Thread Sean Owen
How does the idea of removing support for Hadoop 1.x for Spark 1.5
strike everyone? Really, I mean, Hadoop < 2.2, as 2.2 seems to me more
consistent with the modern 2.x line than 2.1 or 2.0.

The arguments against are simply, well, someone out there might be
using these versions.

The arguments for are just simplification -- fewer gotchas in trying
to keep supporting older Hadoop, of which we've seen several lately.
We get to chop out a little bit of shim code and update to use some
non-deprecated APIs. Along with removing support for Java 6, it might
be a reasonable time to also draw a line under older Hadoop too.

I'm just gauging feeling now: for, against, indifferent?
I favor it, but would not push hard on it if there are objections.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-12 Thread Cody Koeninger
The scala api has 2 ways of calling createDirectStream.  One of them allows
you to pass a message handler that gets full access to the kafka
MessageAndMetadata, including offset.

I don't know why the python api was developed with only one way to call
createDirectStream, but the first thing I'd look at would be adding that
functionality back in.  If someone wants help creating a patch for that,
just let me know.

Dealing with offsets on a per-message basis may not be as efficient as
dealing with them on a batch basis using the HasOffsetRanges interface...
but if efficiency was a primary concern, you probably wouldn't be using
Python anyway.
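
For reference, the batch-level approach from the Scala example linked above
looks roughly like this (a sketch only: the broker address, topic name, and
batch interval are placeholders, and an existing SparkContext sc is assumed):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

// Direct stream: offsets are tracked by Spark itself rather than ZooKeeper.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("some-topic"))

stream.foreachRDD { rdd =>
  // Each RDD produced by the direct stream implements HasOffsetRanges.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
}

The per-message alternative is the createDirectStream overload that takes a
messageHandler: MessageAndMetadata[K, V] => R, as mentioned above.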

On Fri, Jun 12, 2015 at 1:05 AM, Saisai Shao sai.sai.s...@gmail.com wrote:

 The Scala KafkaRDD uses a trait to handle this problem, but that is not so easy
 and straightforward in Python, where we would need a specific API to handle it.
 I'm not sure whether there is any simple workaround; maybe we should think
 carefully about it.

 2015-06-12 13:59 GMT+08:00 Amit Ramesh a...@yelp.com:


 Thanks, Jerry. That's what I suspected based on the code I looked at. Any
 pointers on what is needed to build in this support would be great. This is
 critical to the project we are currently working on.

 Thanks!


 On Thu, Jun 11, 2015 at 10:54 PM, Saisai Shao sai.sai.s...@gmail.com
 wrote:

 OK, I get it, I think currently Python based Kafka direct API do not
 provide such equivalence like Scala, maybe we should figure out to add this
 into Python API also.

 2015-06-12 13:48 GMT+08:00 Amit Ramesh a...@yelp.com:


 Hi Jerry,

 Take a look at this example:
 https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2

 The offsets are needed because as RDDs get generated within spark the
 offsets move further along. With direct Kafka mode the current offsets are
 no more persisted in Zookeeper but rather within Spark itself. If you want
 to be able to use zookeeper based monitoring tools to keep track of
 progress, then this is needed.

 In my specific case we need to persist Kafka offsets externally so that
 we can continue from where we left off after a code deployment. In other
 words, we need exactly-once processing guarantees across code deployments.
 Spark does not support any state persistence across deployments so this is
 something we need to handle on our own.

 Hope that helps. Let me know if not.

 Thanks!
 Amit


 On Thu, Jun 11, 2015 at 10:02 PM, Saisai Shao sai.sai.s...@gmail.com
 wrote:

 Hi,

 What is your meaning of getting the offsets from the RDD, from my
 understanding, the offsetRange is a parameter you offered to KafkaRDD, why
 do you still want to get the one previous you set into?

 Thanks
 Jerry

 2015-06-12 12:36 GMT+08:00 Amit Ramesh a...@yelp.com:


 Congratulations on the release of 1.4!

 I have been trying out the direct Kafka support in python but haven't
 been able to figure out how to get the offsets from the RDD. Looks like 
 the
 documentation is yet to be updated to include Python examples (
 https://spark.apache.org/docs/latest/streaming-kafka-integration.html).
 I am specifically looking for the equivalent of
 https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2.
 I tried digging through the python code but could not find anything
 related. Any pointers would be greatly appreciated.

 Thanks!
 Amit









Re: Remove Hadoop 1 support (Hadoop <2.2) for Spark 1.5?

2015-06-12 Thread Nicholas Chammas
I'm personally in favor, but I don't have a sense of how many people still
rely on Hadoop 1.

Nick

On Fri, Jun 12, 2015 at 9:13 AM, Steve Loughran
ste...@hortonworks.com wrote:

+1 for 2.2+

 Not only are the APIs in Hadoop 2 better, there are more people testing
 Hadoop 2.x & Spark, and bugs in Hadoop itself being fixed.

 (usual disclaimers, I work off branch-2.7 snapshots I build nightly, etc)

  On 12 Jun 2015, at 11:09, Sean Owen so...@cloudera.com wrote:
 
  How does the idea of removing support for Hadoop 1.x for Spark 1.5
  strike everyone? Really, I mean, Hadoop < 2.2, as 2.2 seems to me more
  consistent with the modern 2.x line than 2.1 or 2.0.
 
  The arguments against are simply, well, someone out there might be
  using these versions.
 
  The arguments for are just simplification -- fewer gotchas in trying
  to keep supporting older Hadoop, of which we've seen several lately.
  We get to chop out a little bit of shim code and update to use some
  non-deprecated APIs. Along with removing support for Java 6, it might
  be a reasonable time to also draw a line under older Hadoop too.
 
  I'm just gauging feeling now: for, against, indifferent?
  I favor it, but would not push hard on it if there are objections.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 


 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Remove Hadoop 1 support (Hadoop <2.2) for Spark 1.5?

2015-06-12 Thread Sean Owen
On Fri, Jun 12, 2015 at 5:12 PM, Patrick Wendell pwend...@gmail.com wrote:
 I would like to understand though Sean - what is the proposal exactly?
 Hadoop 2 itself supports all of the Hadoop 1 API's, so things like
 removing the Hadoop 1 variant of sc.hadoopFile, etc, I don't think

Not entirely; you can see some binary incompatibilities that have
bitten recently. A Hadoop 1 program does not in general work on Hadoop
2 because of this.

Part of my thinking is that I'm not clear Hadoop 1.x, and 2.0.x, fully
works anymore anyway. See for example SPARK-8057 recently. I recall
similar problems with Hadoop 2.0.x-era releases and the Spark build
for that which is basically the 'cdh4' build.

So one benefit is skipping whatever work would be needed to continue
to fix this up, and, the argument is there may be less loss of
functionality than it seems. The other is being able to use later
APIs. This much is a little minor.


 The main reason I'd push back is that I do think there are still
 people running the older versions. For instance at Databricks we use
 the FileSystem library for talking to S3... every time we've tried to
 upgrade to Hadoop 2.X there have been significant regressions in
 performance and we've had to downgrade. That's purely anecdotal, but I
 think you have people out there using the Hadoop 1 bindings for whom
 upgrade would be a pain.

Yeah, that's the question. Is anyone out there using 1.x? More
anecdotes wanted. That might be the most interesting question.

No CDH customers would have been for a long while now, for example.
(Still a small number of CDH 4 customers out there though, and that's
2.0.x or so, but that's a gray area.)

Is the S3 library thing really related to Hadoop 1.x? that comes from
jets3t and that's independent.


 In terms of our maintenance cost, to me the much bigger cost for us
 IMO is dealing with differences between e.g. 2.2, 2.4, and 2.6 where
 major new API's were added. In comparison the Hadoop 1 vs 2 seems

Really? I'd say the opposite. No APIs that are only in 2.2, let alone
only in a later version, can be in use now, right? 1.x wouldn't work
at all then. I don't know of any binary incompatibilities of the type
between 1.x and 2.x, which we have had to shim to make work.

In both cases dependencies have to be harmonized here and there, yes.
That won't change.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Remove Hadoop 1 support (Hadoop <2.2) for Spark 1.5?

2015-06-12 Thread Sean Owen
I don't imagine that can be guaranteed to be supported anyway... the
0.x branch has never necessarily worked with Spark, even if it might
happen to. Is this really something you would veto for everyone
because of your deployment?

On Fri, Jun 12, 2015 at 7:18 PM, Thomas Dudziak tom...@gmail.com wrote:
 -1 to this, we use it with an old Hadoop version (well, a fork of an old
 version, 0.23). That being said, if there were a nice developer api that
 separates Spark from Hadoop (or rather, two APIs, one for scheduling and one
 for HDFS), then we'd be happy to maintain our own plugins for those.

 cheers,
 Tom


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Remove Hadoop 1 support (Hadoop <2.2) for Spark 1.5?

2015-06-12 Thread Matei Zaharia
I don't like the idea of removing Hadoop 1 unless it becomes a significant 
maintenance burden, which I don't think it is. You'll always be surprised how 
many people use old software, even though various companies may no longer 
support them.

With Hadoop 2 in particular, I may be misremembering, but I believe that the 
experience on Windows is considerably worse because it requires these shell 
scripts to set permissions that it won't find if you just download Spark. That 
would be one reason to keep Hadoop 1 in the default build. But I could be 
wrong, it's been a while since I tried Windows.

Matei


 On Jun 12, 2015, at 11:21 AM, Sean Owen so...@cloudera.com wrote:
 
 I don't imagine that can be guaranteed to be supported anyway... the
 0.x branch has never necessarily worked with Spark, even if it might
 happen to. Is this really something you would veto for everyone
 because of your deployment?
 
 On Fri, Jun 12, 2015 at 7:18 PM, Thomas Dudziak tom...@gmail.com wrote:
 -1 to this, we use it with an old Hadoop version (well, a fork of an old
 version, 0.23). That being said, if there were a nice developer api that
 separates Spark from Hadoop (or rather, two APIs, one for scheduling and one
 for HDFS), then we'd be happy to maintain our own plugins for those.
 
 cheers,
 Tom
 
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org
 


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Remove Hadoop 1 support (Hadoop <2.2) for Spark 1.5?

2015-06-12 Thread Thomas Dudziak
0.23 (and hive 0.12) code base in Spark works well from our perspective, so
not sure what you are referring to. As I said, I'm happy to maintain my own
plugins but as it stands there is no sane way to do so in Spark because
there is no clear separation/developer APIs for these.

cheers,
Tom

On Fri, Jun 12, 2015 at 11:21 AM, Sean Owen so...@cloudera.com wrote:

 I don't imagine that can be guaranteed to be supported anyway... the
 0.x branch has never necessarily worked with Spark, even if it might
 happen to. Is this really something you would veto for everyone
 because of your deployment?

 On Fri, Jun 12, 2015 at 7:18 PM, Thomas Dudziak tom...@gmail.com wrote:
  -1 to this, we use it with an old Hadoop version (well, a fork of an old
  version, 0.23). That being said, if there were a nice developer api that
  separates Spark from Hadoop (or rather, two APIs, one for scheduling and
 one
  for HDFS), then we'd be happy to maintain our own plugins for those.
 
  cheers,
  Tom
 



Re: Remove Hadoop 1 support (Hadoop <2.2) for Spark 1.5?

2015-06-12 Thread Thomas Dudziak
-1 to this, we use it with an old Hadoop version (well, a fork of an old
version, 0.23). That being said, if there were a nice developer api that
separates Spark from Hadoop (or rather, two APIs, one for scheduling and
one for HDFS), then we'd be happy to maintain our own plugins for those.

cheers,
Tom

On Fri, Jun 12, 2015 at 9:42 AM, Sean Owen so...@cloudera.com wrote:

 On Fri, Jun 12, 2015 at 5:12 PM, Patrick Wendell pwend...@gmail.com
 wrote:
  I would like to understand though Sean - what is the proposal exactly?
  Hadoop 2 itself supports all of the Hadoop 1 API's, so things like
  removing the Hadoop 1 variant of sc.hadoopFile, etc, I don't think

 Not entirely; you can see some binary incompatibilities that have
 bitten recently. A Hadoop 1 program does not in general work on Hadoop
 2 because of this.

 Part of my thinking is that I'm not clear Hadoop 1.x, and 2.0.x, fully
 works anymore anyway. See for example SPARK-8057 recently. I recall
 similar problems with Hadoop 2.0.x-era releases and the Spark build
 for that which is basically the 'cdh4' build.

 So one benefit is skipping whatever work would be needed to continue
 to fix this up, and, the argument is there may be less loss of
 functionality than it seems. The other is being able to use later
 APIs. This much is a little minor.


  The main reason I'd push back is that I do think there are still
  people running the older versions. For instance at Databricks we use
  the FileSystem library for talking to S3... every time we've tried to
  upgrade to Hadoop 2.X there have been significant regressions in
  performance and we've had to downgrade. That's purely anecdotal, but I
  think you have people out there using the Hadoop 1 bindings for whom
  upgrade would be a pain.

 Yeah, that's the question. Is anyone out there using 1.x? More
 anecdotes wanted. That might be the most interesting question.

 No CDH customers would have been for a long while now, for example.
 (Still a small number of CDH 4 customers out there though, and that's
 2.0.x or so, but that's a gray area.)

 Is the S3 library thing really related to Hadoop 1.x? that comes from
 jets3t and that's independent.


  In terms of our maintenance cost, to me the much bigger cost for us
  IMO is dealing with differences between e.g. 2.2, 2.4, and 2.6 where
  major new API's were added. In comparison the Hadoop 1 vs 2 seems

 Really? I'd say the opposite. No APIs that are only in 2.2, let alone
 only in a later version, can be in use now, right? 1.x wouldn't work
 at all then. I don't know of any binary incompatibilities of the type
 between 1.x and 2.x, which we have had to shim to make work.

 In both cases dependencies have to be harmonized here and there, yes.
 That won't change.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org