Re: Can I add a new method to RDD class?

2016-12-06 Thread Jakob Odersky
Yes, I think changing the `version` property (line 29) in Spark's root
pom.xml should be sufficient. However, keep in mind that you'll also
need to publish Spark locally before you can access it in your test
application.

On Tue, Dec 6, 2016 at 2:50 AM, Teng Long <longteng...@gmail.com> wrote:
> Thank you Jakob for clearing things up for me.
>
> Before, I thought my application was compiled against my local build since I
> can get all the logs I just added in spark-core. But it was all along using
> spark downloaded from remote maven repository, and that’s why I “cannot" add
> new RDD methods in.
>
> How can I specify a custom version? modify version numbers in all the
> pom.xml file?
>
>
>
> On Dec 5, 2016, at 9:12 PM, Jakob Odersky <ja...@odersky.com> wrote:
>
> m rdds in an "org.apache.spark" package as well
>
>




Re: Can I add a new method to RDD class?

2016-12-05 Thread Jakob Odersky
It looks like you're having issues with including your custom spark
version (with the extensions) in your test project. To use your local
spark version:
1) make sure it has a custom version (let's call it 2.1.0-CUSTOM)
2) publish it to your local machine with `sbt publishLocal`
3) include the modified version of spark in your test project with
`libraryDependencies += "org.apache.spark" %% "spark-core" %
"2.1.0-CUSTOM"`

However, as others have said, it can be quite a lot of work to
maintain a custom fork of Spark. If you're planning on contributing
these changes back to Spark, then forking is the way to go (although I
would recommend keeping an ongoing discussion with the maintainers, to
make sure your work will be merged back). Otherwise, I would recommend
using "implicit extensions" to enrich your RDDs instead. An easy
workaround to access Spark-private fields is to simply define your
custom RDDs in an "org.apache.spark" package as well ;)
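
For illustration, a minimal sketch of the implicit-extension approach (the names RDDExtensions, RichRDD and countDistinct are made up for this example, not taken from the thread):

import org.apache.spark.rdd.RDD

object RDDExtensions {
  // adds what looks like a new RDD method without touching Spark's sources
  implicit class RichRDD[T](val rdd: RDD[T]) extends AnyVal {
    def countDistinct(): Long = rdd.distinct().count()
  }
}

// usage, after `import RDDExtensions._`:
//   sc.parallelize(Seq(1, 2, 2, 3)).countDistinct()  // returns 2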




Re: StructuredStreaming Custom Sinks (motivated by Structured Streaming Machine Learning)

2016-10-04 Thread Jakob Odersky
Hi everyone,

is there any ongoing discussion/documentation on the redesign of sinks?
I think it could be a good thing to abstract away the underlying
streaming model; however, that isn't directly related to Holden's first
point. The way I understand it, the proposal is to slightly change the
DataStreamWriter API (the thing that's returned when you call
"df.writeStream") to allow passing in a custom sink provider instead
of only accepting strings. This would allow users to write their own
providers and sinks, and give them a strongly typed, possibly generic
way to do so. The sink API is currently available to users indirectly
(you can create your own sink provider and load it with the built-in
DataSource reflection functionality), therefore I don't quite
understand why exposing it through a typed interface should be delayed
until the API is finalized.
On a side note, I saw that sources have a similar limitation in that
they are currently only available through a stringly-typed interface.
Could a similar solution be applied to sources? Maybe the writer and
reader APIs could even be unified to a certain degree.
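
For concreteness, a rough sketch of the indirect route mentioned above, written against the Spark 2.0-era Sink and StreamSinkProvider interfaces (class names are made up and the exact signatures should be treated as approximate, not as a finalized proposal):

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.StreamSinkProvider
import org.apache.spark.sql.streaming.OutputMode

class LoggingSink extends Sink {
  // per the limitations discussed above, stick to collect/foreach on `data`
  override def addBatch(batchId: Long, data: DataFrame): Unit =
    data.collect().foreach(row => println(s"batch $batchId: $row"))
}

class LoggingSinkProvider extends StreamSinkProvider {
  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink = new LoggingSink
}

// loaded by class name via the DataSource reflection mechanism:
//   df.writeStream.format("my.pkg.LoggingSinkProvider").start()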

Shivaram, I like your ideas on the proposed redesign! Can we discuss
this further?

cheers,
--Jakob


On Mon, Sep 26, 2016 at 5:12 PM, Shivaram Venkataraman
 wrote:
> Disclaimer - I am not very closely involved with Structured Streaming
> design / development, so this is just my two cents from looking at the
> discussion in the linked JIRAs and PRs.
>
> It seems to me there are a couple of issues being conflated here: (a)
> is the question of how to specify or add more functionality to the
> Sink API such as ability to get model updates back to the driver [A
> design issue IMHO] (b) question of how to pass parameters to
> DataFrameWriter, esp. strings vs. typed objects and whether the API is
> stable vs. experimental
>
> TLDR is that I think we should first focus on refactoring the Sink and
> add new functionality after that. Detailed comments below.
>
> Sink design / functionality: Looking at SPARK-10815, a JIRA linked
> from SPARK-16407, it looks like the existing Sink API is limited
> because it is tied to the RDD/Dataframe definitions. It also has
> surprising limitations like not being able to run operators on `data`
> and only using `collect/foreach`.  Given these limitations, I think it
> makes sense to redesign the Sink API first *before* adding new
> functionality to the existing Sink. I understand that we have not
> marked this experimental in 2.0.0 -- but I guess since
> StructuredStreaming is new as a whole, so we can probably break the
> Sink API in an upcoming 2.1.0 release.
>
> As a part of the redesign, I think we need to do two things: (i) come
> up with a new data handle that separates RDD from what is passed to
> the Sink (ii) Have some way to specify code that can run on the
> driver. This might not be an issue if the data handle already has
> clean abstraction for this.
>
> Micro-batching: Ideally it would be good to not expose the micro-batch
> processing model in the Sink API as this might change going forward.
> Given the consistency model we are presenting I think there will be
> some notion of batch / time-range identifier in the API. But I think
> if we can avoid having hard constraints on where functions will get
> run (i.e. on the driver vs. as a part of a job etc.) and when
> functions will get run (i.e. strictly after every micro-batch) it
> might give us more freedom in improving performance going forward [1].
>
> Parameter passing: I think your point that typed is better than
> untyped is pretty good and supporting both APIs isn't necessarily bad
> either. My understanding of the discussion around this is that we should
> do this after Sink is refactored to avoid exposing the old APIs ?
>
> Thanks
> Shivaram
>
> [1] FWIW this is something I am looking at and
> https://spark-summit.org/2016/events/low-latency-execution-for-apache-spark/
> has some details about this.
>
>
> On Mon, Sep 26, 2016 at 1:38 PM, Holden Karau  wrote:
>> Hi Spark Developers,
>>
>>
>> After some discussion on SPARK-16407 (and on the PR) we’ve decided to jump
>> back to the developer list (SPARK-16407 itself comes from our early work on
>> SPARK-16424 to enable ML with the new Structured Streaming API). SPARK-16407
>> is proposing to extend the current DataStreamWriter API to allow users to
>> specify a specific instance of a StreamSinkProvider - this makes it easier
>> for users to create sinks that are configured with things besides strings
>> (for example things like lambdas). An example of something like this already
>> inside Spark is the ForeachSink.
>>
>>
>> We have been working on adding support for online learning in Structured
>> Streaming, similar to what Spark Streaming and MLLib provide today. Details
>> are available in  SPARK-16424. Along the way, we noticed that there is
>> currently no way for code running in the driver to access the 

Re: Running Spark master/slave instances in non Daemon mode

2016-10-03 Thread Jakob Odersky
Hi Mike,
I can imagine the trouble that daemonization is causing, and I think
that having a non-forking start script is a good idea. A simple,
non-intrusive fix could be to change the "spark-daemon.sh" script to
conditionally omit the "nohup &".
Personally, I think the semantically correct approach would be to also
rename "spark-daemon" to something else (since it won't necessarily
start a background process anymore); however, that may have the
potential to break things, in which case it is probably not worth the
cosmetic rename.

best,
--Jakob


On Thu, Sep 29, 2016 at 6:47 PM, Mike Ihbe <m...@mustwin.com> wrote:
> Our particular use case is for Nomad, using the "exec" configuration
> described here: https://www.nomadproject.io/docs/drivers/exec.html. It's not
> exactly a container, just a cgroup. It performs a simple fork/exec of a
> command and binds to the output fds from that process, so daemonizing is
> causing us minor hardship and seems like an easy thing to make optional.
> We'd be happy to make the PR as well.
>
> --Mike
>
> On Thu, Sep 29, 2016 at 5:25 PM, Jakob Odersky <ja...@odersky.com> wrote:
>>
>> I'm curious, what kind of container solutions require foreground
>> processes? Most init systems work fine with "starter" processes that
>> run other processes. IIRC systemd and start-stop-daemon have an option
>> called "fork", that will expect the main process to run another one in
>> the background and only consider the former complete when the latter
>> exits. I'm not against having a non-forking start script, I'm just
>> wondering where you'd run into issues.
>>
>> Regarding the logging, would it be an option to create a custom slf4j
>> logger that uses the standard mechanisms exposed by the system?
>>
>> best,
>> --Jakob
>>
>> On Thu, Sep 29, 2016 at 3:46 PM, jpuro <jp...@mustwin.com> wrote:
>> > Hi,
>> >
>> > I recently tried deploying Spark master and slave instances to container
>> > based environments such as Docker, Nomad etc. There are two issues that
>> > I've
>> > found with how the startup scripts work. The sbin/start-master.sh and
>> > sbin/start-slave.sh start a daemon by default, but this isn't as
>> > compatible
>> > with container deployments as one would think. The first issue is that
>> > the
>> > daemon runs in the background and some container solutions require the
>> > apps
>> > to run in the foreground or they consider the application to not be
>> > running
>> > and they may close down the task. The second issue is that logs don't
>> > seem
>> > to get integrated with the logging mechanism in the container solution.
>> > What
>> > is the possibility of adding additional flags or startup scripts for
>> > supporting Spark to run in the foreground? It would be great if a flag
>> > like
>> > SPARK_NO_DAEMONIZE could be added or another script for foreground
>> > execution.
>> >
>> > Regards,
>> >
>> > Jeff
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> > http://apache-spark-developers-list.1001551.n3.nabble.com/Running-Spark-master-slave-instances-in-non-Daemon-mode-tp19172.html
>> > Sent from the Apache Spark Developers List mailing list archive at
>> > Nabble.com.
>> >
>> > -
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>
>
>
> --
> Mike Ihbe
> MustWin - Principal
>
> m...@mustwin.com
> mikeji...@gmail.com
> skype: mikeihbe
> Cell: 651.283.0815

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: java.util.NoSuchElementException when serializing Map with default value

2016-10-03 Thread Jakob Odersky
Hi Kabeer,

which version of Spark are you using? I can't reproduce the error in
latest Spark master.

regards,
--Jakob




Re: Running Spark master/slave instances in non Daemon mode

2016-09-29 Thread Jakob Odersky
I'm curious, what kind of container solutions require foreground
processes? Most init systems work fine with "starter" processes that
run other processes. IIRC systemd and start-stop-daemon have an option
called "fork", that will expect the main process to run another one in
the background and only consider the former complete when the latter
exits. I'm not against having a non-forking start script, I'm just
wondering where you'd run into issues.

Regarding the logging, would it be an option to create a custom slf4j
logger that uses the standard mechanisms exposed by the system?

best,
--Jakob

On Thu, Sep 29, 2016 at 3:46 PM, jpuro  wrote:
> Hi,
>
> I recently tried deploying Spark master and slave instances to container
> based environments such as Docker, Nomad etc. There are two issues that I've
> found with how the startup scripts work. The sbin/start-master.sh and
> sbin/start-slave.sh start a daemon by default, but this isn't as compatible
> with container deployments as one would think. The first issue is that the
> daemon runs in the background and some container solutions require the apps
> to run in the foreground or they consider the application to not be running
> and they may close down the task. The second issue is that logs don't seem
> to get integrated with the logging mechanism in the container solution. What
> is the possibility of adding additional flags or startup scripts for
> supporting Spark to run in the foreground? It would be great if a flag like
> SPARK_NO_DAEMONIZE could be added or another script for foreground
> execution.
>
> Regards,
>
> Jeff
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Running-Spark-master-slave-instances-in-non-Daemon-mode-tp19172.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>




Re: java.util.NoSuchElementException when serializing Map with default value

2016-09-28 Thread Jakob Odersky
I agree with Sean's answer; you can check out the relevant serializer
here:
https://github.com/twitter/chill/blob/develop/chill-scala/src/main/scala/com/twitter/chill/Traversable.scala
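
For reference, a minimal sketch of the client-side workaround Sean mentions: drop the default value before the map is shipped and re-apply it on access.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("bar")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

val aMap = Map[String, Long]().withDefaultValue(0L)
val plain = aMap.toSeq.toMap  // a plain Map, without the default value
sc.parallelize(Seq(plain)).map(_.getOrElse("a", 0L)).first()  // res: Long = 0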

On Wed, Sep 28, 2016 at 3:11 AM, Sean Owen  wrote:
> My guess is that Kryo specially handles Maps generically or relies on
> some mechanism that does, and it happens to iterate over all
> key/values as part of that and of course there aren't actually any
> key/values in the map. The Java serialization is a much more literal
> (expensive) field-by-field serialization which works here because
> there's no special treatment. I think you could register a custom
> serializer that handles this case. Or work around it in your client
> code. I know there have been other issues with Kryo and Map because,
> for example, sometimes a Map in an application is actually some
> non-serializable wrapper view.
>
> On Wed, Sep 28, 2016 at 3:18 AM, Maciej Szymkiewicz
>  wrote:
>> Hi everyone,
>>
>> I suspect there is no point in submitting a JIRA to fix this (not a Spark
>> issue?) but I would like to know if this problem is documented anywhere.
>> Somehow Kryo is losing the default value during serialization:
>>
>> scala> import org.apache.spark.{SparkContext, SparkConf}
>> import org.apache.spark.{SparkContext, SparkConf}
>>
>> scala> val aMap = Map[String, Long]().withDefaultValue(0L)
>> aMap: scala.collection.immutable.Map[String,Long] = Map()
>>
>> scala> aMap("a")
>> res6: Long = 0
>>
>> scala> val sc = new SparkContext(new
>> SparkConf().setAppName("bar").set("spark.serializer",
>> "org.apache.spark.serializer.KryoSerializer"))
>>
>> scala> sc.parallelize(Seq(aMap)).map(_("a")).first
>> 16/09/28 09:13:47 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
>> java.util.NoSuchElementException: key not found: a
>>
>> while Java serializer works just fine:
>>
>> scala> val sc = new SparkContext(new
>> SparkConf().setAppName("bar").set("spark.serializer",
>> "org.apache.spark.serializer.JavaSerializer"))
>>
>> scala> sc.parallelize(Seq(aMap)).map(_("a")).first
>> res9: Long = 0
>>
>> --
>> Best regards,
>> Maciej
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>




Re: What's the use of RangePartitioner.hashCode

2016-09-22 Thread Jakob Odersky
Hash codes should try to avoid collisions of objects that are not
equal. Integer overflow is not an issue by itself.

On Wed, Sep 21, 2016 at 10:49 PM, WangJianfei
 wrote:
> Thank you very much sir! But what I want to know is whether the hashCode
> overflow will cause trouble. Thank you!
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/What-s-the-use-of-RangePartitioner-hashCode-tp18953p18996.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>




Re: What's the use of RangePartitioner.hashCode

2016-09-21 Thread Jakob Odersky
Andrew, you're correct, of course: hashing is a one-way operation with
potential collisions.

On Wed, Sep 21, 2016 at 3:22 PM, Andrew Duffy <r...@aduffy.org> wrote:
> Pedantic note about hashCode and equals: the equality doesn't need to be
> bidirectional, you just need to ensure that a.hashCode == b.hashCode when
> a.equals(b), the bidirectional case is usually harder to satisfy due to
> possibility of collisions.
>
> Good info:
> http://www.programcreek.com/2011/07/java-equals-and-hashcode-contract/
> _____
> From: Jakob Odersky <ja...@odersky.com>
> Sent: Wednesday, September 21, 2016 15:12
> Subject: Re: What's the use of RangePartitioner.hashCode
> To: WangJianfei <wangjianfe...@otcaix.iscas.ac.cn>
> Cc: dev <dev@spark.apache.org>
>
>
>
> Hi,
> It is used jointly with a custom implementation of the `equals`
> method. In Scala, you can override the `equals` method to change the
> behaviour of `==` comparison. One example of this would be to compare
> classes based on their parameter values (i.e. what case classes do).
> Partitioners aren't case classes; however, it makes sense to have a
> value comparison between them (see RDD.subtract for an example) and
> hence they redefine the equals method.
> When redefining an equals method, it is good practice to also redefine
> the hashCode method so that `a == b` iff `a.hashCode == b.hashCode`
> (e.g. this is useful when your objects will be stored in a hash map).
> You can learn more about redefining the equals method and hashcodes
> here
> https://www.safaribooksonline.com/library/view/scala-cookbook/9781449340292/ch04s16.html
>
>
> regards,
> --Jakob
>
> On Thu, Sep 15, 2016 at 6:17 PM, WangJianfei
> <wangjianfe...@otcaix.iscas.ac.cn> wrote:
>> who can give me an example of the use of RangePartitioner.hashCode, thank
>> you!
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/What-s-the-use-of-RangePartitioner-hashCode-tp18953.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>




Re: What's the use of RangePartitioner.hashCode

2016-09-21 Thread Jakob Odersky
Hi,
It is used jointly with a custom implementation of the `equals`
method. In Scala, you can override the `equals` method to change the
behaviour of `==` comparison. One example of this would be to compare
classes based on their parameter values (i.e. what case classes do).
Partitioners aren't case classes; however, it makes sense to have a
value comparison between them (see RDD.subtract for an example) and
hence they redefine the equals method.
When redefining an equals method, it is good practice to also redefine
the hashCode method so that `a == b` iff `a.hashCode == b.hashCode`
(e.g. this is useful when your objects will be stored in a hash map).
You can learn more about redefining the equals method and hashcodes
here 
https://www.safaribooksonline.com/library/view/scala-cookbook/9781449340292/ch04s16.html
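
As an illustration (a made-up class, not Spark's actual RangePartitioner implementation), keeping the two methods consistent looks roughly like this:

class RangeLikePartitioner(val numPartitions: Int, val bounds: Array[Int]) {
  override def equals(other: Any): Boolean = other match {
    case p: RangeLikePartitioner =>
      p.numPartitions == numPartitions && p.bounds.sameElements(bounds)
    case _ => false
  }
  // a.equals(b) must imply a.hashCode == b.hashCode, so hash the same fields
  override def hashCode: Int = (numPartitions, bounds.toSeq).hashCode
}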


regards,
--Jakob

On Thu, Sep 15, 2016 at 6:17 PM, WangJianfei
 wrote:
> who can give me an example of the use of RangePartitioner.hashCode, thank
> you!
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/What-s-the-use-of-RangePartitioner-hashCode-tp18953.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>




Re: java.lang.NoClassDefFoundError, is this a bug?

2016-09-21 Thread Jakob Odersky
Hi Xiang,

this error also appears in client mode (maybe the situation that you
were referring to, and that worked, was local mode?); however, the error
is expected and is not a bug.

this line in your snippet:
object Main extends A[String] { //...
is, after desugaring, equivalent to:
object Main extends
A[String]()(Env.spark.implicits.newStringEncoder) { //...
Essentially, when the singleton object `Main` is initialised, it will
evaluate all its constructor arguments, i.e. it will call
`Env.spark.implicits.newStringEncoder`. Since your `main` method is
also defined in this object, it will be initialised as soon as your
application starts, that is, before a Spark session is started. The
"problem" is that encoders require an active session, and hence you
have an initialisation-order problem. (You can reproduce the problem
simply by defining a `val x = Env.spark.implicits.newStringEncoder` in
your singleton object.)

The error message is weird and not very helpful (I think this is due to
the way Spark uses ClassLoaders internally when running a submitted
application); however, it isn't a bug in Spark.

In local mode you will not experience the issue because you are
starting a session when the session builder is accessed the first time
via `Env.spark`.

Aside from the errors you're getting, there's another subtlety in your
snippet that may bite you later: adding "T : Encoder" to your
superclass has no effect, given that you also import
Env.spark.implicits._
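
A minimal sketch of one way around the initialisation-order problem (the class A and the session setup are illustrative, not the original code): resolve the encoder only once a session is active, e.g. inside main, instead of in Main's parent constructor.

import org.apache.spark.sql.{Encoder, SparkSession}

class A[T : Encoder]  // still requires an Encoder[T], but only when instantiated

object Main {  // no encoder is evaluated while the object initialises
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("example").getOrCreate()
    import spark.implicits._
    val a = new A[String]  // encoder resolved here, with an active session
  }
}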

best,
--Jakob


On Sat, Sep 17, 2016 at 8:26 PM, Xiang Gao  wrote:
> Yes. Besides, if you change the "T : Encoder" to "T", it OK too.
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/java-lang-NoClassDefFoundError-is-this-a-bug-tp18972p18981.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>




Re: Test fails when compiling spark with tests

2016-09-13 Thread Jakob Odersky
There are some flaky tests that occasionally fail; my first
recommendation would be to re-run the test suite. Another thing to
check is whether there are any applications listening on Spark's default
ports.
Btw, what is your environment like? In case it is Windows, I don't
think tests are regularly run against that platform, and therefore they
could very well be broken.

On Sun, Sep 11, 2016 at 10:49 PM, assaf.mendelson
 wrote:
> Hi,
>
> I am trying to set up a spark development environment. I forked the spark
> git project and cloned the fork. I then checked out branch-2.0 tag (which I
> assume is the released source code).
>
> I then compiled spark twice.
>
> The first using:
>
> mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package
>
> This compiled successfully.
>
> The second using mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 clean
> package
>
> This got a failure in Spark Project Core with the following test failing:
>
> - caching in memory and disk, replicated
>
> - caching in memory and disk, serialized, replicated *** FAILED ***
>
>   java.util.concurrent.TimeoutException: Can't find 2 executors before 3
> milliseconds elapsed
>
>   at
> org.apache.spark.ui.jobs.JobProgressListener.waitUntilExecutorsUp(JobProgressListener.scala:573)
>
>   at
> org.apache.spark.DistributedSuite.org$apache$spark$DistributedSuite$$testCaching(DistributedSuite.scala:154)
>
>   at
> org.apache.spark.DistributedSuite$$anonfun$32$$anonfun$apply$1.apply$mcV$sp(DistributedSuite.scala:191)
>
>   at
> org.apache.spark.DistributedSuite$$anonfun$32$$anonfun$apply$1.apply(DistributedSuite.scala:191)
>
>   at
> org.apache.spark.DistributedSuite$$anonfun$32$$anonfun$apply$1.apply(DistributedSuite.scala:191)
>
>   at
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>
>   ...
>
> - compute without caching when no partitions fit in memory
>
>
>
> I made no changes to the code whatsoever. Can anyone help me figure out what
> is wrong with my environment?
>
> BTW I am using maven 3.3.9 and java 1.8.0_101-b13
>
>
>
> Thanks,
>
> Assaf
>
>
> 
> View this message in context: Test fails when compiling spark with tests
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.




Re: @scala.annotation.varargs or @_root_.scala.annotation.varargs?

2016-09-08 Thread Jakob Odersky
+1 to Sean's answer: import varargs and use the short form.
In this case the _root_ prefix is also unnecessary (it would only be
required if you were using the annotation inside a nested package
called "scala" itself).
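
For example (a small sketch):

import scala.annotation.varargs

class Functions {
  @varargs
  def concat(parts: String*): String = parts.mkString
}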

On Thu, Sep 8, 2016 at 9:27 AM, Sean Owen  wrote:
> I think the @_root_ version is redundant because
> @scala.annotation.varargs is redundant. Actually wouldn't we just
> import varargs and write @varargs?
>
> On Thu, Sep 8, 2016 at 1:24 PM, Jacek Laskowski  wrote:
>> Hi,
>>
>> The code is not consistent with @scala.annotation.varargs annotation.
>> There are classes with @scala.annotation.varargs like DataFrameReader
>> or functions as well as examples of @_root_.scala.annotation.varargs,
>> e.g. Window or UserDefinedAggregateFunction.
>>
>> I think it should be consistent and @scala.annotation.varargs only. WDYT?
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>




Re: help getting started

2016-09-02 Thread Jakob Odersky
Hi Dayne,
you can look at this page for some starter issues:
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20labels%20%3D%20Starter%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened).
Also check out this guide on how to contribute to Spark
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

regards,
--Jakob

On Fri, Sep 2, 2016 at 11:56 AM, dayne sorvisto  wrote:
> Hi,
>
> I'd like to request help from committers/contributors to work on some
> trivial bug fixes or documentation for the Spark project. I'm very
> interested in the machine learning side of things as I have a math
> background. I recently passed the databricks cert and feel I have a decent
> understanding of the key concepts I need to get started as a beginner
> contributor.  and I've signed up for a Jira account.
>
> Thank you
>
> On Fri, Sep 2, 2016 at 12:54 PM, dayne sorvisto 
> wrote:
>>
>> Hi,
>>
>> I'd like to request help from committers/contributors to work on some
>> trivial bug fixes or documentation for the Spark project. I'm very
>> interested in the machine learning side of things as I have a math
>> background. I recently passed the databricks cert and feel I have a decent
>> understanding of the key concepts I need to get started as a beginner
>> contributor.  and I've signed up for a Jira account.
>>
>> Thank you
>
>




Re: SBT doesn't pick resource file after clean

2016-05-20 Thread Jakob Odersky
Ah, I think I see the issue. resourceManaged and core/src/resources
aren't included in the classpath; to achieve that, you need to scope
the setting to either "compile" or "test" (probably compile in your
case). So, the simplest way to add the extra settings would be
something like:

resourceGenerators in Compile += Def.task {
  val file: File = ??? // generate the properties file here, as before
  val out = (resourceManaged in Compile).value / "foo.properties"
  IO.copyFile(file, out)
  Seq(out) // a resource generator must return the files it produced
}.taskValue

On Thu, May 19, 2016 at 7:21 PM, dhruve ashar <dhruveas...@gmail.com> wrote:
> Based on the conversation on PR, the intent was not to pollute the source
> directory and hence we are placing the generated file outside it in the
> target/extra-resources directory. I agree that the "sbt way" is to add the
> generated resources under the resourceManaged setting which was essentially
> the earlier approach implemented.
>
> However, even on generating the  file under the default resourceDirectory =>
> core/src/resources doesn't pick the file in jar after doing a clean. So this
> seems to be a different issue.
>
>
>
>
>
> On Thu, May 19, 2016 at 4:17 PM, Jakob Odersky <ja...@odersky.com> wrote:
>>
>> To echo my comment on the PR: I think the "sbt way" to add extra,
>> generated resources to the classpath is by adding a new task to the
>> `resourceGenerators` setting. Also, the task should output any files
>> into the directory specified by the `resourceManaged` setting. See
>> http://www.scala-sbt.org/0.13/docs/Howto-Generating-Files.html. There
>> shouldn't by any issues with clean if you follow the above
>> conventions.
>>
>> On Tue, May 17, 2016 at 12:00 PM, Marcelo Vanzin <van...@cloudera.com>
>> wrote:
>> > Perhaps you need to make the "compile" task of the appropriate module
>> > depend on the task that generates the resource file?
>> >
>> > Sorry but my knowledge of sbt doesn't really go too far.
>> >
>> > On Tue, May 17, 2016 at 11:58 AM, dhruve ashar <dhruveas...@gmail.com>
>> > wrote:
>> >> We are trying to pick the spark version automatically from pom instead
>> >> of
>> >> manually modifying the files. This also includes richer pieces of
>> >> information like last commit, version, user who built the code etc to
>> >> better
>> >> identify the framework running.
>> >>
>> >> The setup is as follows :
>> >> - A shell script generates this piece of information and dumps it into
>> >> a
>> >> properties file under core/target/extra-resources - we don't want to
>> >> pollute
>> >> the source directory and hence we are generating this under target as
>> >> its
>> >> dealing with build information.
>> >> - The shell script is invoked in both mvn and sbt.
>> >>
>> >> The issue is that sbt doesn't pick up the generated properties file
>> >> after
>> >> doing a clean. But it does pick it up in subsequent runs. Note, the
>> >> properties file is created before the classes are generated.
>> >>
>> >> The code for this is available in the PR :
>> >> https://github.com/apache/spark/pull/13061
>> >>
>> >> Does anybody have an idea about how we can achieve this in sbt?
>> >>
>> >> Thanks,
>> >> Dhruve
>> >>
>> >
>> >
>> >
>> > --
>> > Marcelo
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: dev-h...@spark.apache.org
>> >
>
>
>
>
> --
> -Dhruve Ashar
>




Re: SBT doesn't pick resource file after clean

2016-05-19 Thread Jakob Odersky
To echo my comment on the PR: I think the "sbt way" to add extra,
generated resources to the classpath is by adding a new task to the
`resourceGenerators` setting. Also, the task should output any files
into the directory specified by the `resourceManaged` setting. See
http://www.scala-sbt.org/0.13/docs/Howto-Generating-Files.html. There
shouldn't by any issues with clean if you follow the above
conventions.

On Tue, May 17, 2016 at 12:00 PM, Marcelo Vanzin  wrote:
> Perhaps you need to make the "compile" task of the appropriate module
> depend on the task that generates the resource file?
>
> Sorry but my knowledge of sbt doesn't really go too far.
>
> On Tue, May 17, 2016 at 11:58 AM, dhruve ashar  wrote:
>> We are trying to pick the spark version automatically from pom instead of
>> manually modifying the files. This also includes richer pieces of
>> information like last commit, version, user who built the code etc to better
>> identify the framework running.
>>
>> The setup is as follows :
>> - A shell script generates this piece of information and dumps it into a
>> properties file under core/target/extra-resources - we don't want to pollute
>> the source directory and hence we are generating this under target as its
>> dealing with build information.
>> - The shell script is invoked in both mvn and sbt.
>>
>> The issue is that sbt doesn't pick up the generated properties file after
>> doing a clean. But it does pick it up in subsequent runs. Note, the
>> properties file is created before the classes are generated.
>>
>> The code for this is available in the PR :
>> https://github.com/apache/spark/pull/13061
>>
>> Does anybody have an idea about how we can achieve this in sbt?
>>
>> Thanks,
>> Dhruve
>>
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>




Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-04-04 Thread Jakob Odersky
I just found out how the hash is calculated:

gpg --print-md sha512 <archive>.tgz

you can use that to check if the resulting output matches the contents
of <archive>.tgz.sha

On Mon, Apr 4, 2016 at 3:19 PM, Jakob Odersky <ja...@odersky.com> wrote:
> The published hash is a SHA512.
>
> You can verify the integrity of the packages by running `sha512sum` on
> the archive and comparing the computed hash with the published one.
> Unfortunately however, I don't know what tool is used to generate the
> hash and I can't reproduce the format, so I ended up manually
> comparing the hashes.
>
> On Mon, Apr 4, 2016 at 2:39 PM, Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
>> An additional note: The Spark packages being served off of CloudFront (i.e.
>> the “direct download” option on spark.apache.org) are also corrupt.
>>
>> Btw what’s the correct way to verify the SHA of a Spark package? I’ve tried
>> a few commands on working packages downloaded from Apache mirrors, but I
>> can’t seem to reproduce the published SHA for spark-1.6.1-bin-hadoop2.6.tgz.
>>
>>
>> On Mon, Apr 4, 2016 at 11:45 AM Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>> Maybe temporarily take out the artifacts on S3 before the root cause is
>>> found.
>>>
>>> On Thu, Mar 24, 2016 at 7:25 AM, Nicholas Chammas
>>> <nicholas.cham...@gmail.com> wrote:
>>>>
>>>> Just checking in on this again as the builds on S3 are still broken. :/
>>>>
>>>> Could it have something to do with us moving release-build.sh?
>>>>
>>>>
>>>> On Mon, Mar 21, 2016 at 1:43 PM Nicholas Chammas
>>>> <nicholas.cham...@gmail.com> wrote:
>>>>>
>>>>> Is someone going to retry fixing these packages? It's still a problem.
>>>>>
>>>>> Also, it would be good to understand why this is happening.
>>>>>
>>>>> On Fri, Mar 18, 2016 at 6:49 PM Jakob Odersky <ja...@odersky.com> wrote:
>>>>>>
>>>>>> I just realized you're using a different download site. Sorry for the
>>>>>> confusion, the link I get for a direct download of Spark 1.6.1 /
>>>>>> Hadoop 2.6 is
>>>>>> http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz
>>>>>>
>>>>>> On Fri, Mar 18, 2016 at 3:20 PM, Nicholas Chammas
>>>>>> <nicholas.cham...@gmail.com> wrote:
>>>>>> > I just retried the Spark 1.6.1 / Hadoop 2.6 download and got a
>>>>>> > corrupt ZIP
>>>>>> > file.
>>>>>> >
>>>>>> > Jakob, are you sure the ZIP unpacks correctly for you? Is it the same
>>>>>> > Spark
>>>>>> > 1.6.1/Hadoop 2.6 package you had a success with?
>>>>>> >
>>>>>> > On Fri, Mar 18, 2016 at 6:11 PM Jakob Odersky <ja...@odersky.com>
>>>>>> > wrote:
>>>>>> >>
>>>>>> >> I just experienced the issue, however retrying the download a second
>>>>>> >> time worked. Could it be that there is some load balancer/cache in
>>>>>> >> front of the archive and some nodes still serve the corrupt
>>>>>> >> packages?
>>>>>> >>
>>>>>> >> On Fri, Mar 18, 2016 at 8:00 AM, Nicholas Chammas
>>>>>> >> <nicholas.cham...@gmail.com> wrote:
>>>>>> >> > I'm seeing the same. :(
>>>>>> >> >
>>>>>> >> > On Fri, Mar 18, 2016 at 10:57 AM Ted Yu <yuzhih...@gmail.com>
>>>>>> >> > wrote:
>>>>>> >> >>
>>>>>> >> >> I tried again this morning :
>>>>>> >> >>
>>>>>> >> >> $ wget
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>>>>>> >> >> --2016-03-18 07:55:30--
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>>>>>> >> >> Resolving s3.amazonaws.com... 54.231.19.163
>>>>>> >> >> ...
>>>>>> >> >> $ tar zxf spark-1.6.1-bin-hadoop2.6.tg

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-04-04 Thread Jakob Odersky
The published hash is a SHA512.

You can verify the integrity of the packages by running `sha512sum` on
the archive and comparing the computed hash with the published one.
Unfortunately however, I don't know what tool is used to generate the
hash and I can't reproduce the format, so I ended up manually
comparing the hashes.

On Mon, Apr 4, 2016 at 2:39 PM, Nicholas Chammas
<nicholas.cham...@gmail.com> wrote:
> An additional note: The Spark packages being served off of CloudFront (i.e.
> the “direct download” option on spark.apache.org) are also corrupt.
>
> Btw what’s the correct way to verify the SHA of a Spark package? I’ve tried
> a few commands on working packages downloaded from Apache mirrors, but I
> can’t seem to reproduce the published SHA for spark-1.6.1-bin-hadoop2.6.tgz.
>
>
> On Mon, Apr 4, 2016 at 11:45 AM Ted Yu <yuzhih...@gmail.com> wrote:
>>
>> Maybe temporarily take out the artifacts on S3 before the root cause is
>> found.
>>
>> On Thu, Mar 24, 2016 at 7:25 AM, Nicholas Chammas
>> <nicholas.cham...@gmail.com> wrote:
>>>
>>> Just checking in on this again as the builds on S3 are still broken. :/
>>>
>>> Could it have something to do with us moving release-build.sh?
>>>
>>>
>>> On Mon, Mar 21, 2016 at 1:43 PM Nicholas Chammas
>>> <nicholas.cham...@gmail.com> wrote:
>>>>
>>>> Is someone going to retry fixing these packages? It's still a problem.
>>>>
>>>> Also, it would be good to understand why this is happening.
>>>>
>>>> On Fri, Mar 18, 2016 at 6:49 PM Jakob Odersky <ja...@odersky.com> wrote:
>>>>>
>>>>> I just realized you're using a different download site. Sorry for the
>>>>> confusion, the link I get for a direct download of Spark 1.6.1 /
>>>>> Hadoop 2.6 is
>>>>> http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz
>>>>>
>>>>> On Fri, Mar 18, 2016 at 3:20 PM, Nicholas Chammas
>>>>> <nicholas.cham...@gmail.com> wrote:
>>>>> > I just retried the Spark 1.6.1 / Hadoop 2.6 download and got a
>>>>> > corrupt ZIP
>>>>> > file.
>>>>> >
>>>>> > Jakob, are you sure the ZIP unpacks correctly for you? Is it the same
>>>>> > Spark
>>>>> > 1.6.1/Hadoop 2.6 package you had a success with?
>>>>> >
>>>>> > On Fri, Mar 18, 2016 at 6:11 PM Jakob Odersky <ja...@odersky.com>
>>>>> > wrote:
>>>>> >>
>>>>> >> I just experienced the issue, however retrying the download a second
>>>>> >> time worked. Could it be that there is some load balancer/cache in
>>>>> >> front of the archive and some nodes still serve the corrupt
>>>>> >> packages?
>>>>> >>
>>>>> >> On Fri, Mar 18, 2016 at 8:00 AM, Nicholas Chammas
>>>>> >> <nicholas.cham...@gmail.com> wrote:
>>>>> >> > I'm seeing the same. :(
>>>>> >> >
>>>>> >> > On Fri, Mar 18, 2016 at 10:57 AM Ted Yu <yuzhih...@gmail.com>
>>>>> >> > wrote:
>>>>> >> >>
>>>>> >> >> I tried again this morning :
>>>>> >> >>
>>>>> >> >> $ wget
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>>>>> >> >> --2016-03-18 07:55:30--
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>>>>> >> >> Resolving s3.amazonaws.com... 54.231.19.163
>>>>> >> >> ...
>>>>> >> >> $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz
>>>>> >> >>
>>>>> >> >> gzip: stdin: unexpected end of file
>>>>> >> >> tar: Unexpected EOF in archive
>>>>> >> >> tar: Unexpected EOF in archive
>>>>> >> >> tar: Error is not recoverable: exiting now
>>>>> >> >>
>>>>> >> >> On Thu, Mar 17, 2016 at 8:57 AM, Michael Armbrust
>>>>> >> >> <mich...@databricks.com>
>>>>

Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-24 Thread Jakob Odersky
I mean from the perspective of someone developing Spark, it makes
things more complicated. It's just my point of view; people who
actually support Spark deployments may have a different opinion ;)

On Thu, Mar 24, 2016 at 2:41 PM, Jakob Odersky <ja...@odersky.com> wrote:
> You can, but since it's going to be a maintainability issue I would
> argue it is in fact a problem.
>
> On Thu, Mar 24, 2016 at 2:34 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
>> Hi Jakob,
>>
>> On Thu, Mar 24, 2016 at 2:29 PM, Jakob Odersky <ja...@odersky.com> wrote:
>>> Reynold's 3rd point is particularly strong in my opinion. Supporting
>>> Consider what would happen if Spark 2.0 doesn't require Java 8 and
>>> hence not support Scala 2.12. Will it be stuck on an older version
>>> until 3.0 is out?
>>
>> That's a false choice. You can support 2.10 (or 2.11) on Java 7 and
>> 2.12 on Java 8.
>>
>> I'm not saying it's a great idea, just that what you're suggesting
>> isn't really a problem.
>>
>> --
>> Marcelo




Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-24 Thread Jakob Odersky
You can, but since it's going to be a maintainability issue I would
argue it is in fact a problem.

On Thu, Mar 24, 2016 at 2:34 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
> Hi Jakob,
>
> On Thu, Mar 24, 2016 at 2:29 PM, Jakob Odersky <ja...@odersky.com> wrote:
>> Reynold's 3rd point is particularly strong in my opinion. Supporting
>> Consider what would happen if Spark 2.0 doesn't require Java 8 and
>> hence not support Scala 2.12. Will it be stuck on an older version
>> until 3.0 is out?
>
> That's a false choice. You can support 2.10 (or 2.11) on Java 7 and
> 2.12 on Java 8.
>
> I'm not saying it's a great idea, just that what you're suggesting
> isn't really a problem.
>
> --
> Marcelo




Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-24 Thread Jakob Odersky
Reynold's 3rd point is particularly strong in my opinion. Supporting
Scala 2.12 will require Java 8 anyway, and introducing such a change
is probably best done in a major release.
Consider what would happen if Spark 2.0 doesn't require Java 8 and
hence not support Scala 2.12. Will it be stuck on an older version
until 3.0 is out? Will it be introduced in a minor release?
I think 2.0 is the best time for such a change.

On Thu, Mar 24, 2016 at 11:46 AM, Stephen Boesch  wrote:
> +1 for Java 8 only, +1 for 2.11+ only. At this point Scala libraries
> supporting only 2.10 are typically less active and/or poorly maintained.
> That trend will only continue when considering the lifespan of spark 2.X.
>
> 2016-03-24 11:32 GMT-07:00 Steve Loughran :
>>
>>
>> On 24 Mar 2016, at 15:27, Koert Kuipers  wrote:
>>
>> i think the arguments are convincing, but it also makes me wonder if i
>> live in some kind of alternate universe... we deploy on customers clusters,
>> where the OS, python version, java version and hadoop distro are not chosen
>> by us. so think centos 6, cdh5 or hdp 2.3, java 7 and python 2.6. we simply
>> have access to a single proxy machine and launch through yarn. asking them
>> to upgrade java is pretty much out of the question or a 6+ month ordeal. of
>> the 10 client clusters i can think of on the top of my head all of them are
>> on java 7, none are on java 8. so by doing this you would make spark 2
>> basically unusable for us (unless most of them have plans of upgrading in
>> near term to java 8, i will ask around and report back...).
>>
>>
>>
>> It's not actually mandatory for the process executing in the Yarn cluster
>> to run with the same JVM as the rest of the Hadoop stack; all that is needed
>> is for the environment variables to set up the JAVA_HOME and PATH. Switching
>> JVMs not something which YARN makes it easy to do, but it may be possible,
>> especially if Spark itself provides some hooks, so you don't have to
>> manually lay with setting things up. That may be something which could
>> significantly ease adoption of Spark 2 in YARN clusters. Same for Python.
>>
>> This is something I could probably help others to address
>>
>




Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-18 Thread Jakob Odersky
I just experienced the issue, however retrying the download a second
time worked. Could it be that there is some load balancer/cache in
front of the archive and some nodes still serve the corrupt packages?

On Fri, Mar 18, 2016 at 8:00 AM, Nicholas Chammas
 wrote:
> I'm seeing the same. :(
>
> On Fri, Mar 18, 2016 at 10:57 AM Ted Yu  wrote:
>>
>> I tried again this morning :
>>
>> $ wget
>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>> --2016-03-18 07:55:30--
>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>> Resolving s3.amazonaws.com... 54.231.19.163
>> ...
>> $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz
>>
>> gzip: stdin: unexpected end of file
>> tar: Unexpected EOF in archive
>> tar: Unexpected EOF in archive
>> tar: Error is not recoverable: exiting now
>>
>> On Thu, Mar 17, 2016 at 8:57 AM, Michael Armbrust 
>> wrote:
>>>
>>> Patrick reuploaded the artifacts, so it should be fixed now.
>>>
>>> On Mar 16, 2016 5:48 PM, "Nicholas Chammas" 
>>> wrote:

 Looks like the other packages may also be corrupt. I’m getting the same
 error for the Spark 1.6.1 / Hadoop 2.4 package.


 https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz

 Nick


 On Wed, Mar 16, 2016 at 8:28 PM Ted Yu  wrote:
>
> On Linux, I got:
>
> $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz
>
> gzip: stdin: unexpected end of file
> tar: Unexpected EOF in archive
> tar: Unexpected EOF in archive
> tar: Error is not recoverable: exiting now
>
> On Wed, Mar 16, 2016 at 5:15 PM, Nicholas Chammas
>  wrote:
>>
>>
>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>>
>> Does anyone else have trouble unzipping this? How did this happen?
>>
>> What I get is:
>>
>> $ gzip -t spark-1.6.1-bin-hadoop2.6.tgz
>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: unexpected end of file
>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: uncompress failed
>>
>> Seems like a strange type of problem to come across.
>>
>> Nick
>
>
>>
>




Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-18 Thread Jakob Odersky
I just realized you're using a different download site. Sorry for the
confusion, the link I get for a direct download of Spark 1.6.1 /
Hadoop 2.6 is http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz

On Fri, Mar 18, 2016 at 3:20 PM, Nicholas Chammas
<nicholas.cham...@gmail.com> wrote:
> I just retried the Spark 1.6.1 / Hadoop 2.6 download and got a corrupt ZIP
> file.
>
> Jakob, are you sure the ZIP unpacks correctly for you? Is it the same Spark
> 1.6.1/Hadoop 2.6 package you had a success with?
>
> On Fri, Mar 18, 2016 at 6:11 PM Jakob Odersky <ja...@odersky.com> wrote:
>>
>> I just experienced the issue, however retrying the download a second
>> time worked. Could it be that there is some load balancer/cache in
>> front of the archive and some nodes still serve the corrupt packages?
>>
>> On Fri, Mar 18, 2016 at 8:00 AM, Nicholas Chammas
>> <nicholas.cham...@gmail.com> wrote:
>> > I'm seeing the same. :(
>> >
>> > On Fri, Mar 18, 2016 at 10:57 AM Ted Yu <yuzhih...@gmail.com> wrote:
>> >>
>> >> I tried again this morning :
>> >>
>> >> $ wget
>> >>
>> >> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>> >> --2016-03-18 07:55:30--
>> >>
>> >> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>> >> Resolving s3.amazonaws.com... 54.231.19.163
>> >> ...
>> >> $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz
>> >>
>> >> gzip: stdin: unexpected end of file
>> >> tar: Unexpected EOF in archive
>> >> tar: Unexpected EOF in archive
>> >> tar: Error is not recoverable: exiting now
>> >>
>> >> On Thu, Mar 17, 2016 at 8:57 AM, Michael Armbrust
>> >> <mich...@databricks.com>
>> >> wrote:
>> >>>
>> >>> Patrick reuploaded the artifacts, so it should be fixed now.
>> >>>
>> >>> On Mar 16, 2016 5:48 PM, "Nicholas Chammas"
>> >>> <nicholas.cham...@gmail.com>
>> >>> wrote:
>> >>>>
>> >>>> Looks like the other packages may also be corrupt. I’m getting the
>> >>>> same
>> >>>> error for the Spark 1.6.1 / Hadoop 2.4 package.
>> >>>>
>> >>>>
>> >>>>
>> >>>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz
>> >>>>
>> >>>> Nick
>> >>>>
>> >>>>
>> >>>> On Wed, Mar 16, 2016 at 8:28 PM Ted Yu <yuzhih...@gmail.com> wrote:
>> >>>>>
>> >>>>> On Linux, I got:
>> >>>>>
>> >>>>> $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz
>> >>>>>
>> >>>>> gzip: stdin: unexpected end of file
>> >>>>> tar: Unexpected EOF in archive
>> >>>>> tar: Unexpected EOF in archive
>> >>>>> tar: Error is not recoverable: exiting now
>> >>>>>
>> >>>>> On Wed, Mar 16, 2016 at 5:15 PM, Nicholas Chammas
>> >>>>> <nicholas.cham...@gmail.com> wrote:
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>> >>>>>>
>> >>>>>> Does anyone else have trouble unzipping this? How did this happen?
>> >>>>>>
>> >>>>>> What I get is:
>> >>>>>>
>> >>>>>> $ gzip -t spark-1.6.1-bin-hadoop2.6.tgz
>> >>>>>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: unexpected end of file
>> >>>>>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: uncompress failed
>> >>>>>>
>> >>>>>> Seems like a strange type of problem to come across.
>> >>>>>>
>> >>>>>> Nick
>> >>>>>
>> >>>>>
>> >>
>> >




Re: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-26 Thread Jakob Odersky
I would recommend (non-binding) option 1.

Apart from the API breakage I can see only advantages, and that sole
disadvantage is minimal for a few reasons:

1. the DataFrame API has been "Experimental" since its implementation,
so no stability was ever implied
2. considering that the change is for a major release some
incompatibilities are to be expected
3. using type aliases may break code now, but it will remove the
possibility of library incompatibilities in the future (see Reynold's
second point "[...] and we won't see type mismatches (e.g. a function
expects DataFrame, but user is passing in Dataset[Row]")
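
To make option 1 concrete, the alias amounts to something like the following sketch (illustrative only; in Spark itself it would live in the org.apache.spark.sql package object):

import org.apache.spark.sql.{Dataset, Row}

object Aliases {
  type DataFrame = Dataset[Row]
}

// To the compiler, `def f(df: Aliases.DataFrame)` and `def f(df: Dataset[Row])`
// are then exactly the same signature.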

On Fri, Feb 26, 2016 at 11:51 AM, Reynold Xin  wrote:
> That's actually not Row vs non-Row.
>
> It's just primitive vs non-primitive. Primitives get automatically
> flattened, to avoid having to type ._1 all the time.
>
> On Fri, Feb 26, 2016 at 2:06 AM, Sun, Rui  wrote:
>>
>> Thanks for the explaination.
>>
>>
>>
>> What confusing me is the different internal semantic of Dataset on non-Row
>> type (primitive types for example) and Row type:
>>
>>
>>
>> Dataset[Int] is internally actually Dataset[Row(value:Int)]
>>
>>
>>
>> scala> val ds = sqlContext.createDataset(Seq(1,2,3))
>>
>> ds: org.apache.spark.sql.Dataset[Int] = [value: int]
>>
>>
>>
>> scala> ds.schema.json
>>
>> res17: String =
>> {"type":"struct","fields":[{"name":"value","type":"integer","nullable":false,"metadata":{}}]}
>>
>>
>>
>> But obviously Dataset[Row] is not internally Dataset[Row(value: Row)].
>>
>>
>>
>> From: Reynold Xin [mailto:r...@databricks.com]
>> Sent: Friday, February 26, 2016 3:55 PM
>> To: Sun, Rui 
>> Cc: Koert Kuipers ; dev@spark.apache.org
>>
>>
>> Subject: Re: [discuss] DataFrame vs Dataset in Spark 2.0
>>
>>
>>
>> The join and joinWith are just two different join semantics, and is not
>> about Dataset vs DataFrame.
>>
>>
>>
>> join is the relational join, where fields are flattened; joinWith is more
>> like a tuple join, where the output has two fields that are nested.
>>
>>
>>
>> So you can do
>>
>>
>>
>> Dataset[A] joinWith Dataset[B] = Dataset[(A, B)]
>>
>>
>> DataFrame[A] joinWith DataFrame[B] = Dataset[(Row, Row)]
>>
>>
>>
>> Dataset[A] join Dataset[B] = Dataset[Row]
>>
>>
>>
>> DataFrame[A] join DataFrame[B] = Dataset[Row]
>>
>>
>>
>>
>>
>>
>>
>> On Thu, Feb 25, 2016 at 11:37 PM, Sun, Rui  wrote:
>>
>> Vote for option 2.
>>
>> Source compatibility and binary compatibility are very important from
>> user’s perspective.
>>
>> It ‘s unfair for Java developers that they don’t have DataFrame
>> abstraction. As you said, sometimes it is more natural to think about
>> DataFrame.
>>
>>
>>
>> I am wondering if conceptually there is slight subtle difference between
>> DataFrame and Dataset[Row]? For example,
>>
>> Dataset[T] joinWith Dataset[U]  produces Dataset[(T, U)]
>>
>> So,
>>
>> Dataset[Row] joinWith Dataset[Row]  produces Dataset[(Row, Row)]
>>
>>
>>
>> While
>>
>> DataFrame join DataFrame is still DataFrame of Row?
>>
>>
>>
>> From: Reynold Xin [mailto:r...@databricks.com]
>> Sent: Friday, February 26, 2016 8:52 AM
>> To: Koert Kuipers 
>> Cc: dev@spark.apache.org
>> Subject: Re: [discuss] DataFrame vs Dataset in Spark 2.0
>>
>>
>>
>> Yes - and that's why source compatibility is broken.
>>
>>
>>
>> Note that it is not just a "convenience" thing. Conceptually DataFrame is
>> a Dataset[Row], and for some developers it is more natural to think about
>> "DataFrame" rather than "Dataset[Row]".
>>
>>
>>
>> If we were in C++, DataFrame would've been a type alias for Dataset[Row]
>> too, and some methods would return DataFrame (e.g. sql method).
>>
>>
>>
>>
>>
>>
>>
>> On Thu, Feb 25, 2016 at 4:50 PM, Koert Kuipers  wrote:
>>
>> since a type alias is purely a convenience thing for the scala compiler,
>> does option 1 mean that the concept of DataFrame ceases to exist from a java
>> perspective, and they will have to refer to Dataset?
>>
>>
>>
>> On Thu, Feb 25, 2016 at 6:23 PM, Reynold Xin  wrote:
>>
>> When we first introduced Dataset in 1.6 as an experimental API, we wanted
>> to merge Dataset/DataFrame but couldn't because we didn't want to break the
>> pre-existing DataFrame API (e.g. map function should return Dataset, rather
>> than RDD). In Spark 2.0, one of the main API changes is to merge DataFrame
>> and Dataset.
>>
>>
>>
>> Conceptually, DataFrame is just a Dataset[Row]. In practice, there are two
>> ways to implement this:
>>
>>
>>
>> Option 1. Make DataFrame a type alias for Dataset[Row]
>>
>>
>>
>> Option 2. DataFrame as a concrete class that extends Dataset[Row]
>>
>>
>>
>>
>>
>> I'm wondering what you think about this. The pros and cons I can think of
>> are:
>>
>>
>>
>>
>>
>> Option 1. Make DataFrame a type alias for Dataset[Row]
>>
>>
>>
>> + Cleaner conceptually, especially in Scala. It will be very clear what
>> libraries or applications 

Re: Scala 2.11 default build

2016-02-01 Thread Jakob Odersky
Awesome!
+1 on Steve Loughran's question: how does this affect support for
2.10? Do future contributions need to work with Scala 2.10?

cheers

On Mon, Feb 1, 2016 at 7:02 AM, Ted Yu  wrote:
> The following jobs have been established for build against Scala 2.10:
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/SPARK-master-COMPILE-MAVEN-SCALA-2.10/
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/SPARK-master-COMPILE-sbt-SCALA-2.10/
>
> FYI
>
> On Mon, Feb 1, 2016 at 4:22 AM, Steve Loughran 
> wrote:
>>
>>
>> On 30 Jan 2016, at 08:22, Reynold Xin  wrote:
>>
>> FYI - I just merged Josh's pull request to switch to Scala 2.11 as the
>> default build.
>>
>> https://github.com/apache/spark/pull/10608
>>
>>
>>
>> does this mean that Spark 2.10 compatibility & testing are no longer
>> needed?
>
>




Re: spark job scheduling

2016-01-27 Thread Jakob Odersky
Nitpick: the up-to-date version of said wiki page is
https://spark.apache.org/docs/1.6.0/job-scheduling.html (not sure how
much it changed though)
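
For readers following the thread, a minimal sketch of enabling fair scheduling and binding a thread's jobs to a named pool (the pool name is illustrative; non-default weights also require an allocation file, as described on that page):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("scheduling-example")
  .set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)

sc.setLocalProperty("spark.scheduler.pool", "myPool")  // applies to jobs from this thread
sc.parallelize(1 to 1000).count()                      // runs in "myPool"
sc.setLocalProperty("spark.scheduler.pool", null)      // revert to the default pool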

On Wed, Jan 27, 2016 at 7:50 PM, Chayapan Khannabha  wrote:
> I would start at this wiki page
> https://spark.apache.org/docs/1.2.0/job-scheduling.html
>
> Although I'm sure this depends a lot on your cluster environment and the
> deployed Spark version.
>
> IMHO
>
> On Thu, Jan 28, 2016 at 10:27 AM, Niranda Perera 
> wrote:
>>
>> Sorry, I made some typos. Let me rephrase:
>>
>> 1. As I understand it, the smallest unit of work an executor can perform is
>> a 'task'. In the 'FAIR' scheduler mode, let's say a job with a considerable
>> amount of work to do in a single task is submitted to the Spark context.
>> While such a 'big' task is running, can we still submit another, smaller job
>> (from a separate thread) and get it done? Or does that smaller job have to
>> wait until the bigger task finishes and the resources are freed from the
>> executor?
>> (Essentially, what I'm asking is: in the FAIR scheduler mode, jobs are
>> scheduled fairly, but at the task granularity are they still FIFO?)
>>
>> 2. When a job is submitted without setting a scheduler pool, the 'default'
>> scheduler pool is assigned to it, which employs FIFO scheduling. But what
>> happens when spark.scheduler.mode is FAIR and I submit jobs without
>> specifying a (FAIR) scheduler pool? Would the jobs still run in FIFO mode
>> in the default pool?
>> Essentially, to really get FAIR scheduling, do we also have to assign a
>> FAIR scheduler pool to the job?
>>
>> On Thu, Jan 28, 2016 at 8:47 AM, Chayapan Khannabha 
>> wrote:
>>>
>>> I think the smallest unit of work is a "Task", and an "Executor" is
>>> responsible for getting the work done? Would like to understand more about
>>> the scheduling system too. Scheduling strategy like FAIR or FIFO do have
>>> significant impact on a Spark cluster architecture design decision.
>>>
>>> Best,
>>>
>>> Chayapan (A)
>>>
>>> On Thu, Jan 28, 2016 at 10:07 AM, Niranda Perera
>>>  wrote:

 hi all,

 I have a few questions on spark job scheduling.

 1. As I understand, the smallest unit of work an executor can perform.
 In the 'fair' scheduler mode, let's say  a job is submitted to the spark 
 ctx
 which has a considerable amount of work to do in a task. While such a 'big'
 task is running, can we still submit another smaller job (from a separate
 thread) and get it done? or does that smaller job has to wait till the
 bigger task finishes and the resources are freed from the executor?

 2. When a job is submitted without setting a scheduler pool, the default
 scheduler pool is assigned to it, which employs FIFO scheduling. but what
 happens when we have the spark.scheduler.mode as FAIR, and if I submit jobs
 without specifying a scheduler pool (which has FAIR scheduling)? would the
 jobs still run in FIFO mode with the default pool?
 essentially, for us to really set FAIR scheduling, do we have to assign
 a FAIR scheduler pool?

 best

 --
 Niranda
 @n1r44
 +94-71-554-8430
 https://pythagoreanscript.wordpress.com/
>>>
>>>
>>
>>
>>
>> --
>> Niranda
>> @n1r44
>> +94-71-554-8430
>> https://pythagoreanscript.wordpress.com/
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Mutiple spark contexts

2016-01-27 Thread Jakob Odersky
A while ago, I remember reading that multiple active Spark contexts
per JVM was a possible future enhancement.
I was wondering if this is still on the roadmap, what the major
obstacles are and if I can be of any help in adding this feature?

regards,
--Jakob

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Fastest way to build Spark from scratch

2015-12-07 Thread Jakob Odersky
make-distribution and the second code snippet both create a distribution
from a clean state. They therefore require that every source file be
compiled and that takes time (you can maybe tweak some settings or use a
newer compiler to gain some speed).

I'm inferring from your question that for your use case deployment speed is
a critical issue; furthermore, you'd like to build Spark for lots of
(maybe every?) commits in a systematic way.
using the second code snippet without the `clean` task and only resort to
it if the build fails.

On my local machine, running the assembly without a clean drops the build time
from 6 minutes to 2.

regards,
--Jakob



On 23 November 2015 at 20:18, Nicholas Chammas 
wrote:

> Say I want to build a complete Spark distribution against Hadoop 2.6+ as
> fast as possible from scratch.
>
> This is what I’m doing at the moment:
>
> ./make-distribution.sh -T 1C -Phadoop-2.6
>
> -T 1C instructs Maven to spin up 1 thread per available core. This takes
> around 20 minutes on an m3.large instance.
>
> I see that spark-ec2, on the other hand, builds Spark as follows
> 
> when you deploy Spark at a specific git commit:
>
> sbt/sbt clean assembly
> sbt/sbt publish-local
>
> This seems slower than using make-distribution.sh, actually.
>
> Is there a faster way to do this?
>
> Nick
>
>


Datasets on experimental dataframes?

2015-11-23 Thread Jakob Odersky
Hi,

datasets are being built upon the experimental DataFrame API, does this
mean DataFrames won't be experimental in the near future?

thanks,
--Jakob


Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Jakob Odersky
Hey Jeff,
Do you mean reading from multiple text files? In that case, as a
workaround, you can use the RDD#union() (or ++) method to concatenate
multiple RDDs. For example:

val lines1 = sc.textFile("file1")
val lines2 = sc.textFile("file2")

val rdd = lines1 union lines2

regards,
--Jakob
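
As a further note, textFile already accepts comma-separated paths and Hadoop
glob patterns, so a hypothetical SparkContext#textFiles helper could be little
more than a thin wrapper over union. A rough sketch, assuming the same `sc` as
above and made-up file paths (textFiles itself is not an existing API):

// textFile already understands comma-separated paths and glob patterns.
val manyFiles = sc.textFile("data/file1.txt,data/file2.txt")
val globbed = sc.textFile("data/part-*.txt")

// One possible shape for the proposed helper, built on SparkContext#union:
def textFiles(paths: Seq[String]) = sc.union(paths.map(p => sc.textFile(p)))

val rddAll = textFiles(Seq("data/file1.txt", "data/file2.txt", "data/file3.txt"))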

On 11 November 2015 at 01:20, Jeff Zhang  wrote:

> Although users can use the HDFS glob syntax to read multiple inputs, it is
> sometimes not convenient to do so. I'm not sure why there's no
> SparkContext#textFiles API; it should be easy to implement. I'd love to
> create a ticket and contribute it if there's no other consideration
> that I'm not aware of.
>
> --
> Best Regards
>
> Jeff Zhang
>


Re: State of the Build

2015-11-06 Thread Jakob Odersky
Reposting to the list...

Thanks for all the feedback, everyone; I now have a clearer picture of the
reasoning and implications.

Koert, according to your post in this thread
http://apache-spark-developers-list.1001551.n3.nabble.com/Master-build-fails-tt14895.html#a15023,
it is apparently very easy to change the maven resolution mechanism to the
ivy one.
Patrick, would this not help with the problems you described?

On 5 November 2015 at 23:23, Patrick Wendell <pwend...@gmail.com> wrote:

> Hey Jakob,
>
> The builds in Spark are largely maintained by me, Sean, and Michael
> Armbrust (for SBT). For historical reasons, Spark supports both a Maven and
> SBT build. Maven is the build of reference for packaging Spark and is used
> by many downstream packagers and to build all Spark releases. SBT is more
> often used by developers. Both builds inherit from the same pom files (and
> rely on the same profiles) to minimize maintenance complexity of Spark's
> very complex dependency graph.
>
> If you are looking to make contributions that help with the build, I am
> happy to point you towards some things that are consistent maintenance
> headaches. There are two major pain points right now that I'd be thrilled
> to see fixes for:
>
> 1. SBT relies on a different dependency conflict resolution strategy than
> maven - causing all kinds of headaches for us. I have heard that newer
> versions of SBT can (maybe?) use Maven as a dependency resolver instead of
> Ivy. This would make our life so much better if it were possible, either by
> virtue of upgrading SBT or somehow doing this ourselves.
>
> 2. We don't have a great way of auditing the net effect of dependency
> changes when people make them in the build. I am working on a fairly clunky
> patch to do this here:
>
> https://github.com/apache/spark/pull/8531
>
> It could be done much more nicely using SBT, but only provided (1) is
> solved.
>
> Doing a major overhaul of the sbt build to decouple it from pom files, I'm
> not sure that's the best place to start, given that we need to continue to
> support maven - the coupling is intentional. But getting involved in the
> build in general would be completely welcome.
>
> - Patrick
>
> On Thu, Nov 5, 2015 at 10:53 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> Maven isn't 'legacy', or supported for the benefit of third parties.
>> SBT had some behaviors / problems that Maven didn't relative to what
>> Spark needs. SBT is a development-time alternative only, and partly
>> generated from the Maven build.
>>
>> On Fri, Nov 6, 2015 at 1:48 AM, Koert Kuipers <ko...@tresata.com> wrote:
>> > People who do upstream builds of spark (think bigtop and hadoop
>> distros) are
>> > used to legacy systems like maven, so maven is the default build. I
>> don't
>> > think it will change.
>> >
>> > Any improvements for the sbt build are of course welcome (it is still
>> used
>> > by many developers), but i would not do anything that increases the
>> burden
>> > of maintaining two build systems.
>> >
>> > On Nov 5, 2015 18:38, "Jakob Odersky" <joder...@gmail.com> wrote:
>> >>
>> >> Hi everyone,
>> >> in the process of learning Spark, I wanted to get an overview of the
>> >> interaction between all of its sub-projects. I therefore decided to
>> have a
>> >> look at the build setup and its dependency management.
>> >> Since I am a lot more comfortable using sbt than maven, I decided to
>> try to
>> >> port the maven configuration to sbt (with the help of automated tools).
>> >> This led me to a couple of observations and questions on the build
>> system
>> >> design:
>> >>
>> >> First, currently, there are two build systems, maven and sbt. Is there
>> a
>> >> preferred tool (or future direction to one)?
>> >>
>> >> Second, the sbt build also uses maven "profiles" requiring the use of
>> >> specific commandline parameters when starting sbt. Furthermore, since
>> it
>> >> relies on maven poms, dependencies to the scala binary version (_2.xx)
>> are
>> >> hardcoded and require running an external script when switching
>> versions.
>> >> Sbt could leverage built-in constructs to support cross-compilation and
>> >> emulate profiles with configurations and new build targets. This would
>> >> remove external state from the build (in that no extra steps need to be
>> >> performed in a particular order to generate artifacts for a new
>> >> configuration) and therefore improve stability and build
>> reproducibility
>> >> (maybe even build performance). I was wondering if implementing such
>> >> functionality for the sbt build would be welcome?
>> >>
>> >> thanks,
>> >> --Jakob
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


State of the Build

2015-11-05 Thread Jakob Odersky
Hi everyone,
in the process of learning Spark, I wanted to get an overview of the
interaction between all of its sub-projects. I therefore decided to have a
look at the build setup and its dependency management.
Since I am a lot more comfortable using sbt than maven, I decided to try to
port the maven configuration to sbt (with the help of automated tools).
This led me to a couple of observations and questions on the build system
design:

First, currently, there are two build systems, maven and sbt. Is there a
preferred tool (or future direction to one)?

Second, the sbt build also uses maven "profiles" requiring the use of
specific commandline parameters when starting sbt. Furthermore, since it
relies on maven poms, dependencies to the scala binary version (_2.xx) are
hardcoded and require running an external script when switching versions.
Sbt could leverage built-in constructs to support cross-compilation and
emulate profiles with configurations and new build targets. This would
remove external state from the build (in that no extra steps need to be
performed in a particular order to generate artifacts for a new
configuration) and therefore improve stability and build reproducibility
(maybe even build performance). I was wondering if implementing such
functionality for the sbt build would be welcome?

thanks,
--Jakob
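
To illustrate the cross-compilation point in the second paragraph, a minimal
build.sbt fragment using sbt's built-in mechanism could look like the
following; the project name, Scala versions, and dependency are illustrative
only and have nothing to do with Spark's actual build.

// build.sbt: an illustrative fragment only, not Spark's real build.
name := "cross-build-sketch"

scalaVersion := "2.11.7"

// With crossScalaVersions set, prefixing a task with '+' (e.g. `sbt +package`)
// runs it for every listed Scala version, appending the _2.xx suffix to the
// artifacts automatically.
crossScalaVersions := Seq("2.10.6", "2.11.7")

// %% resolves the dependency against the Scala binary version in use.
libraryDependencies += "org.scalatest" %% "scalatest" % "2.2.6" % "test"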


Re: Insight into Spark Packages

2015-10-16 Thread Jakob Odersky
[repost to mailing list]

I don't know much about packages, but have you heard about the
sbt-spark-package plugin?
Looking at the code, specifically
https://github.com/databricks/sbt-spark-package/blob/master/src/main/scala/sbtsparkpackage/SparkPackagePlugin.scala,
might give you insight on the details about package creation. Package
submission is implemented in
https://github.com/databricks/sbt-spark-package/blob/master/src/main/scala/sbtsparkpackage/SparkPackageHttp.scala

At a quick first overview, it seems packages are bundled as maven artifacts
and then posted to "http://spark-packages.org/api/submit-release".

Hope this helps with your last question.
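
For the last point ("how does a package get created"), a rough sketch of what a
project using that plugin might declare in its sbt build is below. The key
names are taken from the plugin as I remember them and every value is
illustrative, so please double-check against the plugin's README.

// project/plugins.sbt (plugin version is illustrative)
addSbtPlugin("org.spark-packages" % "sbt-spark-package" % "0.2.3")

// build.sbt
spName := "myorg/my-spark-package"  // the "org/name" id shown on spark-packages.org
sparkVersion := "1.5.1"             // Spark version to compile against
sparkComponents += "sql"            // also depend on spark-sql as "provided"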

On 16 October 2015 at 08:43, jeff saremi  wrote:

> I'm looking for any form of documentation on Spark Packages
> Specifically, what happens when one issues a command like the following:
>
>
> $SPARK_HOME/bin/spark-shell --packages RedisLabs:spark-redis:0.1.0
>
>
> Something like an architecture diagram.
> What happens when this package gets submitted?
> Does this need to be done each time?
> Is that package downloaded each time?
> Is there a persistent cache on the server (master i guess)?
> Can these packages be installed offline with no Internet connectivity?
> How does a package get created?
>
> and so on and so forth
>


Re: Spark Event Listener

2015-10-13 Thread Jakob Odersky
the path of the source file defining the event API is
`core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala`

On 13 October 2015 at 16:29, Jakob Odersky <joder...@gmail.com> wrote:

> Hi,
> I came across the spark listener API while checking out possible UI
> extensions recently. I noticed that all events inherit from a sealed trait
> `SparkListenerEvent` and that a SparkListener has a corresponding
> `onEventXXX(event)` method for every possible event.
>
> Considering that events inherit from a sealed trait and thus all events
> are known during compile-time, what is the rationale of using specific
> methods for every event rather than a single method that would let a client
> pattern match on the type of event?
>
> I don't know the internals of the pattern matcher, but again, considering
> events are sealed, I reckon that matching performance should not be an
> issue.
>
> thanks,
> --Jakob
>


Spark Event Listener

2015-10-13 Thread Jakob Odersky
Hi,
I came across the spark listener API while checking out possible UI
extensions recently. I noticed that all events inherit from a sealed trait
`SparkListenerEvent` and that a SparkListener has a corresponding
`onEventXXX(event)` method for every possible event.

Considering that events inherit from a sealed trait and thus all events are
known during compile-time, what is the rationale of using specific methods
for every event rather than a single method that would let a client pattern
match on the type of event?

I don't know the internals of the pattern matcher, but again, considering
events are sealed, I reckon that matching performance should not be an
issue.

thanks,
--Jakob
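
To illustrate the two styles being contrasted, here is a toy sketch with
simplified event types; these are not Spark's actual listener classes.

// Toy illustration of the two listener styles.
sealed trait Event
case class JobStart(id: Int) extends Event
case class JobEnd(id: Int) extends Event

// Style 1: one callback per event type (the shape SparkListener uses).
trait PerEventListener {
  def onJobStart(event: JobStart): Unit = ()
  def onJobEnd(event: JobEnd): Unit = ()
}

// Style 2: a single callback; clients pattern match on the sealed type,
// and the compiler can warn about non-exhaustive matches.
trait SingleCallbackListener {
  def onEvent(event: Event): Unit
}

object ListenerSketch extends App {
  val listener = new SingleCallbackListener {
    def onEvent(event: Event): Unit = event match {
      case JobStart(id) => println(s"job $id started")
      case JobEnd(id)   => println(s"job $id finished")
    }
  }
  Seq(JobStart(1), JobEnd(1)).foreach(listener.onEvent)
}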


Live UI

2015-10-12 Thread Jakob Odersky
Hi everyone,
I am just getting started working on spark and was thinking of a first way
to contribute whilst still trying to wrap my head around the codebase.

Exploring the web UI, I noticed it is a classic request-response website,
requiring manual refresh to get the latest data.
I think it would be great to have a "live" website where data is displayed in
real time, without the need to hit the refresh button. I would be very
interested in contributing this feature if it is acceptable.

Specifically, I was thinking of using websockets with a ScalaJS front-end.
Please let me know if this design would be welcome or if it introduces
unwanted dependencies, I'll be happy to discuss this further in detail.

thanks for your feedback,
--Jakob