Re: Opening up metrics interfaces

2015-08-27 Thread Thomas Dudziak
+1. I'd love to simply define a timer in my code (maybe via metrics-scala?)
using Spark's metrics registry. Also, maybe switch to the newer version
(io.dropwizard.metrics)?
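
A minimal sketch of what that could look like with metrics-scala
(nl.grons.metrics.scala), assuming Spark handed out its Codahale
MetricRegistry somewhere; the sparkRegistry handle at the end is
hypothetical, not an existing API:

  import com.codahale.metrics.MetricRegistry
  import nl.grons.metrics.scala.InstrumentedBuilder

  class IngestJob(val metricRegistry: MetricRegistry) extends InstrumentedBuilder {
    // Timer published through whatever registry Spark would expose.
    private val parseTimer = metrics.timer("ingest.parse")

    def parse(line: String): Array[String] = parseTimer.time {
      line.split(",")
    }
  }

  // val job = new IngestJob(sparkRegistry)  // sparkRegistry: hypothetical exposed registry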

On Thu, Aug 27, 2015 at 4:42 PM, Reynold Xin r...@databricks.com wrote:

 I'd like this to happen, but it hasn't been super high priority on
 anybody's mind.

 There are a couple things that could be good to do:

 1. At the application level: consolidate task metrics and accumulators.
 They have substantial overlap, and at a high level they should just be
 consolidated. Maybe there are some differences in semantics w.r.t. retries
 or fault tolerance, but those can just be modes in the consolidated
 interface/implementation.

 Once we do that, users can effectively use the new consolidated interface
 to add new metrics.
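
 (As a rough illustration of the application-level side as it stands today,
 named accumulators are the closest thing to user-defined metrics; the
 metric names and input path below are made up:)

   import org.apache.spark.{SparkConf, SparkContext}

   val sc = new SparkContext(new SparkConf().setAppName("accumulator-metrics"))

   // Named accumulators show up per stage in the Spark 1.x web UI.
   val parsed = sc.accumulator(0L, "records parsed")
   val malformed = sc.accumulator(0L, "records malformed")

   sc.textFile("hdfs:///some/input").foreach { line =>
     if (line.contains(",")) parsed += 1L else malformed += 1L
   }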

 2. At the process/service monitoring level: expose an internal metrics
 interface to make it easier to create new metrics and publish them via a
 REST interface. Last time I looked at this (~4 weeks ago), publication of
 the current metrics was not as straightforward as I was hoping for. We use
 the Codahale library only in some places (IIRC just the cluster manager,
 but not the actual executors). It'd make sense to create a simple wrapper
 for the Codahale library and make it easier to create new metrics.
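
 (A sketch of the kind of thin wrapper being suggested; the class and metric
 names are made up, not an existing Spark API:)

   import com.codahale.metrics.{Counter, MetricRegistry, Timer}

   // Thin convenience layer over a shared Codahale registry.
   class MetricsNamespace(registry: MetricRegistry, prefix: String) {
     def counter(name: String): Counter = registry.counter(prefix + "." + name)
     def timer(name: String): Timer = registry.timer(prefix + "." + name)
   }

   // e.g. inside an executor component:
   //   val executorMetrics = new MetricsNamespace(registry, "executor")
   //   val fetchWait = executorMetrics.timer("shuffle.fetchWait")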


 On Thu, Aug 27, 2015 at 12:21 PM, Atsu Kakitani atkakit...@groupon.com
 wrote:

 Hi,

 I was wondering if there are any plans to open up the API for Spark's
 metrics system. I want to write custom sources and sinks, but these
 interfaces aren't public right now. I saw that there was also an issue open
 for this (https://issues.apache.org/jira/browse/SPARK-5630), but it
 hasn't been addressed - is there a reason why these interfaces are kept
 private?
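
 (For reference, a custom source would look roughly like the sketch below;
 the shape of the Source trait is inferred from the internal code, and since
 the trait is private[spark] today, this only compiles when the class is
 placed in an org.apache.spark package, which is exactly the problem being
 raised:)

   import com.codahale.metrics.{Gauge, MetricRegistry}
   import org.apache.spark.metrics.source.Source

   class MyAppSource extends Source {
     override val sourceName: String = "myApp"
     override val metricRegistry: MetricRegistry = new MetricRegistry()

     // Example gauge; the value function is a placeholder.
     metricRegistry.register("queueDepth", new Gauge[Int] {
       override def getValue: Int = 0
     })
   }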

 Thanks,
 Atsu





Re: Developer API plugins for Hive & Hadoop?

2015-08-13 Thread Thomas Dudziak
Unfortunately it doesn't, because our version of Hive has different syntax
elements and thus I need to patch them in (along with a few other minor
things). It would be great if there were a developer API at a somewhat
higher level.

On Thu, Aug 13, 2015 at 2:19 PM, Reynold Xin r...@databricks.com wrote:

 I believe for Hive, there is already a client interface that can be used
 to build clients for different Hive metastores. That should also work for
 your heavily forked one.
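
 (A hedged sketch of using the metastore-client configs that, as of 1.4, let
 Spark SQL talk to a different metastore version; the version string and jar
 paths below are examples only:)

   import org.apache.spark.{SparkConf, SparkContext}
   import org.apache.spark.sql.hive.HiveContext

   val conf = new SparkConf()
     .setAppName("custom-metastore")
     // IIRC 0.12.0 and 0.13.1 are the versions 1.4 knows about.
     .set("spark.sql.hive.metastore.version", "0.12.0")
     // "builtin", "maven", or a classpath pointing at the metastore client jars.
     .set("spark.sql.hive.metastore.jars", "/opt/hive/lib/*:/opt/hadoop/client/*")
   val sc = new SparkContext(conf)
   val hive = new HiveContext(sc)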

 For Hadoop, it is definitely a bigger project to refactor. A good way to
 start evaluating this is to list what needs to be changed. Maybe you can
 start by telling us what you need to change for every upgrade? Feel free to
 email me in private if this is sensitive and you don't want to share it on
 a public list.






 On Thu, Aug 13, 2015 at 2:01 PM, Thomas Dudziak tom...@gmail.com wrote:

 Hi,

 I have asked this before but didn't receive any comments; with the
 impending release of 1.5 I wanted to bring this up again.
 Right now, Spark is very tightly coupled with OSS Hive & Hadoop, which
 causes me a lot of work every time there is a new version because I don't
 run OSS Hive/Hadoop versions (and before you ask, I can't).

 My question is: does Spark need to be so tightly coupled with these two?
 Or, put differently, would it be possible to introduce a developer API
 between Spark (up to and including e.g. SqlContext) and Hadoop (for the
 HDFS bits) and Hive (e.g. HiveContext and beyond), and move the actual
 Hadoop & Hive dependencies into plugins (e.g. separate maven modules)?
 This would allow me to easily maintain my own Hive/Hadoop-ish integration
 with our internal systems without ever having to touch Spark code.
 I expect this could also allow, for instance, Hadoop vendors to provide
 their own, more optimized implementations without Spark having to know
 about them.

 cheers,
 Tom





Developer API plugins for Hive & Hadoop?

2015-08-13 Thread Thomas Dudziak
Hi,

I have asked this before but didn't receive any comments; with the
impending release of 1.5 I wanted to bring this up again.
Right now, Spark is very tightly coupled with OSS Hive & Hadoop, which
causes me a lot of work every time there is a new version because I don't
run OSS Hive/Hadoop versions (and before you ask, I can't).

My question is: does Spark need to be so tightly coupled with these two?
Or, put differently, would it be possible to introduce a developer API
between Spark (up to and including e.g. SqlContext) and Hadoop (for the
HDFS bits) and Hive (e.g. HiveContext and beyond), and move the actual
Hadoop & Hive dependencies into plugins (e.g. separate maven modules)?
This would allow me to easily maintain my own Hive/Hadoop-ish integration
with our internal systems without ever having to touch Spark code.
I expect this could also allow, for instance, Hadoop vendors to provide
their own, more optimized implementations without Spark having to know
about them.
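
To make the idea concrete, a purely hypothetical sketch of the kind of plugin
SPI this could be (none of these traits or the discovery mechanism exist in
Spark today):

  import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

  // Hypothetical Hive-facing seam, shipped as a separate maven module.
  trait MetastorePlugin {
    def name: String
    def lookupRelation(database: Option[String], table: String): LogicalPlan
    def runNativeCommand(sql: String): Seq[String]
  }

  // Hypothetical HDFS-facing seam.
  trait StoragePlugin {
    def open(path: String): java.io.InputStream
    def list(path: String): Seq[String]
  }

  // Spark core would discover implementations at runtime, e.g. via
  // java.util.ServiceLoader, instead of depending on Hive/Hadoop directly.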

cheers,
Tom


Hive 0.12 support in 1.4.0?

2015-06-17 Thread Thomas Dudziak
So I'm a little confused: has Hive 0.12 support disappeared in 1.4.0? The
release notes didn't mention anything, but the documentation no longer lists
a way to build for 0.12 (
http://spark.apache.org/docs/latest/building-spark.html#building-with-hive-and-jdbc-support,
in fact it doesn't list anything other than 0.13), and I don't see any
maven profiles or code for 0.12.

Tom


Re: Remove Hadoop 1 support (Hadoop <2.2) for Spark 1.5?

2015-06-12 Thread Thomas Dudziak
The 0.23 (and Hive 0.12) code base in Spark works well from our perspective,
so I'm not sure what you are referring to. As I said, I'm happy to maintain
my own plugins, but as it stands there is no sane way to do so in Spark
because there are no clear separation/developer APIs for these.

cheers,
Tom

On Fri, Jun 12, 2015 at 11:21 AM, Sean Owen so...@cloudera.com wrote:

 I don't imagine that can be guaranteed to be supported anyway... the
 0.x branch has never necessarily worked with Spark, even if it might
 happen to. Is this really something you would veto for everyone
 because of your deployment?

 On Fri, Jun 12, 2015 at 7:18 PM, Thomas Dudziak tom...@gmail.com wrote:
  -1 to this, we use it with an old Hadoop version (well, a fork of an old
  version, 0.23). That being said, if there were a nice developer api that
  separates Spark from Hadoop (or rather, two APIs, one for scheduling and
  one for HDFS), then we'd be happy to maintain our own plugins for those.

  cheers,
  Tom
 



Re: Remove Hadoop 1 support (Hadoop <2.2) for Spark 1.5?

2015-06-12 Thread Thomas Dudziak
-1 to this, we use it with an old Hadoop version (well, a fork of an old
version, 0.23). That being said, if there were a nice developer api that
separates Spark from Hadoop (or rather, two APIs, one for scheduling and
one for HDFS), then we'd be happy to maintain our own plugins for those.

cheers,
Tom

On Fri, Jun 12, 2015 at 9:42 AM, Sean Owen so...@cloudera.com wrote:

 On Fri, Jun 12, 2015 at 5:12 PM, Patrick Wendell pwend...@gmail.com
 wrote:
  I would like to understand though Sean - what is the proposal exactly?
  Hadoop 2 itself supports all of the Hadoop 1 APIs, so things like
  removing the Hadoop 1 variant of sc.hadoopFile, etc., I don't think

 Not entirely; you can see some binary incompatibilities that have
 bitten recently. A Hadoop 1 program does not in general work on Hadoop
 2 because of this.
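
 (For readers following along, the two flavors in question look roughly like
 this in spark-shell, where sc is the SparkContext and the path is a
 placeholder:)

   import org.apache.hadoop.io.{LongWritable, Text}
   import org.apache.hadoop.mapred.{TextInputFormat => OldTextInputFormat}
   import org.apache.hadoop.mapreduce.lib.input.{TextInputFormat => NewTextInputFormat}

   // Hadoop 1 style API (org.apache.hadoop.mapred)
   val oldApi = sc.hadoopFile[LongWritable, Text, OldTextInputFormat]("hdfs:///some/path")

   // Hadoop 2 style API (org.apache.hadoop.mapreduce)
   val newApi = sc.newAPIHadoopFile[LongWritable, Text, NewTextInputFormat]("hdfs:///some/path")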

 Part of my thinking is that I'm not clear that Hadoop 1.x and 2.0.x fully
 work anymore anyway. See for example SPARK-8057 recently. I recall
 similar problems with Hadoop 2.0.x-era releases and the Spark build
 for that, which is basically the 'cdh4' build.

 So one benefit is skipping whatever work would be needed to continue
 to fix this up, and the argument is that there may be less loss of
 functionality than it seems. The other is being able to use later
 APIs, though that much is fairly minor.


  The main reason I'd push back is that I do think there are still
  people running the older versions. For instance at Databricks we use
  the FileSystem library for talking to S3... every time we've tried to
  upgrade to Hadoop 2.X there have been significant regressions in
  performance and we've had to downgrade. That's purely anecdotal, but I
  think you have people out there using the Hadoop 1 bindings for whom
  upgrade would be a pain.

 Yeah, that's the question. Is anyone out there using 1.x? More
 anecdotes wanted. That might be the most interesting question.

 No CDH customers would have been for a long while now, for example.
 (Still a small number of CDH 4 customers out there though, and that's
 2.0.x or so, but that's a gray area.)

 Is the S3 library thing really related to Hadoop 1.x? That comes from
 jets3t, and that's independent.


  In terms of our maintenance cost, the much bigger cost for us IMO is
  dealing with differences between e.g. 2.2, 2.4, and 2.6, where major
  new APIs were added. In comparison the Hadoop 1 vs 2 seems

 Really? I'd say the opposite. No APIs that are only in 2.2, let alone
 only in a later version, can be in use now, right? Otherwise 1.x wouldn't
 work at all. I don't know of any binary incompatibilities among the 2.x
 releases of the kind that exist between 1.x and 2.x, which we have had
 to shim to make work.

 In both cases dependencies have to be harmonized here and there, yes.
 That won't change.
