Re: Opening up metrics interfaces
+1. I'd love to simply define a timer in my code (maybe metrics-scala?) using Spark's metrics registry. Also, maybe switch to the newer version (io.dropwizard.metrics)?

On Thu, Aug 27, 2015 at 4:42 PM, Reynold Xin r...@databricks.com wrote:

I'd like this to happen, but it hasn't been super high priority on anybody's mind. There are a couple of things that would be good to do:

1. At the application level: consolidate task metrics and accumulators. They have substantial overlap and, at a high level, should just be consolidated. Maybe there are some differences in semantics w.r.t. retries or fault tolerance, but those can just be modes in the consolidated interface/implementation. Once we do that, users can effectively use the new consolidated interface to add new metrics.

2. At the process/service monitoring level: expose an internal metrics interface to make it easier to create new metrics and publish them via a REST interface. Last time I looked at this (~4 weeks ago), publication of the current metrics was not as straightforward as I was hoping for. We use the Codahale library only in some places (IIRC just the cluster manager, but not the actual executors). It would make sense to create a simple wrapper around the Codahale library and make it easier to create new metrics.

On Thu, Aug 27, 2015 at 12:21 PM, Atsu Kakitani atkakit...@groupon.com wrote:

Hi,

I was wondering if there are any plans to open up the API for Spark's metrics system. I want to write custom sources and sinks, but these interfaces aren't public right now. I saw that there was also an issue open for this (https://issues.apache.org/jira/browse/SPARK-5630), but it hasn't been addressed - is there a reason why these interfaces are kept private?

Thanks,
Atsu
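For context, what is being asked for would look roughly like the sketch below. This is only an illustration: Spark's Source trait is currently private[spark], so user code cannot actually extend it today; the member names below merely mirror that internal interface, and everything else (class names, metric names, the sleep standing in for work) is made up.

import com.codahale.metrics.{MetricRegistry, Timer}

// Sketch of a user-defined metrics source. The sourceName/metricRegistry
// members mirror the internal org.apache.spark.metrics.source.Source trait,
// which is not public, so this class does not (and cannot) extend it.
class MyAppMetricsSource {
  val sourceName: String = "myApp"
  val metricRegistry: MetricRegistry = new MetricRegistry

  // A timer that application code can update around an expensive operation.
  val requestTimer: Timer = metricRegistry.timer(MetricRegistry.name("myApp", "requests"))
}

object MyAppMetricsSourceExample {
  def main(args: Array[String]): Unit = {
    val source = new MyAppMetricsSource
    // Time a block of work; metrics-scala would offer a nicer wrapper for this.
    val ctx = source.requestTimer.time()
    try {
      Thread.sleep(50) // stand-in for real work
    } finally {
      ctx.stop()
    }
    println(s"timed ${source.requestTimer.getCount} request(s)")
  }
}

If the Source interface were public, registering such a source with Spark's MetricsSystem would let the configured sinks pick up the timer automatically.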
Re: Developer API plugins for Hive/Hadoop?
Unfortunately it doesn't, because our version of Hive has different syntax elements and thus I need to patch them in (and a few other minor things). It would be great if there were a developer API at a somewhat higher level.

On Thu, Aug 13, 2015 at 2:19 PM, Reynold Xin r...@databricks.com wrote:

I believe for Hive there is already a client interface that can be used to build clients for different Hive metastores. That should also work for your heavily forked one. For Hadoop, it is definitely a bigger project to refactor. A good way to start evaluating this is to list what needs to be changed. Maybe you can start by telling us what you need to change for every upgrade? Feel free to email me in private if this is sensitive and you don't want to share it on a public list.

On Thu, Aug 13, 2015 at 2:01 PM, Thomas Dudziak tom...@gmail.com wrote:

Hi,

I have asked this before but didn't receive any comments, so with the impending release of 1.5 I wanted to bring it up again. Right now, Spark is very tightly coupled to OSS Hive/Hadoop, which causes me a lot of work every time there is a new version, because I don't run OSS Hive/Hadoop versions (and before you ask, I can't). My question is: does Spark need to be so tightly coupled to these two? Or, put differently, would it be possible to introduce a developer API between Spark (up to and including e.g. SqlContext) and Hadoop (for the HDFS bits) and Hive (e.g. HiveContext and beyond), and move the actual Hadoop/Hive dependencies into plugins (e.g. separate Maven modules)? This would allow me to easily maintain my own Hive/Hadoop-ish integration with our internal systems without ever having to touch Spark code. I expect this could also allow, for instance, Hadoop vendors to provide their own, more optimized implementations without Spark having to know about them.

cheers,
Tom
Developer API plugins for Hive/Hadoop?
Hi,

I have asked this before but didn't receive any comments, so with the impending release of 1.5 I wanted to bring it up again. Right now, Spark is very tightly coupled to OSS Hive/Hadoop, which causes me a lot of work every time there is a new version, because I don't run OSS Hive/Hadoop versions (and before you ask, I can't). My question is: does Spark need to be so tightly coupled to these two? Or, put differently, would it be possible to introduce a developer API between Spark (up to and including e.g. SqlContext) and Hadoop (for the HDFS bits) and Hive (e.g. HiveContext and beyond), and move the actual Hadoop/Hive dependencies into plugins (e.g. separate Maven modules)? This would allow me to easily maintain my own Hive/Hadoop-ish integration with our internal systems without ever having to touch Spark code. I expect this could also allow, for instance, Hadoop vendors to provide their own, more optimized implementations without Spark having to know about them.

cheers,
Tom
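To make the proposal concrete, the kind of seam being described might look roughly like the following. All trait, method, and type names here are hypothetical; nothing like this exists in Spark today.

// Hypothetical sketch of a plugin seam: SqlContext/HiveContext would talk to a
// catalog abstraction, and the concrete OSS-Hive-backed implementation would
// live in a separate Maven module that a fork could swap out.
trait MetastoreCatalogPlugin {
  def lookupTable(db: String, table: String): Option[CatalogTable]
  def createTable(table: CatalogTable): Unit
}

// Minimal table description used by the sketch above.
case class CatalogTable(db: String, name: String, schema: Seq[(String, String)])

// Similarly, the HDFS-facing bits could sit behind a small storage trait so a
// vendor-specific or internal filesystem layer can be plugged in.
trait StoragePlugin {
  def open(path: String): java.io.InputStream
  def listFiles(path: String): Seq[String]
}

The OSS Hive/Hadoop-backed implementations of such traits would then ship as separate modules, and a fork could provide its own implementations without patching Spark itself.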
Hive 0.12 support in 1.4.0?
So I'm a little confused - has Hive 0.12 support disappeared in 1.4.0? The release notes didn't mention anything, but the documentation no longer lists a way to build for 0.12 (http://spark.apache.org/docs/latest/building-spark.html#building-with-hive-and-jdbc-support - in fact it doesn't list anything other than 0.13), and I don't see any Maven profiles or code for 0.12.

Tom
Re: Remove Hadoop 1 support (Hadoop 2.2) for Spark 1.5?
The 0.23 (and Hive 0.12) code base in Spark works well from our perspective, so I'm not sure what you are referring to. As I said, I'm happy to maintain my own plugins, but as it stands there is no sane way to do so in Spark, because there is no clear separation/developer API for these.

cheers,
Tom

On Fri, Jun 12, 2015 at 11:21 AM, Sean Owen so...@cloudera.com wrote:

I don't imagine that can be guaranteed to be supported anyway... the 0.x branch has never necessarily worked with Spark, even if it might happen to. Is this really something you would veto for everyone because of your deployment?

On Fri, Jun 12, 2015 at 7:18 PM, Thomas Dudziak tom...@gmail.com wrote:

-1 to this; we use it with an old Hadoop version (well, a fork of an old version, 0.23). That being said, if there were a nice developer API that separates Spark from Hadoop (or rather, two APIs, one for scheduling and one for HDFS), then we'd be happy to maintain our own plugins for those.

cheers,
Tom
Re: Remove Hadoop 1 support (Hadoop 2.2) for Spark 1.5?
-1 to this; we use it with an old Hadoop version (well, a fork of an old version, 0.23). That being said, if there were a nice developer API that separates Spark from Hadoop (or rather, two APIs, one for scheduling and one for HDFS), then we'd be happy to maintain our own plugins for those.

cheers,
Tom

On Fri, Jun 12, 2015 at 9:42 AM, Sean Owen so...@cloudera.com wrote:

On Fri, Jun 12, 2015 at 5:12 PM, Patrick Wendell pwend...@gmail.com wrote:
> I would like to understand though, Sean - what is the proposal exactly? Hadoop 2 itself supports all of the Hadoop 1 APIs, so things like removing the Hadoop 1 variant of sc.hadoopFile, etc., I don't think

Not entirely; you can see some binary incompatibilities that have bitten recently. A Hadoop 1 program does not in general work on Hadoop 2 because of this.

Part of my thinking is that I'm not clear Hadoop 1.x, and 2.0.x, fully work anymore anyway. See for example SPARK-8057 recently. I recall similar problems with Hadoop 2.0.x-era releases and the Spark build for that, which is basically the 'cdh4' build. So one benefit is skipping whatever work would be needed to continue to fix this up, and the argument is that there may be less loss of functionality than it seems. The other is being able to use later APIs. This much is a little minor.

> The main reason I'd push back is that I do think there are still people running the older versions. For instance, at Databricks we use the FileSystem library for talking to S3... every time we've tried to upgrade to Hadoop 2.X there have been significant regressions in performance and we've had to downgrade. That's purely anecdotal, but I think you have people out there using the Hadoop 1 bindings for whom upgrade would be a pain.

Yeah, that's the question. Is anyone out there using 1.x? More anecdotes wanted. That might be the most interesting question. No CDH customers would have been for a long while now, for example. (Still a small number of CDH 4 customers out there, though, and that's 2.0.x or so, but that's a gray area.)

Is the S3 library thing really related to Hadoop 1.x? That comes from jets3t, and that's independent.

> In terms of our maintenance cost, to me the much bigger cost for us IMO is dealing with differences between e.g. 2.2, 2.4, and 2.6 where major new APIs were added. In comparison the Hadoop 1 vs 2 seems

Really? I'd say the opposite. No APIs that are only in 2.2, let alone only in a later version, can be in use now, right? 1.x wouldn't work at all then. I don't know of any binary incompatibilities of the type between 1.x and 2.x, which we have had to shim to make work. In both cases dependencies have to be harmonized here and there, yes. That won't change.
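For reference, the "Hadoop 1 variant of sc.hadoopFile" discussed above refers to the overloads built on the older org.apache.hadoop.mapred API, as opposed to newAPIHadoopFile, which uses org.apache.hadoop.mapreduce. A minimal sketch contrasting the two is below; the input paths, app name, and local master are placeholders.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat // "Hadoop 1"-style (mapred) input format
import org.apache.hadoop.mapreduce.lib.input.{TextInputFormat => NewTextInputFormat} // newer (mapreduce) input format

object HadoopApiExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hadoop-api-example").setMaster("local[*]"))

    // Older mapred-based overload: the kind of "Hadoop 1 variant" discussed above.
    val oldApi = sc.hadoopFile[LongWritable, Text, TextInputFormat]("/tmp/input.txt")

    // Newer mapreduce-based overload; the path here is just a placeholder.
    val newApi = sc.newAPIHadoopFile[LongWritable, Text, NewTextInputFormat]("/tmp/input.txt")

    // Both return RDD[(LongWritable, Text)]; only the underlying InputFormat API differs.
    println(oldApi.map(_._2.toString).count() + newApi.map(_._2.toString).count())
    sc.stop()
  }
}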