Fair scheduler pool leak

2018-04-05 Thread Matthias Boehm
Hi all,

for concurrent Spark jobs spawned from the driver, we use Spark's fair
scheduler pools, which are set and unset in a thread-local manner by
each worker thread. Typically (for rather long jobs), this works very
well. Unfortunately, in an application with lots of very short
parallel sections, we see 1000s of these pools remaining in the Spark
UI, which indicates some kind of leak. Each worker cleans up its local
property by setting it to null, but not all pools are properly
removed. I've checked and reproduced this behavior with Spark 2.1-2.3.
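
For context, a minimal sketch of the set/unset pattern described above;
SparkContext.setLocalProperty and the "spark.scheduler.pool" key are Spark's
public API for this, while the pool names, thread count, and job body here are
placeholders:

import org.apache.spark.sql.SparkSession

object FairPoolSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("fair-pool-sketch")
      .config("spark.scheduler.mode", "FAIR")
      .getOrCreate()
    val sc = spark.sparkContext

    val workers = (1 to 4).map { i =>
      new Thread(new Runnable {
        override def run(): Unit = {
          // set the pool thread-locally so this worker's jobs go to its own pool
          sc.setLocalProperty("spark.scheduler.pool", s"pool-$i")
          try {
            sc.parallelize(1 to 1000).map(_ * 2).count() // a very short parallel section
          } finally {
            // unset the thread-local property; the pool itself can still linger
            // in the UI, which is the leak described above
            sc.setLocalProperty("spark.scheduler.pool", null)
          }
        }
      })
    }
    workers.foreach(_.start())
    workers.foreach(_.join())
    spark.stop()
  }
}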

Now my question: Is there a way to explicitly remove these pools,
either globally, or locally while the thread is still alive?

Regards,
Matthias




Re: Welcome Zhenhua Wang as a Spark committer

2018-04-05 Thread Liang-Chi Hsieh
Congratulations, Zhenhua Wang!






Re: time for Apache Spark 3.0?

2018-04-05 Thread Marcelo Vanzin
On Thu, Apr 5, 2018 at 10:30 AM, Matei Zaharia  wrote:
> Sorry, but just to be clear here, this is the 2.12 API issue: 
> https://issues.apache.org/jira/browse/SPARK-14643, with more details in this 
> doc: 
> https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit.
>
> Basically, if we are allowed to change Spark’s API a little to have only one 
> version of methods that are currently overloaded between Java and Scala, we 
> can get away with a single source tree for all Scala versions and Java ABI 
> compatibility against any type of Spark (whether using Scala 2.11 or 2.12).

Fair enough. To play devil's advocate, most of those methods seem to
be marked "Experimental / Evolving", which could be used as a reason
to change them for this purpose in a minor release.

Not all of them are, though (e.g. foreach / foreachPartition are not
experimental).
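
For anyone following along, here is a self-contained sketch of the overload
shape in question, using toy types rather than Spark's actual classes:
Dataset.foreach in 2.x has both a Scala-function variant and a
Java-functional-interface variant.

// Stand-in for a Java functional interface like
// org.apache.spark.api.java.function.ForeachFunction[T]
trait JavaForeachFunction[T] { def call(t: T): Unit }

// Toy class with the same overload shape as Dataset.foreach in Spark 2.x
class ToyDataset[T](data: Seq[T]) {
  def foreach(f: T => Unit): Unit = data.foreach(f)                       // Scala-friendly variant
  def foreach(func: JavaForeachFunction[T]): Unit = data.foreach(t => func.call(t)) // Java-friendly variant
}

object OverloadSketch {
  def main(args: Array[String]): Unit = {
    val ds = new ToyDataset(Seq(1, 2, 3))

    val printIt: Int => Unit = i => println(i)
    ds.foreach(printIt)                      // unambiguous: already a Function1
    ds.foreach(new JavaForeachFunction[Int] {
      override def call(i: Int): Unit = println(i)  // explicit Java-style call site
    })

    // With 2.12's SAM conversion, a bare lambda such as `ds.foreach(i => println(i))`
    // can match either signature, which is why collapsing the two overloads into
    // one is being discussed as the API change.
  }
}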

-- 
Marcelo




Re: time for Apache Spark 3.0?

2018-04-05 Thread Steve Loughran


On 5 Apr 2018, at 18:04, Matei Zaharia wrote:

> Java 9/10 support would be great to add as well.

Be aware that the work moving Hadoop core to Java 9+ is still a big piece of 
work being undertaken by Akira Ajisaka & colleagues at NTT:

https://issues.apache.org/jira/browse/HADOOP-11123

Big dependency updates and handling Oracle hiding the sun.misc stuff which 
low-level code depends on are the trouble spots, with a move to Log4j 2 going to 
be observably traumatic for all apps which require a log4j.properties to set 
themselves up. As usual: any testing which can be done early will be welcomed 
by all; the earlier the better.

That stuff is all about getting things working: supporting the Java 9 packaging 
model, which is a really compelling reason to go for it.


> Regarding Scala 2.12, I thought that supporting it would become easier if we 
> change the Spark API and ABI slightly. Basically, it is of course possible to 
> create an alternate source tree today, but it might be possible to share the 
> same source files if we tweak some small things in the methods that are 
> overloaded across Scala and Java. I don’t remember the exact details, but the 
> idea was to reduce the total maintenance work needed at the cost of requiring 
> users to recompile their apps.

> I’m personally for moving to 3.0 because of the other things we can clean up as 
> well, e.g. the default SQL dialect, Iterable stuff, and possibly dependency 
> shading (a major pain point for lots of users)

Hadoop 3 does have a shaded client, though not enough for Spark; if work 
identifying & fixing the outstanding dependencies is started now, Hadoop 3.2 
should be able to offer the set of shaded libraries needed by Spark.

There's always a price to that, which is in redistributable size and its 
impact on start times, duplicate classes loaded (memory, reduced chance of JIT 
recompilation, ...), and the whole transitive-shading problem. Java 9 should be 
the real target for a clean solution to all of this.


Re: time for Apache Spark 3.0?

2018-04-05 Thread Matei Zaharia
Oh, forgot to add, but splitting the source tree in Scala also creates the 
issue of a big maintenance burden for third-party libraries built on Spark. As 
Josh said on the JIRA:

"I think this is primarily going to be an issue for end users who want to use 
an existing source tree to cross-compile for Scala 2.10, 2.11, and 2.12. Thus 
the pain of the source incompatibility would mostly be felt by library/package 
maintainers but it can be worked around as long as there's at least some common 
subset which is source compatible across all of those versions."

This means that all the data sources, ML algorithms, etc developed outside our 
source tree would have to do the same thing we do internally.

> On Apr 5, 2018, at 10:30 AM, Matei Zaharia  wrote:
> 
> Sorry, but just to be clear here, this is the 2.12 API issue: 
> https://issues.apache.org/jira/browse/SPARK-14643, with more details in this 
> doc: 
> https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit.
> 
> Basically, if we are allowed to change Spark’s API a little to have only one 
> version of methods that are currently overloaded between Java and Scala, we 
> can get away with a single source tree for all Scala versions and Java ABI 
> compatibility against any type of Spark (whether using Scala 2.11 or 2.12). 
> On the other hand, if we want to keep the API and ABI of the Spark 2.x 
> branch, we’ll need a different source tree for Scala 2.12 with different 
> copies of pretty large classes such as RDD, DataFrame and DStream, and Java 
> users may have to change their code when linking against different versions 
> of Spark.
> 
> This is of course only one of the possible ABI changes, but it is a 
> considerable engineering effort, so we’d have to sign up for maintaining all 
> these different source files. It seems kind of silly given that Scala 2.12 
> was released in 2016, so we’re doing all this work to keep ABI compatibility 
> for Scala 2.11, which isn’t even that widely used any more for new projects. 
> Also keep in mind that the next Spark release will probably take at least 3-4 
> months, so we’re talking about what people will be using in fall 2018.
> 
> Matei
> 
>> On Apr 5, 2018, at 10:13 AM, Marcelo Vanzin  wrote:
>> 
>> I remember seeing somewhere that Scala still has some issues with Java
>> 9/10 so that might be hard...
>> 
>> But on that topic, it might be better to shoot for Java 11
>> compatibility. 9 and 10, following the new release model, aren't
>> really meant to be long-term releases.
>> 
>> In general, agree with Sean here. Doesn't look like 2.12 support
>> requires unexpected API breakages. So unless there's a really good
>> reason to break / remove a bunch of existing APIs...
>> 
>> On Thu, Apr 5, 2018 at 9:04 AM, Marco Gaido  wrote:
>>> Hi all,
>>> 
>>> I also agree with Mark that we should add Java 9/10 support to an eventual
>>> Spark 3.0 release, because supporting Java 9 is not a trivial task since we
>>> are using some internal APIs for the memory management which changed: either
>>> we find a solution which works on both (but I am not sure it is feasible) or
>>> we have to switch between 2 implementations according to the Java version.
>>> So I'd rather avoid doing this in a non-major release.
>>> 
>>> Thanks,
>>> Marco
>>> 
>>> 
>>> 2018-04-05 17:35 GMT+02:00 Mark Hamstra :
 
 As with Sean, I'm not sure that this will require a new major version, but
 we should also be looking at Java 9 & 10 support -- particularly with 
 regard
 to their better functionality in a containerized environment (memory limits
 from cgroups, not sysconf; support for cpusets). In that regard, we should
 also be looking at using the latest Scala 2.11.x maintenance release in
 current Spark branches.
 
 On Thu, Apr 5, 2018 at 5:45 AM, Sean Owen  wrote:
> 
> On Wed, Apr 4, 2018 at 6:20 PM Reynold Xin  wrote:
>> 
>> The primary motivating factor IMO for a major version bump is to support
>> Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
>> Similar to Spark 2.0, I think there are also opportunities for other 
>> changes
>> that we know have been biting us for a long time but can’t be changed in
>> feature releases (to be clear, I’m actually not sure they are all good
>> ideas, but I’m writing them down as candidates for consideration):
> 
> 
> IIRC from looking at this, it is possible to support 2.11 and 2.12
> simultaneously. The cross-build already works now in 2.3.0. Barring some 
> big
> change needed to get 2.12 fully working -- and that may be the case -- it
> nearly works that way now.
> 
> Compiling vs 2.11 and 2.12 does however result in some APIs that differ
> in byte code. However Scala itself isn't mutually compatible between 2.11
> and 2.12 anyway; that's never been promised as compatible.

Re: time for Apache Spark 3.0?

2018-04-05 Thread Matei Zaharia
Sorry, but just to be clear here, this is the 2.12 API issue: 
https://issues.apache.org/jira/browse/SPARK-14643, with more details in this 
doc: 
https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit.

Basically, if we are allowed to change Spark’s API a little to have only one 
version of methods that are currently overloaded between Java and Scala, we can 
get away with a single source tree for all Scala versions and Java ABI 
compatibility against any type of Spark (whether using Scala 2.11 or 2.12). On 
the other hand, if we want to keep the API and ABI of the Spark 2.x branch, 
we’ll need a different source tree for Scala 2.12 with different copies of 
pretty large classes such as RDD, DataFrame and DStream, and Java users may 
have to change their code when linking against different versions of Spark.

This is of course only one of the possible ABI changes, but it is a 
considerable engineering effort, so we’d have to sign up for maintaining all 
these different source files. It seems kind of silly given that Scala 2.12 was 
released in 2016, so we’re doing all this work to keep ABI compatibility for 
Scala 2.11, which isn’t even that widely used any more for new projects. Also 
keep in mind that the next Spark release will probably take at least 3-4 
months, so we’re talking about what people will be using in fall 2018.

Matei

> On Apr 5, 2018, at 10:13 AM, Marcelo Vanzin  wrote:
> 
> I remember seeing somewhere that Scala still has some issues with Java
> 9/10 so that might be hard...
> 
> But on that topic, it might be better to shoot for Java 11
> compatibility. 9 and 10, following the new release model, aren't
> really meant to be long-term releases.
> 
> In general, agree with Sean here. Doesn't look like 2.12 support
> requires unexpected API breakages. So unless there's a really good
> reason to break / remove a bunch of existing APIs...
> 
> On Thu, Apr 5, 2018 at 9:04 AM, Marco Gaido  wrote:
>> Hi all,
>> 
>> I also agree with Mark that we should add Java 9/10 support to an eventual
>> Spark 3.0 release, because supporting Java 9 is not a trivial task since we
>> are using some internal APIs for the memory management which changed: either
>> we find a solution which works on both (but I am not sure it is feasible) or
>> we have to switch between 2 implementations according to the Java version.
>> So I'd rather avoid doing this in a non-major release.
>> 
>> Thanks,
>> Marco
>> 
>> 
>> 2018-04-05 17:35 GMT+02:00 Mark Hamstra :
>>> 
>>> As with Sean, I'm not sure that this will require a new major version, but
>>> we should also be looking at Java 9 & 10 support -- particularly with regard
>>> to their better functionality in a containerized environment (memory limits
>>> from cgroups, not sysconf; support for cpusets). In that regard, we should
>>> also be looking at using the latest Scala 2.11.x maintenance release in
>>> current Spark branches.
>>> 
>>> On Thu, Apr 5, 2018 at 5:45 AM, Sean Owen  wrote:
 
 On Wed, Apr 4, 2018 at 6:20 PM Reynold Xin  wrote:
> 
> The primary motivating factor IMO for a major version bump is to support
> Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
> Similar to Spark 2.0, I think there are also opportunities for other 
> changes
> that we know have been biting us for a long time but can’t be changed in
> feature releases (to be clear, I’m actually not sure they are all good
> ideas, but I’m writing them down as candidates for consideration):
 
 
 IIRC from looking at this, it is possible to support 2.11 and 2.12
 simultaneously. The cross-build already works now in 2.3.0. Barring some 
 big
 change needed to get 2.12 fully working -- and that may be the case -- it
 nearly works that way now.
 
 Compiling vs 2.11 and 2.12 does however result in some APIs that differ
 in byte code. However Scala itself isn't mutually compatible between 2.11
 and 2.12 anyway; that's never been promised as compatible.
 
 (Interesting question about what *Java* users should expect; they would
 see a difference in 2.11 vs 2.12 Spark APIs, but that has always been 
 true.)
 
 I don't disagree with shooting for Spark 3.0, just saying I don't know if
 2.12 support requires moving to 3.0. But, Spark 3.0 could consider dropping
 2.11 support if needed to make supporting 2.12 less painful.
>>> 
>>> 
>> 
> 
> 
> 
> -- 
> Marcelo
> 
> 





subscribe

2018-04-05 Thread Chao Sun



Re: time for Apache Spark 3.0?

2018-04-05 Thread Marcelo Vanzin
I remember seeing somewhere that Scala still has some issues with Java
9/10 so that might be hard...

But on that topic, it might be better to shoot for Java 11
compatibility. 9 and 10, following the new release model, aren't
really meant to be long-term releases.

In general, agree with Sean here. Doesn't look like 2.12 support
requires unexpected API breakages. So unless there's a really good
reason to break / remove a bunch of existing APIs...

On Thu, Apr 5, 2018 at 9:04 AM, Marco Gaido  wrote:
> Hi all,
>
> I also agree with Mark that we should add Java 9/10 support to an eventual
> Spark 3.0 release, because supporting Java 9 is not a trivial task since we
> are using some internal APIs for the memory management which changed: either
> we find a solution which works on both (but I am not sure it is feasible) or
> we have to switch between 2 implementations according to the Java version.
> So I'd rather avoid doing this in a non-major release.
>
> Thanks,
> Marco
>
>
> 2018-04-05 17:35 GMT+02:00 Mark Hamstra :
>>
>> As with Sean, I'm not sure that this will require a new major version, but
>> we should also be looking at Java 9 & 10 support -- particularly with regard
>> to their better functionality in a containerized environment (memory limits
>> from cgroups, not sysconf; support for cpusets). In that regard, we should
>> also be looking at using the latest Scala 2.11.x maintenance release in
>> current Spark branches.
>>
>> On Thu, Apr 5, 2018 at 5:45 AM, Sean Owen  wrote:
>>>
>>> On Wed, Apr 4, 2018 at 6:20 PM Reynold Xin  wrote:

 The primary motivating factor IMO for a major version bump is to support
 Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
 Similar to Spark 2.0, I think there are also opportunities for other 
 changes
 that we know have been biting us for a long time but can’t be changed in
 feature releases (to be clear, I’m actually not sure they are all good
 ideas, but I’m writing them down as candidates for consideration):
>>>
>>>
>>> IIRC from looking at this, it is possible to support 2.11 and 2.12
>>> simultaneously. The cross-build already works now in 2.3.0. Barring some big
>>> change needed to get 2.12 fully working -- and that may be the case -- it
>>> nearly works that way now.
>>>
>>> Compiling vs 2.11 and 2.12 does however result in some APIs that differ
>>> in byte code. However Scala itself isn't mutually compatible between 2.11
>>> and 2.12 anyway; that's never been promised as compatible.
>>>
>>> (Interesting question about what *Java* users should expect; they would
>>> see a difference in 2.11 vs 2.12 Spark APIs, but that has always been true.)
>>>
>>> I don't disagree with shooting for Spark 3.0, just saying I don't know if
>>> 2.12 support requires moving to 3.0. But, Spark 3.0 could consider dropping
>>> 2.11 support if needed to make supporting 2.12 less painful.
>>
>>
>



-- 
Marcelo




Re: time for Apache Spark 3.0?

2018-04-05 Thread Matei Zaharia
Java 9/10 support would be great to add as well.

Regarding Scala 2.12, I thought that supporting it would become easier if we 
change the Spark API and ABI slightly. Basically, it is of course possible to 
create an alternate source tree today, but it might be possible to share the 
same source files if we tweak some small things in the methods that are 
overloaded across Scala and Java. I don’t remember the exact details, but the 
idea was to reduce the total maintenance work needed at the cost of requiring 
users to recompile their apps.

I’m personally for moving to 3.0 because of the other things we can clean up as 
well, e.g. the default SQL dialect, Iterable stuff, and possibly dependency 
shading (a major pain point for lots of users). It’s also a chance to highlight 
Kubernetes, continuous processing and other features more if they become "GA".

Matei

> On Apr 5, 2018, at 9:04 AM, Marco Gaido  wrote:
> 
> Hi all,
> 
> I also agree with Mark that we should add Java 9/10 support to an eventual 
> Spark 3.0 release, because supporting Java 9 is not a trivial task since we 
> are using some internal APIs for the memory management which changed: either 
> we find a solution which works on both (but I am not sure it is feasible) or 
> we have to switch between 2 implementations according to the Java version.
> So I'd rather avoid doing this in a non-major release.
> 
> Thanks,
> Marco
> 
> 
> 2018-04-05 17:35 GMT+02:00 Mark Hamstra :
> As with Sean, I'm not sure that this will require a new major version, but we 
> should also be looking at Java 9 & 10 support -- particularly with regard to 
> their better functionality in a containerized environment (memory limits from 
> cgroups, not sysconf; support for cpusets). In that regard, we should also be 
> looking at using the latest Scala 2.11.x maintenance release in current Spark 
> branches.
> 
> On Thu, Apr 5, 2018 at 5:45 AM, Sean Owen  wrote:
> On Wed, Apr 4, 2018 at 6:20 PM Reynold Xin  wrote:
> The primary motivating factor IMO for a major version bump is to support 
> Scala 2.12, which requires minor API breaking changes to Spark’s APIs. 
> Similar to Spark 2.0, I think there are also opportunities for other changes 
> that we know have been biting us for a long time but can’t be changed in 
> feature releases (to be clear, I’m actually not sure they are all good ideas, 
> but I’m writing them down as candidates for consideration):
> 
> IIRC from looking at this, it is possible to support 2.11 and 2.12 
> simultaneously. The cross-build already works now in 2.3.0. Barring some big 
> change needed to get 2.12 fully working -- and that may be the case -- it 
> nearly works that way now.
> 
> Compiling vs 2.11 and 2.12 does however result in some APIs that differ in 
> byte code. However Scala itself isn't mutually compatible between 2.11 and 
> 2.12 anyway; that's never been promised as compatible.
> 
> (Interesting question about what *Java* users should expect; they would see a 
> difference in 2.11 vs 2.12 Spark APIs, but that has always been true.)
> 
> I don't disagree with shooting for Spark 3.0, just saying I don't know if 
> 2.12 support requires moving to 3.0. But, Spark 3.0 could consider dropping 
> 2.11 support if needed to make supporting 2.12 less painful.
> 
> 





Re: time for Apache Spark 3.0?

2018-04-05 Thread Marco Gaido
Hi all,

I also agree with Mark that we should add Java 9/10 support to an eventual
Spark 3.0 release, because supporting Java 9 is not a trivial task since we
are using some internal APIs for the memory management which changed:
either we find a solution which works on both (but I am not sure it is
feasible) or we have to switch between 2 implementations according to the
Java version.
So I'd rather avoid doing this in a non-major release.
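
To make the concern concrete, a purely illustrative sketch (the trait and class
names are made up; this is not Spark's code) of what switching between two
implementations by Java version would look like:

trait MemoryAccess {
  def allocate(bytes: Long): Unit
}

object MemoryAccess {
  // "java.specification.version" is "1.8" on Java 8 and "9", "10", "11", ... on newer JVMs
  private def majorJavaVersion: Int = {
    val spec = System.getProperty("java.specification.version")
    if (spec.startsWith("1.")) spec.stripPrefix("1.").toInt
    else spec.takeWhile(_.isDigit).toInt
  }

  def forCurrentJvm(): MemoryAccess =
    if (majorJavaVersion >= 9) new Java9PlusMemoryAccess  // avoids the changed internal APIs
    else new LegacyUnsafeMemoryAccess                     // the existing pre-Java-9 code path

  private class LegacyUnsafeMemoryAccess extends MemoryAccess {
    override def allocate(bytes: Long): Unit =
      println(s"allocate $bytes bytes via the legacy (sun.misc-based) path")
  }

  private class Java9PlusMemoryAccess extends MemoryAccess {
    override def allocate(bytes: Long): Unit =
      println(s"allocate $bytes bytes via a Java 9+ safe path")
  }
}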

Thanks,
Marco


2018-04-05 17:35 GMT+02:00 Mark Hamstra :

> As with Sean, I'm not sure that this will require a new major version, but
> we should also be looking at Java 9 & 10 support -- particularly with
> regard to their better functionality in a containerized environment (memory
> limits from cgroups, not sysconf; support for cpusets). In that regard, we
> should also be looking at using the latest Scala 2.11.x maintenance release
> in current Spark branches.
>
> On Thu, Apr 5, 2018 at 5:45 AM, Sean Owen  wrote:
>
>> On Wed, Apr 4, 2018 at 6:20 PM Reynold Xin  wrote:
>>
>>> The primary motivating factor IMO for a major version bump is to support
>>> Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
>>> Similar to Spark 2.0, I think there are also opportunities for other
>>> changes that we know have been biting us for a long time but can’t be
>>> changed in feature releases (to be clear, I’m actually not sure they are
>>> all good ideas, but I’m writing them down as candidates for consideration):
>>>
>>
>> IIRC from looking at this, it is possible to support 2.11 and 2.12
>> simultaneously. The cross-build already works now in 2.3.0. Barring some
>> big change needed to get 2.12 fully working -- and that may be the case --
>> it nearly works that way now.
>>
>> Compiling vs 2.11 and 2.12 does however result in some APIs that differ
>> in byte code. However Scala itself isn't mutually compatible between 2.11
>> and 2.12 anyway; that's never been promised as compatible.
>>
>> (Interesting question about what *Java* users should expect; they would
>> see a difference in 2.11 vs 2.12 Spark APIs, but that has always been true.)
>>
>> I don't disagree with shooting for Spark 3.0, just saying I don't know if
>> 2.12 support requires moving to 3.0. But, Spark 3.0 could consider dropping
>> 2.11 support if needed to make supporting 2.12 less painful.
>>
>
>


Re: time for Apache Spark 3.0?

2018-04-05 Thread Mark Hamstra
As with Sean, I'm not sure that this will require a new major version, but
we should also be looking at Java 9 & 10 support -- particularly with
regard to their better functionality in a containerized environment (memory
limits from cgroups, not sysconf; support for cpusets). In that regard, we
should also be looking at using the latest Scala 2.11.x maintenance release
in current Spark branches.
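
As a small, purely illustrative check of the gap being described (the path
assumes cgroup v1 and is not what the JVM itself reads; this is not Spark code):

import java.nio.file.{Files, Paths}

object ContainerMemoryCheck {
  def main(args: Array[String]): Unit = {
    val jvmMax = Runtime.getRuntime.maxMemory()   // what the JVM thinks it can use
    val cgroupLimitPath = Paths.get("/sys/fs/cgroup/memory/memory.limit_in_bytes")
    val cgroupLimit =
      if (Files.exists(cgroupLimitPath))
        Some(new String(Files.readAllBytes(cgroupLimitPath)).trim.toLong)
      else None                                    // not running under cgroup v1

    println(s"JVM max heap: $jvmMax bytes")
    println(s"cgroup limit: ${cgroupLimit.getOrElse("n/a")}")
    // If the JVM max heap exceeds the cgroup limit, the container can be OOM-killed
    // even though the JVM believes it has headroom; that is the gap container-aware
    // JVMs close.
  }
}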

On Thu, Apr 5, 2018 at 5:45 AM, Sean Owen  wrote:

> On Wed, Apr 4, 2018 at 6:20 PM Reynold Xin  wrote:
>
>> The primary motivating factor IMO for a major version bump is to support
>> Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
>> Similar to Spark 2.0, I think there are also opportunities for other
>> changes that we know have been biting us for a long time but can’t be
>> changed in feature releases (to be clear, I’m actually not sure they are
>> all good ideas, but I’m writing them down as candidates for consideration):
>>
>
> IIRC from looking at this, it is possible to support 2.11 and 2.12
> simultaneously. The cross-build already works now in 2.3.0. Barring some
> big change needed to get 2.12 fully working -- and that may be the case --
> it nearly works that way now.
>
> Compiling vs 2.11 and 2.12 does however result in some APIs that differ in
> byte code. However Scala itself isn't mutually compatible between 2.11 and
> 2.12 anyway; that's never been promised as compatible.
>
> (Interesting question about what *Java* users should expect; they would
> see a difference in 2.11 vs 2.12 Spark APIs, but that has always been true.)
>
> I don't disagree with shooting for Spark 3.0, just saying I don't know if
> 2.12 support requires moving to 3.0. But, Spark 3.0 could consider dropping
> 2.11 support if needed to make supporting 2.12 less painful.
>


Re: time for Apache Spark 3.0?

2018-04-05 Thread Sean Owen
On Wed, Apr 4, 2018 at 6:20 PM Reynold Xin  wrote:

> The primary motivating factor IMO for a major version bump is to support
> Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
> Similar to Spark 2.0, I think there are also opportunities for other
> changes that we know have been biting us for a long time but can’t be
> changed in feature releases (to be clear, I’m actually not sure they are
> all good ideas, but I’m writing them down as candidates for consideration):
>

IIRC from looking at this, it is possible to support 2.11 and 2.12
simultaneously. The cross-build already works now in 2.3.0. Barring some
big change needed to get 2.12 fully working -- and that may be the case --
it nearly works that way now.

Compiling vs 2.11 and 2.12 does however result in some APIs that differ in
byte code. However Scala itself isn't mutually compatible between 2.11 and
2.12 anyway; that's never been promised as compatible.

(Interesting question about what *Java* users should expect; they would see
a difference in 2.11 vs 2.12 Spark APIs, but that has always been true.)

I don't disagree with shooting for Spark 3.0, just saying I don't know if
2.12 support requires moving to 3.0. But, Spark 3.0 could consider dropping
2.11 support if needed to make supporting 2.12 less painful.


Re: Best way to Hive to Spark migration

2018-04-05 Thread Jörn Franke
And the usual hint when migrating: do not only migrate but also optimize the 
ETL process design; this brings the most benefits.

> On 5. Apr 2018, at 08:18, Jörn Franke  wrote:
> 
> Ok this is not much detail, but you are probably best off if you migrate them 
> to SparkSQL.
> 
> Depends also on the Hive version and Spark version. If you have a recent one 
> with Tez+LLAP I would not expect so much difference. It can also be less 
> performant - Spark SQL only recently got some features such as a cost-based 
> optimizer.
> 
>> On 5. Apr 2018, at 08:02, Pralabh Kumar  wrote:
>> 
>> Hi 
>> 
>> I have a lot of ETL jobs (complex ones); since they are SLA critical, I am 
>> planning to migrate them to Spark.
>> 
>>> On Thu, Apr 5, 2018 at 10:46 AM, Jörn Franke  wrote:
>>> You need to provide more context on what you do currently in Hive and what 
>>> do you expect from the migration.
>>> 
 On 5. Apr 2018, at 05:43, Pralabh Kumar  wrote:
 
 Hi Spark group
 
 What's the best way to Migrate Hive to Spark
 
 1) Use HiveContext of Spark
 2) Use Hive on Spark 
 (https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started)
 3) Migrate Hive to Calcite to Spark SQL
 
 
 Regards
 
>> 


Re: Best way to Hive to Spark migration

2018-04-05 Thread Jörn Franke
Ok this is not much detail, but you are probably best off if you migrate them 
to SparkSQL.

Depends also on the Hive version and Spark version. If you have a recent one 
with Tez+LLAP I would not expect so much difference. It can also be less 
performant - Spark SQL only recently got some features such as a cost-based 
optimizer.
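
As a minimal sketch of that Spark SQL route (the table names and the query are
placeholders; enableHiveSupport() needs a reachable Hive metastore):

import org.apache.spark.sql.SparkSession

object HiveQueryOnSpark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-to-spark-sketch")
      .enableHiveSupport()   // read the existing Hive metastore / tables
      .getOrCreate()

    // An existing HiveQL statement can often be run as-is via spark.sql(...)
    val result = spark.sql(
      "SELECT customer_id, SUM(amount) AS total FROM sales GROUP BY customer_id")
    result.write.mode("overwrite").saveAsTable("sales_totals")

    spark.stop()
  }
}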

> On 5. Apr 2018, at 08:02, Pralabh Kumar  wrote:
> 
> Hi 
> 
> I have a lot of ETL jobs (complex ones); since they are SLA critical, I am 
> planning to migrate them to Spark.
> 
>> On Thu, Apr 5, 2018 at 10:46 AM, Jörn Franke  wrote:
>> You need to provide more context on what you do currently in Hive and what 
>> do you expect from the migration.
>> 
>>> On 5. Apr 2018, at 05:43, Pralabh Kumar  wrote:
>>> 
>>> Hi Spark group
>>> 
>>> What's the best way to Migrate Hive to Spark
>>> 
>>> 1) Use HiveContext of Spark
>>> 2) Use Hive on Spark 
>>> (https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started)
>>> 3) Migrate Hive to Calcite to Spark SQL
>>> 
>>> 
>>> Regards
>>> 
> 


Re: Best way to Hive to Spark migration

2018-04-05 Thread Pralabh Kumar
Hi

I have a lot of ETL jobs (complex ones); since they are SLA critical, I am
planning to migrate them to Spark.

On Thu, Apr 5, 2018 at 10:46 AM, Jörn Franke  wrote:

> You need to provide more context on what you do currently in Hive and what
> do you expect from the migration.
>
> On 5. Apr 2018, at 05:43, Pralabh Kumar  wrote:
>
> Hi Spark group
>
> What's the best way to Migrate Hive to Spark
>
> 1) Use HiveContext of Spark
> 2) Use Hive on Spark (https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started)
> 3) Migrate Hive to Calcite to Spark SQL
>
>
> Regards
>
>