Re: [DISCUSS] Spark Columnar Processing

2019-04-05 Thread Bobby Evans
I just filed SPARK-27396 as the SPIP for this proposal.  Please use that
JIRA for further discussions.

Thanks for all of the feedback,

Bobby

On Wed, Apr 3, 2019 at 7:15 PM Bobby Evans  wrote:

> I am still working on the SPIP and should get it up in the next few days.
> I have the basic text more or less ready, but I want to get a high-level
> API concept ready too, just to have something more concrete.  I have not
> really done much with contributing new features to Spark, so I am not sure
> where a design document fits in here; neither
> http://spark.apache.org/improvement-proposals.html nor
> http://spark.apache.org/contributing.html mentions a design document
> anywhere.  I am happy to put one up, but I was hoping the API concept would
> cover most of that.
>
> Thanks,
>
> Bobby
>
> On Tue, Apr 2, 2019 at 9:16 PM Renjie Liu  wrote:
>
>> Hi, Bobby:
>> Do you have a design doc? I'm also interested in this topic and want to
>> help contribute.
>>
>> On Tue, Apr 2, 2019 at 10:00 PM Bobby Evans  wrote:
>>
>>> Thanks to everyone for the feedback.
>>>
>>> Overall, the feedback on exposing columnar processing as an option to
>>> users has been really positive.  I'll write up a SPIP on the proposed
>>> changes to support columnar processing (not necessarily to implement it)
>>> and then ping the list again for more feedback and discussion.
>>>
>>> Thanks again,
>>>
>>> Bobby
>>>
>>> On Mon, Apr 1, 2019 at 5:09 PM Reynold Xin  wrote:
>>>
 I just realized I didn't make my stance very clear here ... here's
 another try:

 I think it's a no-brainer to have a good columnar UDF interface. This
 would facilitate a lot of high-performance applications, e.g. GPU-based
 acceleration for machine learning algorithms.

 On rewriting the entire internals of Spark SQL to leverage columnar
 processing, I don't see enough evidence to suggest that's a good idea yet.
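
For concreteness, a columnar UDF contract might look something like the
hypothetical Scala trait below. Only ColumnarBatch and ColumnVector are
existing Spark classes; the trait itself is made up purely for illustration
and is not an existing or proposed API.

import org.apache.spark.sql.vectorized.{ColumnarBatch, ColumnVector}

// Hypothetical sketch: the UDF sees a whole batch of column values at once,
// so a vectorized CPU kernel or a GPU kernel can be plugged in behind eval().
trait ColumnarUdf {
  def eval(batch: ColumnarBatch): ColumnVector
}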




 On Wed, Mar 27, 2019 at 8:10 AM, Bobby Evans  wrote:

> Kazuaki Ishizaki,
>
> Yes, ColumnarBatchScan does provide a framework for doing code generation
> for the processing of columnar data.  I have to admit that I don't have a
> deep understanding of the code generation piece, so if I get something
> wrong please correct me.  From what I have seen, only input formats
> currently inherit from ColumnarBatchScan, and from the comments in the
> trait:
>
>   /**
>    * Generate [[ColumnVector]] expressions for our parent to consume as rows.
>    * This is called once per [[ColumnarBatch]].
>    */
>
> https://github.com/apache/spark/blob/956b52b1670985a67e49b938ac1499ae65c79f6e/sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarBatchScan.scala#L42-L43
>
> It appears that ColumnarBatchScan is really only intended to pull the data
> out of the batch, and not to process that data in a columnar fashion; in
> other words, it covers the loading stage that you mentioned.
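
To make that concrete, the following is a hand-written Scala sketch of what
"pulling the data out of the batch" amounts to. It is an illustration of the
loading stage, not the actual generated code, and it assumes a batch with a
single nullable INT column.

import org.apache.spark.sql.vectorized.ColumnarBatch

object ColumnarBatchConsumeExample {
  // Walk the batch one row at a time so row-based parent operators can
  // consume the values; here we simply sum the first column.
  def consumeBatchAsRows(batch: ColumnarBatch): Long = {
    val col = batch.column(0)
    var sum = 0L
    var i = 0
    while (i < batch.numRows()) {
      if (!col.isNullAt(i)) {
        sum += col.getInt(i)   // per-row, per-column accessor call
      }
      i += 1
    }
    sum
  }
}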
>
> > The SIMDzation or GPUization capability depends on a compiler that
> > translates native code from the code generated by the whole-stage codegen.
>
> To be able to support vectorized processing, Hive stayed with pure Java and
> let the JVM detect and do the SIMDzation of the code.  To make that happen
> they created loops that go through each element in a column and removed all
> conditionals from the body of the loops.  To the best of my knowledge that
> would still require a separate code path, like the one I am proposing, so
> that the different processing phases generate code that the JVM can compile
> down to SIMD instructions.  The generated code is full of null checks for
> each element, which would prevent the operations we want.  Also, the
> intermediate results are often stored in UnsafeRow instances.  This is
> really fast for row-based processing, but the complexity of how they work
> would, I believe, prevent the JVM from being able to vectorize the
> processing.  If you have a better way to take Java code and vectorize it,
> we should put it into OpenJDK instead of Spark so everyone can benefit
> from it.
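
To make the loop shapes concrete, here is a rough Scala illustration; the
array-based column representation and the method names are made up for the
example and are not Spark APIs.

object VectorizationShapes {
  // Per-element null checks inside the loop body tend to defeat the JVM's
  // auto-vectorizer.
  def addWithNullChecks(a: Array[Int], b: Array[Int],
                        aNull: Array[Boolean], bNull: Array[Boolean],
                        out: Array[Int], outNull: Array[Boolean]): Unit = {
    var i = 0
    while (i < a.length) {
      if (aNull(i) || bNull(i)) outNull(i) = true
      else out(i) = a(i) + b(i)
      i += 1
    }
  }

  // Hive-style shape: compute unconditionally over the whole column and track
  // validity in a separate pass, leaving tight conditional-free loops that
  // the JIT has a much better chance of compiling down to SIMD instructions.
  def addVectorizable(a: Array[Int], b: Array[Int],
                      aNull: Array[Boolean], bNull: Array[Boolean],
                      out: Array[Int], outNull: Array[Boolean]): Unit = {
    var i = 0
    while (i < a.length) {
      out(i) = a(i) + b(i)
      i += 1
    }
    i = 0
    while (i < a.length) {
      outNull(i) = aNull(i) | bNull(i)   // non-short-circuit OR, no branch
      i += 1
    }
  }
}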
>
> Trying to compile directly from generated Java code to something a GPU
> can process is something we are tackling, but we decided to go a different
> route from what you proposed.  From talking with several compiler experts
> here at NVIDIA, my understanding is that IBM, in partnership with NVIDIA,
> attempted in the past to extend the JVM to run at least partially on GPUs,
> but it was really difficult to get right, especially with how Java does
> memory management and memory layout.
>
> To avoid that complexity we decided to split the JITing up into two
> separate pieces.  I didn't mention any of this before because this
> discussion was intended to just be around the memory layout support, and
> not GPU processing.  The 

Re: Spark 2.4.0 tests fail with hadoop-3.1 profile: NoClassDefFoundError org.apache.hadoop.hive.conf.HiveConf

2019-04-05 Thread Sean Owen
Hadoop 3 isn't supported yet, not quite even in master. I think the
profile there exists for testing at the moment.
Others may know a way that it can work, but I don't think it would out of the box.

On Fri, Apr 5, 2019 at 12:53 PM akirillov  wrote:
>
> Hi there! I'm trying to run Spark unit tests with the following profiles:
>
> And 'core' module fails with the following test failing with
> NoClassDefFoundError:
>
> In the meantime building a distribution works fine when running:
>
> Also, there are no problems with running tests using Hadoop 2.7 profile.
> Does this issue look familiar? Any help appreciated!
>
>
>




Re: Spark 2.4.0 tests fail with hadoop-3.1 profile: NoClassDefFoundError org.apache.hadoop.hive.conf.HiveConf

2019-04-05 Thread Marcelo Vanzin
You can always try. But Hadoop 3 is not yet supported by Spark.

On Fri, Apr 5, 2019 at 11:13 AM Anton Kirillov
 wrote:
>
> Marcelo, Sean, thanks for the clarification. So in order to support Hadoop 3+ 
> the preferred way would be to use Hadoop-free builds and provide Hadoop 
> dependencies in the classpath, is that correct?
>
> On Fri, Apr 5, 2019 at 10:57 AM Marcelo Vanzin  wrote:
>>
>> The hadoop-3 profile doesn't really work yet, not even on master.
>> That's being worked on still.
>>
>> On Fri, Apr 5, 2019 at 10:53 AM akirillov  
>> wrote:
>> >
>> > Hi there! I'm trying to run Spark unit tests with the following profiles:
>> >
>> > And 'core' module fails with the following test failing with
>> > NoClassDefFoundError:
>> >
>> > In the meantime building a distribution works fine when running:
>> >
>> > Also, there are no problems with running tests using Hadoop 2.7 profile.
>> > Does this issue look familiar? Any help appreciated!
>> >
>> >
>> >
>>
>>
>> --
>> Marcelo



-- 
Marcelo




Re: Spark 2.4.0 tests fail with hadoop-3.1 profile: NoClassDefFoundError org.apache.hadoop.hive.conf.HiveConf

2019-04-05 Thread Sean Owen
Yes, you can try it, though I doubt it will work 100%. Have a look
at the "hadoop 3" JIRAs and PRs still in progress on master.

On Fri, Apr 5, 2019 at 1:14 PM Anton Kirillov
 wrote:
>
> Marcelo, Sean, thanks for the clarification. So in order to support Hadoop 3+ 
> the preferred way would be to use Hadoop-free builds and provide Hadoop 
> dependencies in the classpath, is that correct?
>




Re: Spark 2.4.0 tests fail with hadoop-3.1 profile: NoClassDefFoundError org.apache.hadoop.hive.conf.HiveConf

2019-04-05 Thread Anton Kirillov
Marcelo, Sean, thanks for the clarification. So in order to support Hadoop
3+ the preferred way would be to use Hadoop-free builds and provide Hadoop
dependencies in the classpath, is that correct?

On Fri, Apr 5, 2019 at 10:57 AM Marcelo Vanzin  wrote:

> The hadoop-3 profile doesn't really work yet, not even on master.
> That's being worked on still.
>
> On Fri, Apr 5, 2019 at 10:53 AM akirillov 
> wrote:
> >
> > Hi there! I'm trying to run Spark unit tests with the following profiles:
> >
> > And 'core' module fails with the following test failing with
> > NoClassDefFoundError:
> >
> > In the meantime building a distribution works fine when running:
> >
> > Also, there are no problems with running tests using Hadoop 2.7 profile.
> > Does this issue look familiar? Any help appreciated!
> >
> >
> >
>
>
> --
> Marcelo
>


Re: Spark 2.4.0 tests fail with hadoop-3.1 profile: NoClassDefFoundError org.apache.hadoop.hive.conf.HiveConf

2019-04-05 Thread Marcelo Vanzin
The hadoop-3 profile doesn't really work yet, not even on master.
That's still being worked on.

On Fri, Apr 5, 2019 at 10:53 AM akirillov  wrote:
>
> Hi there! I'm trying to run Spark unit tests with the following profiles:
>
> And 'core' module fails with the following test failing with
> NoClassDefFoundError:
>
> In the meantime building a distribution works fine when running:
>
> Also, there are no problems with running tests using Hadoop 2.7 profile.
> Does this issue look familiar? Any help appreciated!
>
>
>


-- 
Marcelo




Re: Spark 2.4.0 tests fail with hadoop-3.1 profile: NoClassDefFoundError org.apache.hadoop.hive.conf.HiveConf

2019-04-05 Thread Anton Kirillov
Really sorry for the formatting. Here's the original message:

Hi there! I'm trying to run Spark unit tests with the following profiles:

./build/mvn test -Pmesos "-Phadoop-3.1" -Pnetlib-lgpl -Psparkr -Phive -Phive-thriftserver

And 'core' module fails with the following test failing with
NoClassDefFoundError:

HadoopDelegationTokenManagerSuite:
- Correctly load default credential providers
- disable hive credential provider
- using deprecated configurations
- verify no credentials are obtained
*** RUN ABORTED ***
  java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.hive.conf.HiveConf
  at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:250)
  at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:173)
  at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166)
  at org.apache.spark.deploy.security.HiveDelegationTokenProvider$$anonfun$obtainDelegationTokens$2.apply$mcV$sp(HiveDelegationTokenProvider.scala:114)
  at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340)
  at org.apache.spark.deploy.security.HiveDelegationTokenProvider.obtainDelegationTokens(HiveDelegationTokenProvider.scala:113)
  at org.apache.spark.deploy.security.HadoopDelegationTokenManagerSuite$$anonfun$5.apply(HadoopDelegationTokenManagerSuite.scala:98)
  at org.apache.spark.deploy.security.HadoopDelegationTokenManagerSuite$$anonfun$5.apply(HadoopDelegationTokenManagerSuite.scala:90)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)

In the meantime building a distribution works fine when running:

./dev/make-distribution.sh --tgz -Pmesos "-Phadoop-3.1" -Pnetlib-lgpl -Psparkr -Phive -Phive-thriftserver -DskipTests

Also, there are no problems with running tests using Hadoop 2.7 profile.
Does this issue look familiar? Any help appreciated!

On Fri, Apr 5, 2019 at 10:53 AM akirillov 
wrote:

> Hi there! I'm trying to run Spark unit tests with the following profiles:
>
> And 'core' module fails with the following test failing with
> NoClassDefFoundError:
>
> In the meantime building a distribution works fine when running:
>
> Also, there are no problems with running tests using Hadoop 2.7 profile.
> Does this issue look familiar? Any help appreciated!
>
>
>
>


Spark 2.4.0 tests fail with hadoop-3.1 profile: NoClassDefFoundError org.apache.hadoop.hive.conf.HiveConf

2019-04-05 Thread akirillov
Hi there! I'm trying to run Spark unit tests with the following profiles: 

And 'core' module fails with the following test failing with
NoClassDefFoundError: 

In the meantime building a distribution works fine when running: 

Also, there are no problems with running tests using Hadoop 2.7 profile.
Does this issue look familiar? Any help appreciated!






Re: [ANNOUNCE] Announcing Apache Spark 2.4.1

2019-04-05 Thread Andrew Melo
On Fri, Apr 5, 2019 at 9:41 AM Jungtaek Lim  wrote:
>
> Thanks Andrew for reporting this. I just submitted the fix. 
> https://github.com/apache/spark/pull/24304

Thanks!

>
> On Fri, Apr 5, 2019 at 3:21 PM Andrew Melo  wrote:
>>
>> Hello,
>>
>> I'm not sure if this is the proper place to report it, but the 2.4.1
>> version of the config docs apparently didn't render right into HTML
>> (scroll down to "Compression and Serialization")
>>
>> https://spark.apache.org/docs/2.4.1/configuration.html#available-properties
>>
>> By comparison, the 2.4.0 version of the docs renders correctly.
>>
>> Cheers
>> Andrew
>>
>> On Fri, Apr 5, 2019 at 7:59 AM DB Tsai  wrote:
>> >
>> > +user list
>> >
>> > We are happy to announce the availability of Spark 2.4.1!
>> >
>> > Apache Spark 2.4.1 is a maintenance release, based on the branch-2.4
>> > maintenance branch of Spark. We strongly recommend all 2.4.0 users to
>> > upgrade to this stable release.
>> >
>> > In Apache Spark 2.4.1, Scala 2.12 support is GA, and it's no longer
>> > experimental.
>> > We will drop Scala 2.11 support in Spark 3.0, so please provide us 
>> > feedback.
>> >
>> > To download Spark 2.4.1, head over to the download page:
>> > http://spark.apache.org/downloads.html
>> >
>> > To view the release notes:
>> > https://spark.apache.org/releases/spark-release-2-4-1.html
>> >
>> > One more thing: to add a little color to this release, it's the
>> > largest RC ever (RC9)!
>> > We tried to incorporate many critical fixes at the last minute, and
>> > hope you all enjoy it.
>> >
>> > We would like to acknowledge all community members for contributing to
>> > this release. This release would not have been possible without you.
>> >
>>
>
>
> --
> Name : Jungtaek Lim
> Blog : http://medium.com/@heartsavior
> Twitter : http://twitter.com/heartsavior
> LinkedIn : http://www.linkedin.com/in/heartsavior




Re: [ANNOUNCE] Announcing Apache Spark 2.4.1

2019-04-05 Thread Jungtaek Lim
Thanks Andrew for reporting this. I just submitted the fix.
https://github.com/apache/spark/pull/24304

On Fri, Apr 5, 2019 at 3:21 PM Andrew Melo  wrote:

> Hello,
>
> I'm not sure if this is the proper place to report it, but the 2.4.1
> version of the config docs apparently didn't render right into HTML
> (scroll down to "Compression and Serialization")
>
> https://spark.apache.org/docs/2.4.1/configuration.html#available-properties
>
> By comparison, the 2.4.0 version of the docs renders correctly.
>
> Cheers
> Andrew
>
> On Fri, Apr 5, 2019 at 7:59 AM DB Tsai  wrote:
> >
> > +user list
> >
> > We are happy to announce the availability of Spark 2.4.1!
> >
> > Apache Spark 2.4.1 is a maintenance release, based on the branch-2.4
> > maintenance branch of Spark. We strongly recommend all 2.4.0 users to
> > upgrade to this stable release.
> >
> > In Apache Spark 2.4.1, Scala 2.12 support is GA, and it's no longer
> > experimental.
> > We will drop Scala 2.11 support in Spark 3.0, so please provide us
> feedback.
> >
> > To download Spark 2.4.1, head over to the download page:
> > http://spark.apache.org/downloads.html
> >
> > To view the release notes:
> > https://spark.apache.org/releases/spark-release-2-4-1.html
> >
> > One more thing: to add a little color to this release, it's the
> > largest RC ever (RC9)!
> > We tried to incorporate many critical fixes at the last minute, and
> > hope you all enjoy it.
> >
> > We would like to acknowledge all community members for contributing to
> > this release. This release would not have been possible without you.
> >
>
>

-- 
Name : Jungtaek Lim
Blog : http://medium.com/@heartsavior
Twitter : http://twitter.com/heartsavior
LinkedIn : http://www.linkedin.com/in/heartsavior


Re: [ANNOUNCE] Announcing Apache Spark 2.4.1

2019-04-05 Thread Andrew Melo
Hello,

I'm not sure if this is the proper place to report it, but the 2.4.1
version of the config docs apparently didn't render into HTML correctly
(scroll down to "Compression and Serialization")

https://spark.apache.org/docs/2.4.1/configuration.html#available-properties

By comparison, the 2.4.0 version of the docs renders correctly.

Cheers
Andrew

On Fri, Apr 5, 2019 at 7:59 AM DB Tsai  wrote:
>
> +user list
>
> We are happy to announce the availability of Spark 2.4.1!
>
> Apache Spark 2.4.1 is a maintenance release, based on the branch-2.4
> maintenance branch of Spark. We strongly recommend all 2.4.0 users to
> upgrade to this stable release.
>
> In Apache Spark 2.4.1, Scala 2.12 support is GA, and it's no longer
> experimental.
> We will drop Scala 2.11 support in Spark 3.0, so please provide us feedback.
>
> To download Spark 2.4.1, head over to the download page:
> http://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-2-4-1.html
>
> One more thing: to add a little color to this release, it's the
> largest RC ever (RC9)!
> We tried to incorporate many critical fixes at the last minute, and
> hope you all enjoy it.
>
> We would like to acknowledge all community members for contributing to
> this release. This release would not have been possible without you.
>




[ANNOUNCE] Announcing Apache Spark 2.4.1

2019-04-05 Thread DB Tsai
+user list

We are happy to announce the availability of Spark 2.4.1!

Apache Spark 2.4.1 is a maintenance release, based on the branch-2.4
maintenance branch of Spark. We strongly recommend that all 2.4.0 users
upgrade to this stable release.

In Apache Spark 2.4.1, Scala 2.12 support is GA, and it's no longer
experimental.
We will drop Scala 2.11 support in Spark 3.0, so please provide us feedback.

To download Spark 2.4.1, head over to the download page:
http://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-2-4-1.html

One more thing: to add a little color to this release, it's the
largest RC ever (RC9)!
We tried to incorporate many critical fixes at the last minute, and
hope you all enjoy it.

We would like to acknowledge all community members for contributing to
this release. This release would not have been possible without you.
